From 47% to 80%, Ctrip Hotel APP Fluency Improvement Practice

Author: Jin, senior R&D manager at Ctrip, focusing on mobile technology development; Dan, test development manager at Ctrip, focusing on data mining and the application of data in improving system quality; Lanbo, software technology expert at Ctrip, focusing on mobile technology development.

1. Background

APP performance improvement is a permanent theme for any R&D team. In our optimization practice, beyond the performance techniques themselves, two problems stand out. First, performance optimization is hard to sustain: after a round of optimization the gains are obvious, but as requirements iterate and code changes accumulate, it is difficult to keep performance at a good level. Second, there has been no scientific, quantitative method for measuring how much APP performance has actually improved.

As management guru Peter Drucker put it: if you can't measure it, you can't improve it. With this in mind, the Ctrip hotel front-end APP team thought deeply about the problem, hoping to continuously improve APP performance and user experience through quantification, governance, and monitoring.

2. Definition of Fluency Index

Fluency, in simple terms, measures the experience of using an APP: whether users can use it quickly and without obstacles. It covers three aspects: stability, speed, and quality. Stability means that when a user opens a page there is no white screen, crash, flicker, and so on. Speed means that the page opens quickly and interactions on it feel smooth and natural. Quality means that no unwarranted pop-up windows interrupt the user while browsing. As shown in the figure below:

Based on this foundation, white screens, crashes, slow loading, jank, flicker, and errors in the APP are all factors that make users perceive the experience as not smooth. We therefore proposed a quantitative fluency-rate index, and defined its denominator, the total sample count, as the sum of page PVs and the number of secondary loads users trigger on the page:

  • Sample size = page PV + number of secondary loads

The number of non-fluent factors is defined as the deduplicated count of PVs with slow page loading, page jank, or slow image/video loading, plus the count of abnormal events such as page crashes, scroll jank, image/video load failures, global pop-up errors, input losing focus, unresponsive button clicks, secondary-load failures, and slow secondary loads. The fluency rate is then defined as:

  • Fluency rate = (sample size - number of non-fluent factors) / sample size
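As a minimal sketch, the two formulas above can be expressed directly in code; the field names below are illustrative, not Ctrip's actual reporting schema:

```typescript
// Illustrative sketch of the sample-size and fluency-rate formulas above.
// Field names are hypothetical, not the actual reporting schema.
interface FluencySample {
  pageViews: number;      // page PV
  secondaryLoads: number; // secondary loads triggered on the page
  nonFluentCount: number; // deduplicated count of non-fluent factors
}

function fluencyRate(s: FluencySample): number {
  const sampleSize = s.pageViews + s.secondaryLoads;
  if (sampleSize === 0) return 1; // no samples: nothing was non-fluent
  return (sampleSize - s.nonFluentCount) / sampleSize;
}
```

For example, 900 PVs plus 100 secondary loads with 200 non-fluent events gives a fluency rate of 0.8, i.e. 80%.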

2.1 Page interactive loading time

The interactive loading time of a page is the page rendering time plus the request-response time of the network service, which can be expressed simply as:

  • Page interactive loading time (TTI) = page local rendering time + service network loading time

2.2 Principles of collecting interactive loading time of pages

Our core pages all contain Text controls, so the user's time-to-interactive can be determined by scanning the text in a specific area of the page. Our technology stack is generally divided into Flutter and Ctrip React Native (CRN). The principles of load-time collection for each are introduced below.

2.2.1 Principle of collecting interactive loading time of Flutter pages

In Flutter, the final UI tree is actually composed of independent Element nodes. The general process from UI creation to rendering is as follows:

An Element is generated from the Widget, then a corresponding RenderObject is created and attached to the Element.renderObject property, and finally layout and painting are completed through the RenderObject. As shown in the following figure:

Therefore, we can traverse the elements from the root node until a Text component with non-empty content is found inside the scan window, which means the page's TTI detection has succeeded. Flutter provides the following interface to support element traversal:

 void visitChildElements(ElementVisitor visitor)

2.2.2 Principle of collecting interactive loading time of Ctrip React Native pages

We know that React Native is ultimately rendered by native components. On iOS/Android, we can obtain the page's content by recursively searching the view tree for Text controls starting from the root View. Excluding the fixed static display areas at the top and bottom of the page, if more than one text is found, the page's TTI detection is considered successful.
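The scan just described can be sketched as a recursive walk over a simplified view tree. The node shape and field names here are hypothetical stand-ins, not actual iOS/Android view APIs:

```typescript
// Hypothetical sketch of the TTI text scan: recursively count non-empty
// Text nodes, skipping views inside the static header/footer regions.
interface ViewNode {
  type: string;       // e.g. "Text", "View", "Image"
  text?: string;      // text content, for Text nodes
  y: number;          // vertical position of the view on screen
  children: ViewNode[];
}

function countVisibleTexts(node: ViewNode, topEdge: number, bottomEdge: number): number {
  let count = 0;
  if (node.type === "Text" && node.text && node.text.trim() !== "" &&
      node.y > topEdge && node.y < bottomEdge) {
    count++;
  }
  for (const child of node.children) {
    count += countVisibleTexts(child, topEdge, bottomEdge);
  }
  return count;
}

// TTI is considered detected once more than one text lies in the scan window.
function ttiDetected(root: ViewNode, topEdge: number, bottomEdge: number): boolean {
  return countVisibleTexts(root, topEdge, bottomEdge) > 1;
}
```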

2.3 Rendering Jank and Frame Rate

Google defines jank as follows: UI rendering is the act of generating frames in an application and displaying them on screen. To ensure smooth interaction, the application should take no more than 16ms to render each frame, achieving a rate of 60 frames per second. If an application suffers from slow UI rendering, the system is forced to skip frames and the user perceives the application as not smooth. We call this situation jank.

2.3.1 Jank Standards

To judge whether an app is janky, start from whether it is a general application or a game, since different app types have different standards. For general applications you can refer to Google's Android Vitals performance indicators; for games, Tencent's PerfDog performance indicators. Because our app is a general application, we briefly introduce the jank definitions in Android Vitals.

Android Vitals divides jank into two categories:

The first category is slow rendering: when more than 50% of a page's frames take longer than 16ms to render, users feel obvious jank.

The second category is frozen frames: a frame whose drawing time exceeds 700ms, which is a severe jank problem.

Note also that there is no necessary relationship between FPS and jank. A high FPS does not imply smoothness: if the FPS is 50 but one frame occupies the first 200ms and the remaining 49 frames render in the last 800ms, the page still feels very janky. Conversely, a low FPS does not imply jank: an animation can average 15 FPS and still feel smooth.

2.3.2 Jank Quantification

After understanding the standards and principles of jank, we can conclude that only per-frame timing data can accurately determine whether jank occurred.

Flutter officially provides real-time frame monitoring through the SchedulerBinding.addTimingsCallback callback. Whenever a Flutter page draws and refreshes a view, the engine emits FrameTiming data with the following structure:

 vsyncStart,
 buildStart,
 buildFinish,
 rasterStart,
 rasterFinish

The vsyncStart variable indicates the start time of the current frame's drawing, buildStart/buildFinish bracket the WidgetTree build, and rasterStart/rasterFinish bracket on-screen rasterization. The total rendering time of a frame is obtained with the following formula:

 totalSpan => rasterFinish - vsyncStart

Mapping this to the Android Vitals standard: if a frame's totalSpan > 700ms, the frame is considered frozen, causing severe jank; if more than 30 frames within one second have totalSpan > 16ms, rendering is considered slow.
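A hedged sketch of this quantification, operating on per-frame timing spans. The input shape is illustrative; real code would derive totalSpan from Flutter's FrameTiming as above:

```typescript
// Sketch of the jank quantification: frozen frames and slow rendering.
// Times are in milliseconds; the input shape is illustrative.
interface FrameSpan {
  startMs: number; // vsync start of the frame
  totalMs: number; // rasterFinish - vsyncStart
}

const FROZEN_MS = 700;       // frozen-frame threshold (severe jank)
const SLOW_MS = 16;          // 60fps frame budget
const SLOW_FRAME_LIMIT = 30; // >30 slow frames within 1s => slow rendering

function frozenFrames(frames: FrameSpan[]): number {
  return frames.filter(f => f.totalMs > FROZEN_MS).length;
}

function hasSlowRendering(frames: FrameSpan[]): boolean {
  // Slide a 1-second window over the frames and count slow frames inside it.
  for (const anchor of frames) {
    const windowEnd = anchor.startMs + 1000;
    const slow = frames.filter(
      f => f.startMs >= anchor.startMs && f.startMs < windowEnd && f.totalMs > SLOW_MS
    ).length;
    if (slow > SLOW_FRAME_LIMIT) return true;
  }
  return false;
}
```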

3. Fluency Monitoring Solution

In the fluency monitoring system, each factor perceived as non-fluent is analyzed and mined individually, with the aim of maintaining or improving the existing user experience while iterating and optimizing.

Building the monitoring system involves mining the current situation and optimization directions, completing the data that the monitoring indicators depend on, multi-dimensional data monitoring, and indicator alerting.

In the early stage, we analyze the APP's current performance to explore optimization directions and obtain an initial estimate of the expected benefits and the number of affected users. For example, we expected preloading to reduce the slow-loading rate; by analyzing user operations in each scenario and the current client and server implementations (the size of the hotel main service's response, the pure front-end rendering time of hotel details, and so on), we determined the coverage and trigger timing needed for preloading to achieve good results.

Next, while the services and technical reforms for fluency optimization are launched, corresponding monitoring scenarios will be added to support data for quantitative measurement of fluency, laying a solid foundation for subsequent monitoring and early warning.

The core of the monitoring system is the deployment of large-scale, multi-dimensional monitoring. The aggregate data (shown in the figure below) gives a quick, macroscopic picture of the user booking experience and clarifies the progress of fluency improvements; data tables along various dimensions help us find improvement targets and monitor optimization effects.

In actual monitoring, different standards are designed for different indicators, such as slow loading, white screens, crashes, and jank. Besides the overall indicators, we also track each indicator's contribution ratio, the error-rate trend of the hotel main pages, version-comparison trends, the top distribution of affected device models, and so on.

For factors heavily tied to business scenarios, we monitor by bucketing business data: the number of room types on the detail page, the TTI time distribution, crash data for a single hotel, and so on. We also integrate with the A/B experiment system: fluency observation indicators are configured in the A/B system for business and technical changes, and their impact on fluency is compared as one criterion for whether an experiment passes.

For each indicator, alerts are issued on single fluctuations and on any increase or newly appearing error, so that nothing is missed. For example, for business errors on the order-filling page (bookability checks, order submission, and out-of-focus errors), besides monitoring the trend of each error rate, we incorporate actual user traffic to distinguish the traffic affected by each business error before alerting. Trigger counts are split along multiple dimensions (single user, single room type, etc.) to make it easier to find characteristic bad cases, quickly locate the problems users encounter, and uncover more business optimization points.

4. Fluency Management Practice

In managing APP fluency, we carried out many optimization practices in three areas: page startup and loading speed, long-list jank governance, and page-loading flicker. These optimizations involve neither sophisticated low-level engine techniques nor complex mathematical theory, and we did not reinvent the wheel. We insist on being data-oriented: using data to drive solutions, using data to verify solutions, discovering problems, proposing solutions, and solving problems.

4.1 Page loading speed optimization

In terms of page loading speed optimization, we have been conducting iterative optimization since August 2021. The slow loading rate of the hotel booking process page has been reduced from the initial value of 42.90% to the current 8.05%.

For page startup and loading speed, data pre-fetching is the usual approach: request the service data in advance on the previous page, and when the user jumps to the current page, read it directly from the cache. This saves the network transfer time and displays the page content quickly. Data preloading is currently used across the core hotel booking flow, as shown in the following figure:

Considering the characteristics of the hotel business, data preloading must weigh several aspects. First, the PV volume of hotel booking-flow pages is high; the hotel list and detail pages see tens of millions of PVs, so the timing of preloading must be chosen to avoid wasting service resources. Second, the hotel list, detail, and order-filling pages all show price information, which is dynamic for users and may change in real time, so the preload cache strategy must prevent user confusion caused by inconsistent prices.
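A minimal sketch of a preload cache with a short TTL, reflecting the price-freshness concern just described. The class and method names are hypothetical, not Ctrip's actual implementation:

```typescript
// Sketch: a prefetch cache whose entries expire quickly, so dynamic
// price data is never served stale. Names are illustrative.
interface CacheEntry<T> { value: T; expiresAt: number; }

class PrefetchCache<T> {
  private store = new Map<string, CacheEntry<T>>();
  constructor(private ttlMs: number) {}

  put(key: string, value: T, now: number = Date.now()): void {
    this.store.set(key, { value, expiresAt: now + this.ttlMs });
  }

  // Returns the prefetched value, or undefined if missing/expired;
  // the caller then falls back to a normal network request.
  get(key: string, now: number = Date.now()): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (now > entry.expiresAt) {
      this.store.delete(key); // stale price data must not be shown
      return undefined;
    }
    return entry.value;
  }
}
```

A short TTL trades a few extra network requests for the guarantee that a user never sees an out-of-date price from the cache.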

4.2 Flutter service channel optimization

Service launching with Ctrip APP's private service protocol still lives in Native code, while the core hotel pages have moved to Flutter. With the channel technology provided by the Flutter framework, the Native-to-Flutter data channel requires extra serialization and deserialization of the data; the transfer is time-consuming and blocks the UI rendering main thread, significantly affecting page loading. After detecting this bottleneck, we worked with the company's framework team to modify Flutter's underlying framework so that the data stream is passed through directly without blocking the UI thread, greatly improving performance.

Before optimization, passing the data stream returned by the service to Flutter went through four steps:

  • PB deserialization
  • Encoding of Response to JsonString
  • JsonString to Flutter channel transmission
  • Decoding JsonString to Response

The entire process has a long link, large data transmission volume, and low efficiency, which affects the page loading performance, as shown in the following figure:

After the transformation, the data stream returned by the service is transmitted directly to the Flutter side, where PB is deserialized, greatly improving transmission performance. The process is reduced to two steps:

  • PB data stream Flutter channel transmission
  • PB deserializes to Response

The whole process has a short link, small data transmission volume and high efficiency, as shown in the following figure:
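The before/after paths above can be contrasted in a toy sketch. The encode/decode helpers below are placeholders to show the shape of each pipeline, not Ctrip's actual protocol code:

```typescript
// Toy contrast of the two channel paths. decodePb is a placeholder;
// real code would use a protobuf library on each side.
function decodePb(bytes: Uint8Array): object {
  return JSON.parse(new TextDecoder().decode(bytes));
}

function oldPath(pbBytes: Uint8Array): object {
  const response = decodePb(pbBytes);    // 1. PB deserialization (Native side)
  const json = JSON.stringify(response); // 2. Response -> JsonString
  const transmitted = json;              // 3. JsonString across the channel
  return JSON.parse(transmitted);        // 4. JsonString -> Response (Flutter side)
}

function newPath(pbBytes: Uint8Array): object {
  const transmitted = pbBytes;           // 1. raw PB stream across the channel
  return decodePb(transmitted);          // 2. PB -> Response (Flutter side only)
}
```

Both paths yield the same Response, but the new path drops one serialization and one deserialization round trip and transmits the smaller binary stream.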

4.3 Analyzing and locating jank issues

In Flutter, you can use the Performance Overlay to analyze rendering jank; when the UI janks, it helps locate the cause. As shown in the following figure:

The upper chart shows the drawing performance of the GPU thread, and the lower chart shows that of the CPU (UI) thread. Blue vertical bars represent rendered frames, and the green vertical bar represents the current frame.

To maintain a 60Hz refresh rate, each frame should take less than 16ms (1/60 second). If one frame takes too long to process, the interface will freeze and a red vertical bar will appear in the chart. The following figure shows how the performance layer looks when the application takes time to render and draw:

If a red vertical bar appears in the GPU thread's chart, the graphics being rendered are too complex to draw quickly; if it appears in the UI thread's chart, the Dart code is consuming too much CPU and its execution time needs to be optimized.

In addition, the Flutter Performance tool in Android Studio can be used to inspect a Flutter page's rendering performance. Its very useful Widget rebuild stats function counts how many times each widget is rebuilt while rendering the UI, helping us quickly locate problematic widgets, as shown in the following figure:

UI CPU thread problem location

A UI thread problem is essentially an application performance bottleneck: for example, complex operations in a widget's build method, or time-consuming synchronous operations (such as I/O) executed in the root Isolate. These significantly increase CPU time and cause jank.

We can use the Performance tool provided by Flutter to record the execution trajectory of the application. Performance is a powerful performance analysis tool that can display the CPU call stack and execution time in a timeline format to check suspicious method calls in the code. After clicking the "Open DevTools" button in the Flutter Performance toolbar, the system will automatically open the Dart DevTools webpage, and we can start analyzing performance issues in the code.

GPU Problem Location

GPU problems mainly concern time-consuming low-level rendering. Sometimes a widget tree is easy to build but very expensive for the GPU thread to render. Multi-view overlay rendering, such as widget clipping and masking, or repeatedly drawing static images because of missing caches, significantly slows GPU rendering. The performance overlay provides two switches for this: checkerboardOffscreenLayers, which checks multi-view overlay rendering, and checkerboardRasterCacheImages, which checks image caching.

checkerboardOffscreenLayers

Multi-view overlay usually relies on Canvas's saveLayer method. It is very useful for certain effects (such as translucency), but because its underlying implementation involves repeatedly drawing multiple layers on the GPU, it carries a larger performance cost. To check saveLayer usage, set the checkerboardOffscreenLayers switch to true in MaterialApp's initialization; the analysis tool then automatically detects multi-view overlays, and widgets using saveLayer are displayed in a checkerboard pattern that flashes as the page refreshes.

However, saveLayer is a relatively low-level drawing method, so we generally do not call it directly; it is used indirectly through functional widgets in scenes involving clipping or semi-transparent masks. When you encounter this situation, consider whether the effect is really necessary and whether it can be achieved another way. As shown in the figure below, the detail page's header bar uses Gaussian blur, and ClipRRect is used to clip the rounded corners; ClipRRect calls the saveLayer interface, so this area flickers.

checkerboardRasterCacheImages

From a resource perspective, another performance-heavy operation is rendering images, because it involves I/O, GPU storage, and data-format conversion between different channels, making the rendering pipeline expensive to construct.

To relieve the pressure on the GPU, Flutter provides multi-level cache snapshots so that the static images do not need to be redrawn when the Widget is rebuilt. Similar to the checkerboardOffscreenLayers parameter for checking multi-view overlay rendering, Flutter also provides a switch for checking cache images, checkerboardRasterCacheImages, to detect images that frequently flash when the interface is redrawn (that is, there is no static cache).

Images that need static caching can be wrapped in a RepaintBoundary, which establishes a repaint boundary in the widget tree. If the image is complex enough, the Flutter engine automatically caches it to avoid repeated redraws. Of course, because cache resources are limited, the engine may ignore the RepaintBoundary if it judges the image not complex enough.

4.4 Optimization of Ctrip React Native (CRN) page

The following figure shows the basic CRN page loading process. The optimization of each stage, such as container preloading, bundle splitting, container reuse, and framework preloading, has been described in previous articles; these are all container-level optimizations.

Take the hotel order-filling page as an example. It uses the CRN architecture, and after the various container-level and framework-level optimizations, we focused on governing redraws within the page, pushing redraw governance as far as it would go, mainly involving steps "5. First rendering of the first screen" and "7. Second rendering of the first screen" in the figure above.

4.4.1 Action integration within the page

This page uses the Redux architecture. After several years of extensive development, it had accumulated many actions. (An Action triggers changes in state management through asynchronous events to cause a page redraw; see Redux's Action-Reducer-Store pattern.)

Before optimization, as shown in the figure below, page initialization, load start, loading, and load completion each triggered multiple actions. Since actions are asynchronous and every data-processing module is itself asynchronous and time-consuming, the page might already have refreshed by the time loading completed, briefly displaying unprocessed data; as subsequent actions executed, the page refreshed again.

Because the data keeps changing, elements on the page may shift, which users perceive as page jitter. It also increases the JS<=>Native communication volume, and the continuously changing elements keep refreshing the native render tree, consuming a lot of CPU time and making the page slow and janky.

In response to the above situation, we have integrated the Actions within the page:

  • Avoid using actions for static data
  • Try to merge actions with the same triggering timing
  • Lazy loading of non-essential data
  • Integrate updates of multiple layers of actions

After integration, the page's actions are reduced to roughly three kinds: page initialization, main service return, and subsequent sub-service actions.
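A toy sketch of the merging rules above: instead of dispatching several actions that each trigger a refresh, the state changes for one trigger point are batched into a single action. The action and state shapes are illustrative, not the page's actual Redux code:

```typescript
// Sketch: merging actions with the same triggering timing so one trigger
// point causes one refresh instead of several. Shapes are illustrative.
interface PageState { header?: string; rooms?: string[]; promos?: string[]; }
interface Action { type: string; payload: Partial<PageState>; }

let renderCount = 0;
let state: PageState = {};

function dispatch(action: Action): void {
  state = { ...state, ...action.payload }; // reducer: shallow merge
  renderCount++;                           // in this model, each dispatch = one refresh
}

// Before: "main service returned" dispatched header, rooms, and promos
// actions separately => three refreshes. After: one merged action => one.
function onMainServiceReturnMerged(header: string, rooms: string[], promos: string[]): void {
  dispatch({ type: "MAIN_SERVICE_RETURNED", payload: { header, rooms, promos } });
}
```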

In this process, we use redux-logger to monitor actions, and MessageQueue to monitor the refreshes triggered by action changes, as shown below:

4.4.2 Control redraw management

To better control how often controls are redrawn, we split components along the following lines:

  • Disassemble the components as much as possible
  • Reduce the complexity of a single file
  • Component reuse is more convenient
  • Less dependent data, better state management
  • Local update data does not affect other components
  • Use Fragments to avoid multi-layer nesting

After the split, component granularity is smaller. Weakly business-related components use PureComponent; strongly business-related components use Component plus shouldComponentUpdate with their own comparison of property changes, avoiding unnecessary redraws.
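The shouldComponentUpdate pattern above can be sketched as a shallow comparison of the props a component actually depends on. The helper below is a generic sketch, not the team's actual code:

```typescript
// Sketch: re-render only when a directly depended-on prop actually changed.
type Props = Record<string, unknown>;

function shallowEqual(a: Props, b: Props): boolean {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(k => a[k] === b[k]);
}

// Usage inside a strong-business component (sketch):
//   shouldComponentUpdate(nextProps) { return !shallowEqual(this.props, nextProps); }
function shouldComponentUpdate(props: Props, nextProps: Props): boolean {
  return !shallowEqual(props, nextProps);
}
```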

Through the above governance, the order-filling page is noticeably lighter on entry; once the main service returns, the page refreshes immediately, and rendering speed is greatly improved.

For redraw governance, we use https://github.com/welldone-software/why-did-you-render to detect why a component re-rendered, as shown below:

5. Planning and Summary

In the entire APP fluency management, the fluency rate has increased from the initial 47% to the current 80%, the slow page loading rate has been reduced from the original 45% to the current 8%, the white screen rate has been reduced from 1.9% to the current 0.3%, and the flickering of the main process page controls has been basically eliminated. The APP performance and user experience have been significantly improved.

Looking back at the past six months of fluency practice on the Ctrip hotel APP, the process was difficult and accompanied by anxiety; no improvement in fluency came overnight or easily. But for the whole team it was a great harvest. During the practice, we made an overall upgrade to the Flutter engineering architecture, especially the transformation of the data transmission layer and the consolidation of business-layer logic, and the data preloading solution was upgraded from version 1.0 to 2.0. Most importantly, the whole team formed a data-quantification mindset, optimizing and solving problems from the user's perspective.

Currently, version 2.0 of the fluency index is already in practice. Version 2.0 adds more non-fluency factors to the statistics, such as secondary loading of the main service, slow map loading, slow image and video loading, image and video load failures, and pop-ups and prompt messages, to improve the user's booking experience at more system and business levels.
