Dewu App Android Cold Start Optimization - Application

Preface

Cold start time is one of the most important metrics of app experience, and in e-commerce apps it has a significant impact on users' willingness to stay. It is usually defined as the time from app process start to the first frame of the homepage, but from the user's point of view it should span from the moment the user taps the app icon to the moment the homepage content is fully displayed.

Splitting the startup work into tasks and organizing them into a directed acyclic graph is by now the standard startup framework for componentized apps. However, given the performance limits of mobile devices, careless use of high concurrency frequently causes time-consuming problems such as lock contention and blocking disk IO. How to go a step further, squeeze the most out of the limited resources in the limited startup window, and minimize main-thread time while keeping business features stable, is the topic of this article.

This article introduces how, through unified management of system resources in the startup phase plus on-demand allocation and staggered loading, we reduced the Dewu App's online startup indicator by 10% and its offline indicator by 34%, lifting it into the top 3 among comparable e-commerce apps.

1. Indicator selection

Traditional performance monitoring usually starts timing at the Application's attachBaseContext callback and stops at the homepage's decorView.postDraw task. However, this leaves out the time spent loading dex files and initializing ContentProviders.

Therefore, to get closer to the user's real experience, we added an offline perceived-time indicator on top of the startup speed monitoring indicator. By analyzing a screen recording frame by frame, we take the frame where the icon's click animation starts playing (the icon dims) as the start frame and the first frame where homepage content appears as the end frame; the difference between the two is the startup time.

Example: the recording shows the startup spanning from 03.00s to 03.88s, so the startup time is 880ms.
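This offline indicator is simple arithmetic over frame positions. A minimal sketch, assuming a 100 fps screen recording (the frame rate and frame numbers below are illustrative, not from the original measurement):

```java
// Sketch: computing the offline perceived startup time from a screen recording.
// The 100 fps recording rate and the frame numbers are illustrative assumptions.
public class StartupTimer {
    /** Milliseconds between the first frame of the icon click animation
     *  and the first frame where homepage content appears. */
    public static long durationMs(int startFrame, int endFrame, int fps) {
        return Math.round((endFrame - startFrame) * 1000.0 / fps);
    }
}
```

At 100 fps, frames 300 to 388 give the 880 ms of the example above.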

2. Application Optimization

Apps may land on different homepages (community/transaction/H5) in different business scenarios, but the Application phase runs essentially the same code and rarely changes, so Application optimization is our first target.

The startup framework tasks of the Dewu App have been through multiple rounds of optimization in recent years. The conventional routine of capturing a trace, finding the time-consuming spots and moving them off the main thread no longer brings obvious gains; optimization opportunities now have to be explored from the angle of lock contention and CPU utilization. Work of this kind may not show obvious short-term benefits, but in the long run it heads off many regressions in advance.

1. WebView Optimization

The first time an app calls the webview constructor, the system kicks off its webview initialization process, which usually takes 200+ms. The usual remedy for such a time-consuming task is to push it onto a child thread, but the Chromium kernel performs extensive thread checks, so a webview can only be used on the thread that constructed it.

To speed up H5 page startup, apps usually initialize and cache a webview during the Application phase. However, webview initialization involves cross-process interaction and file reads, so any shortage of CPU time slices, disk bandwidth or binder threads inflates its cost, and the task-heavy Application phase runs into such shortages easily.

Therefore, we split webview initialization into three steps and distribute them across different stages of startup. This reduces the cost inflation caused by resource contention and greatly lowers the probability of ANR.

1.1 Task Splitting

a. Provider preloading

WebViewFactoryProvider is an interface class used to interact with the webview rendering process. The first step of webview initialization is to load the system webview apk file, build a classloader and reflectively create a static instance of WebViewFactoryProvider. This operation does not involve thread checking, so we can directly hand it over to the child thread for execution.

b. Initialize the webview rendering process

This step corresponds to the WebViewChromiumAwInit.ensureChromiumStartedLocked() method in the Chromium kernel. It is the most time-consuming part of webview initialization, but it normally runs back to back with the third step. Reading the code shows that, among the interfaces WebViewFactoryProvider exposes to the application, the getStatics method happens to trigger ensureChromiumStartedLocked.

At this point, we can achieve the purpose of initializing only the webview rendering process by executing WebSettings.getDefaultUserAgent().

c. Construct webview

That is, calling new WebView().

1.2 Task Allocation

In order to minimize the main thread time, our tasks are arranged as follows:

a. Provider preloading has no pre-dependencies and no thread restriction, so it is executed asynchronously at the earliest point of the Application phase.

b. Initializing the webview rendering process must happen on the main thread, so we schedule it after the homepage's first frame has been drawn.

c. Constructing the webview must also happen on the main thread; we post it to the main thread once step two completes. This keeps the two steps out of the same message, reducing the chance of ANR.
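The scheduling shape of the three steps can be sketched as below. The real Android calls (the provider classloader preload, WebSettings.getDefaultUserAgent(), new WebView()) are replaced with log entries, and a plain queue stands in for the main thread's message queue, so the sketch itself is runnable:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Runnable sketch of the three-step WebView warm-up schedule. Log entries stand
// in for the real Android calls; mainQueue stands in for the main MessageQueue.
public class WebViewWarmup {
    static final List<String> log = Collections.synchronizedList(new ArrayList<>());
    static final BlockingQueue<Runnable> mainQueue = new LinkedBlockingQueue<>();

    public static String run() {
        // Step a: the provider preload has no thread check, so it runs async
        // at the earliest point of the Application phase.
        Thread preload = new Thread(() -> log.add("a:preloadProvider"));
        preload.start();
        // Steps b and c must run on the main thread after the homepage's first
        // frame, and in *separate* messages to keep each one under ANR limits.
        mainQueue.add(() -> {
            log.add("b:startChromium");                   // ~ WebSettings.getDefaultUserAgent()
            mainQueue.add(() -> log.add("c:newWebView")); // posted as a fresh message
        });
        try { preload.join(); } catch (InterruptedException ignored) { }
        log.add("firstFrame"); // homepage first frame committed
        Runnable msg;
        while ((msg = mainQueue.poll()) != null) msg.run(); // drain the fake main queue
        return String.join(",", log);
    }
}
```

The essential points are that step a runs off the main thread, while steps b and c arrive at the main thread as two separate messages after the first frame.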

1.3 Summary

Although webview initialization is now split into three parts, step two, the longest, may still hit the ANR threshold on low-end devices or in extreme cases. We therefore added some safeguards: each device records the time a complete webview initialization takes, and the segmented execution above is enabled only when that time is below a threshold delivered by remote configuration.

If the app is opened through channels such as push or ad delivery, the landing page is most likely an H5 marketing page, so such scenarios are not suitable for the segmented loading above. We therefore hook the main thread's MessageQueue, parse the intent of the launch page, and decide accordingly.

Due to the constraints of the splash-screen ad feature, this optimization is currently enabled only for launches without a splash ad. We plan to use the idle gap during the ad countdown to run step two, so as to also cover launches with splash ads.

2. ARouter Optimization

In the current era of popular componentization, routing components have become almost essential basic components for all large Android apps. Currently, Dewu uses the open source ARouter framework.

By design, ARouter uses the first level of the registration path in the annotation (e.g. trade in "/trade/homePage") as the route's group. Route entries of the same group are merged into one generated class and registered synchronously. In a large project, the group of a complex business line may contain hundreds of entries, and executing its registration logic takes a long time; the business line with the most routes took 150+ms to register.

Route registration itself is lazy: the routes of a group are registered when the first route of that group is used. However, ARouter also uses an SPI (service discovery) mechanism to let business components expose interfaces, so that business capabilities can be called without depending on the business component. When developing such services, developers habitually give them a route path under the component they belong to, so the first construction of one of these services triggers the loading of every route in the same group.

In the Application stage, some interfaces in the services of the business module will definitely be needed, which will trigger the routing registration operation in advance. Although this operation can be performed in an asynchronous thread, most of the work in the Application stage requires access to these services. Therefore, when the time taken to construct these services for the first time increases, the overall startup time is bound to increase accordingly.

2.1 ARouter Service Routing Separation

ARouter adopted the SPI design for the sake of decoupling, and a Service's role should be nothing more than providing an interface. So we add a new Service with an empty implementation whose only job is to trigger route loading, while the original Service is moved to a different group and from then on only provides the interface. Other Application-stage tasks then no longer have to wait for route loading to complete.
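The separation can be sketched as follows. The @Route annotation here is a local stand-in for ARouter's, and all class and path names are hypothetical:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Sketch of the service/route separation. @Route is a local stand-in for
// ARouter's annotation; the class and path names are hypothetical.
public class RouteSeparation {
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Route { String path(); }

    public interface ITradeService { /* interface exposed to other modules */ }

    // After separation, the real service lives in its own small group, so
    // constructing it no longer registers the whole /trade business group.
    @Route(path = "/tradeService/provider")
    public static class TradeServiceImpl implements ITradeService { }

    // An empty-implementation service keeps a path under /trade purely so that
    // using it triggers the lazy registration of the /trade group when wanted.
    @Route(path = "/trade/routeTrigger")
    public static class TradeRouteTrigger implements ITradeService { }

    /** ARouter derives the group from the first path segment. */
    public static String groupOf(Class<?> c) {
        return c.getAnnotation(Route.class).path().split("/")[1];
    }
}
```

Constructing TradeServiceImpl now only touches the small tradeService group, while the heavy /trade routes register only when the trigger service (or an actual /trade page) is used.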

2.2 ARouter supports concurrent loading of routes

After implementing route separation, we found that the total time consumed by loading existing hotspot routes is greater than the time consumed by the Application. In order to ensure that the route loading is completed before entering the splash screen page, the main thread has to sleep and wait for the route loading to be completed.

Analysis shows that ARouter's route loading method takes a class-level lock because it must load routes into the maps of the warehouse class, and those maps are thread-unsafe HashMaps. As a result, all route loading actually executes serially under lock contention, which is why the cumulative cost ends up exceeding the Application phase itself.

The trace shows the cost mainly comes from the frequent loadInto calls made while loading routes. Examining what the lock protects, the class-level lock mainly exists to keep the map operations in the Warehouse class thread safe.

picture

Therefore, we can downgrade the class-level lock to a lock on the GroupMeta class object (generated by ARouter's apt, corresponding to the ARouter$$Provider$$xxx classes in the apk) to keep route loading thread safe, and replace the warehouse maps with ConcurrentHashMap to resolve the earlier map thread-safety issues. A few thread-safety issues remain under extreme concurrency, and those can be solved by adding a null check as shown in the figure.
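The same effect can also be sketched with ConcurrentHashMap alone: computeIfAbsent serializes loads of the same group while letting different groups proceed in parallel. RouteWarehouse and loadGroup below only loosely follow ARouter and are assumptions, not its real code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the lock downgrade. Instead of one class-level lock around every
// loadInto call, the warehouse maps become ConcurrentHashMap and loads are
// serialized per group key, so different groups load in parallel.
public class RouteWarehouse {
    public static final Map<String, Map<String, String>> routes = new ConcurrentHashMap<>();

    public static void loadGroup(String group, Map<String, String> generatedMeta) {
        // computeIfAbsent gives per-key atomicity: two threads loading the same
        // group race once, while threads loading different groups never block
        // each other the way the old class-level lock forced them to.
        routes.computeIfAbsent(group, g -> {
            Map<String, String> table = new ConcurrentHashMap<>();
            table.putAll(generatedMeta); // ~ GroupMeta.loadInto(table)
            return table;
        });
    }
}
```

The per-key lock scope is what matters: contention now only occurs between threads loading the very same group, which matches the downgrade from the class lock to the GroupMeta object lock.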

(Figure: null check added before loadInto to handle extreme concurrency)

At this point routes can be loaded concurrently. We then group the services to be preloaded sensibly, balancing the load across groups (the bucket effect: the slowest group determines the total), and run the groups concurrently in coroutines so the overall time is the shortest.
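The grouping step is a small load-balancing problem. One hedged sketch is a greedy longest-task-first assignment; the task names and costs used below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: grouping preload tasks across workers so the slowest bucket finishes
// as early as possible (the bucket effect). Greedy longest-task-first
// assignment; the costs are made-up measurements.
public class PreloadPlanner {
    /** Assigns (name, costMs) tasks to `workers` buckets, heaviest first into
     *  the currently lightest bucket; returns the per-bucket total cost. */
    public static long[] plan(Map<String, Long> costsMs, int workers) {
        List<Map.Entry<String, Long>> tasks = new ArrayList<>(costsMs.entrySet());
        tasks.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        long[] load = new long[workers];
        for (Map.Entry<String, Long> t : tasks) {
            int min = 0;
            for (int i = 1; i < workers; i++) {
                if (load[i] < load[min]) min = i;
            }
            load[min] += t.getValue();
        }
        return load;
    }
}
```

Greedy LPT is not optimal in general, but for a handful of preload tasks it keeps the heaviest bucket close to the theoretical minimum.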

3. Lock optimization

The tasks performed in the Application stage are mostly the initialization of the basic SDK. The running logic is usually relatively independent, but there will be dependencies between SDKs (for example, the tracking library will depend on the network library), and most of them will involve operations such as reading files and loading so libraries. In order to reduce the time consumption of the main thread, the Application stage will put time-consuming operations into sub-threads as much as possible to run concurrently and make full use of the CPU time slice, but this will inevitably lead to some lock contention problems.

3.1 Load so lock

The System.loadLibrary() method loads a so library from the current apk. It synchronizes on the Runtime object, which is effectively a class-level lock.

The basic SDK is usually designed to write the load so operation into the static code block of the class to ensure that the so library is ready before the SDK initialization code is executed. If this basic SDK happens to be a basic library such as a network library, it will be called by many other SDKs, and multiple threads will compete for the lock at the same time. In the worst case, when IO resources are tight, reading so files becomes slow, and the main thread is the last one in the lock waiting queue, the startup time will be much longer than expected.

To this end, we unify the loadLibrary operations and converge them onto one thread, forcing them to run serially and avoiding the situation above. It is worth mentioning that the so files inside webview.apk are also loaded during the webview provider preload, so the preloadProvider operation must be placed on this thread as well.

Loading a so triggers the JNI_OnLoad method in the native layer, where some libraries perform initialization work. We therefore cannot simply call System.loadLibrary() ourselves to load a so, or the repeated initialization may cause problems.

We finally adopted the class-loading approach: move each of these loadLibrary calls into the static block of a dedicated class, then trigger the loading of those classes. The class-loading mechanism guarantees each so is loaded exactly once. The order in which the classes are loaded must also follow the order in which the sos are used.
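A minimal sketch of the holder-class trick, with a counter standing in for the real System.loadLibrary call so the sketch is runnable:

```java
// Sketch of the class-loading trick: each native library gets a holder class
// whose static initializer performs the load; the JVM/ART guarantees a static
// initializer runs exactly once, so repeated triggers cannot re-run JNI_OnLoad.
// The counter replaces System.loadLibrary("net"); names are hypothetical.
public class SoLoader {
    public static int netLoadCount = 0;

    static class NetSoHolder {
        static {
            // System.loadLibrary("net"); // real code; stubbed for the sketch
            netLoadCount++;
        }
    }

    /** Called from the dedicated so-loading thread, in dependency order. */
    public static void ensureNetLoaded() {
        try {
            Class.forName(NetSoHolder.class.getName()); // triggers <clinit> once
        } catch (ClassNotFoundException ignored) { }
    }
}
```

The static initializer is the single entry point: however many SDKs race to ensure the library, the load body runs exactly once.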

In addition, it is not advisable to run the so-loading task concurrently with other tasks that need IO resources; in Dewu App measurements, the task's duration differs greatly between the two cases.

4. Start framework optimization

Currently, the common startup framework design is to distribute the work in the startup phase to a group of task nodes, and then construct a directed acyclic graph based on the dependencies of these task nodes. However, with the iteration of business, some historical task dependencies are no longer necessary, but they will slow down the overall startup speed.

Most of the work in the startup phase is the initialization of the basic SDK, which often has complex dependencies. When we optimize the startup, in order to reduce the time consumption of the main thread, we usually find the time-consuming tasks of the main thread and throw them to the child thread for execution. However, in the Application stage with complex dependencies, if we just throw it to asynchronous execution, it may not have the expected benefits.

After optimizing webview, we found that the startup time did not directly reduce the time of webview initialization as expected, but was only about half of the expected time. After analysis, we found that our main thread task depends on the child thread task, so when the child thread task is not completed, the main thread will sleep and wait.

Moreover, webview was being initialized at that point not because of dependency constraints, but because the main thread happened to have a long sleep there that could be exploited. The asynchronous workload, however, is much larger than the main thread's: even with seven child threads running concurrently, their total time exceeds that of the main thread's tasks.

Therefore, if you want to further expand the benefits, you have to optimize the task dependencies in the startup framework.

(Figure: directed acyclic graph of startup tasks before optimization; red boxes mark main-thread tasks)

(Figure: the task graph after optimization)

The first picture above is a directed acyclic graph of tasks in the startup phase of the Dewu App before optimization. The red box indicates that the task is executed in the main thread. We focus on tasks that block the execution of the main thread tasks.

It can be observed that several tasks on the main-thread tasks' dependency chain have many outgoing and incoming edges. Many outgoing edges usually mean the task is a very important basic library (such as the network library in the figure); many incoming edges mean it has too many pre-dependencies, so the moment it starts executing fluctuates widely. Taken together, the moment such a task finishes is very unstable, and it directly gates the subsequent main-thread tasks.

The main ideas for optimizing this type of task are:

Split the task itself and separate out the operations that can run earlier or later. Before moving them, though, check whether the target time slot has spare capacity and whether the move would worsen IO contention.

Optimize the task's predecessor and make the task finish as early as possible, which can reduce the time that subsequent tasks spend waiting for the task.

Remove unnecessary dependencies. For example, the initialization of the tracking library only needs to register a listener to the network library, not initiate a network request. (Recommended)

We can see that in the second directed acyclic graph after optimization, the dependency levels of tasks are significantly reduced, and tasks with a large number of entries and exits basically no longer appear.

(Figure: startup traces before and after the optimization)

Comparing the traces before and after also shows that sub-thread task concurrency is clearly improved. Higher concurrency is not always better, though: on low-end devices where CPU time slices are scarce, higher concurrency can perform worse, because lock contention, IO waits and similar problems become more likely. So leave some headroom, run thorough performance tests on mid- and low-range devices before going live, or use different task schedules for high-, mid- and low-end devices.

3. Homepage Optimization

1. Layout inflation optimization

The system inflates a layout by reading the layout XML file in the inflate method, parsing it and constructing the view tree. This involves IO and is easily affected by device state. Instead, we can use apt to parse the layout files at compile time and generate corresponding view-construction classes, then execute those methods asynchronously ahead of time to build and assemble the view tree, which directly removes the inflation cost from the page.

2. Message scheduling optimization

During the startup phase, we usually register some ActivityLifecycleListeners to monitor the page life cycle, or post some delayed tasks to the main thread. If there are time-consuming operations in these tasks, it will affect the startup speed. Therefore, we can hook the message queue of the main thread and move the page life cycle callback and page drawing related msg to the head of the message queue, so as to speed up the display of the first frame of the home page.
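The real change hooks the main Looper's MessageQueue; as a runnable stand-in, the reordering itself can be sketched over a plain list of message names. Which prefixes count as first-frame critical is an assumption for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the message-scheduling idea: lifecycle- and draw-related messages
// jump to the head of the queue while keeping their relative order, so the
// homepage's first frame is not delayed by earlier posted tasks.
public class MsgReorder {
    public static List<String> promoteCritical(List<String> queue) {
        List<String> critical = new ArrayList<>();
        List<String> rest = new ArrayList<>();
        for (String msg : queue) {
            if (msg.startsWith("lifecycle:") || msg.startsWith("draw:")) critical.add(msg);
            else rest.add(msg);
        }
        critical.addAll(rest); // critical messages keep their order but skip the line
        return critical;
    }
}
```

In the real hook the same partition is applied to Message objects identified by their target Handler and what, rather than by string prefixes.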

Please look forward to the subsequent content of this series for details.

4. Stability

Performance optimization is only icing on the cake for an app; stability is the lifeline. All of these changes take effect at the very beginning of the Application, where stability risk is highest, so optimization must proceed with crash protection already in place. Even when stability problems are unavoidable, their negative impact must be minimized.

1. Crash protection

Since the startup phase initializes important basic libraries, recognizing and swallowing an exception when a crash occurs is of little use; doing so would likely just lead to later crashes or broken features. Our protection work therefore focuses on stopping the bleeding after a problem occurs.

A configuration-center SDK is usually designed to read the cached configuration from a local file and refresh it after a successful network request. If a crash occurs during startup on a path enabled by an already-cached configuration, the app never gets the chance to pull the corrected configuration. The user's only way out is clearing the app's data or reinstalling, which causes very serious user loss.

  • Crash fallback

Add try-catch protection around every changed code path. On catching an exception, report a tracking event and write a crash flag to MMKV so that this device no longer enables the startup-optimization changes in the current version, then rethrow the original exception and let the crash proceed. For native crashes, do the same in the native-crash callback of the crash-monitoring SDK.
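The pattern can be sketched as below; a static field stands in for the MMKV flag, and the tracking call is left as a hypothetical comment:

```java
// Sketch of the crash-fallback pattern: catch, record a "disable optimization"
// flag (MMKV in the real app; a static field here so the sketch is runnable),
// then rethrow so the crash still surfaces to monitoring.
public class CrashGuard {
    public static boolean optimizationDisabled = false; // ~ per-version MMKV flag

    public static void runGuarded(Runnable change) {
        try {
            change.run();
        } catch (Throwable t) {
            optimizationDisabled = true; // next launch skips the startup changes
            // reportCrashEvent(t);      // hypothetical tracking call
            throw t;                     // rethrow: do not swallow the crash
        }
    }
}
```

Rethrowing matters: swallowing the exception would hide the problem while leaving the app in a half-initialized state.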

  • Running status detection

Java crashes can be captured by registering an UncaughtExceptionHandler, but native crashes require the crash-monitoring SDK, and that SDK may not be initialized at the earliest moments of startup. For example, the webview provider preload and the so-library preload both run earlier than crash monitoring, and both touch native-layer code.

To cover this risk window, we write an MMKV marker at the start of the Application and flip it to another state at the end. Code that runs earlier than the configuration center can read this marker to tell whether the previous run ended normally. If the previous launch died of an unknown crash (for example, a native crash before crash monitoring was initialized), the marker lets us switch the startup-optimization changes off in time.

Combined with the automatic restart operation after the crash, the crash is actually not noticeable from the user's perspective, but the startup time will feel about 1-2 times longer than usual.
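A minimal sketch of the marker state machine, with an in-memory map standing in for MMKV:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the running-status marker. STARTING is written at the very start of
// Application and FINISHED once startup completes; a marker still at STARTING
// on the next launch means the previous run died before crash monitoring could
// record anything. The in-memory map stands in for MMKV.
public class StartupMarker {
    static final Map<String, String> mmkv = new HashMap<>();

    /** Returns false when the previous run died mid-startup, in which case the
     *  startup-optimization changes should be disabled for this run. */
    public static boolean beginStartup() {
        boolean lastRunOk = !"STARTING".equals(mmkv.get("startup_state"));
        mmkv.put("startup_state", "STARTING");
        return lastRunOk;
    }

    public static void finishStartup() {
        mmkv.put("startup_state", "FINISHED");
    }
}
```

On a real device the marker lives in MMKV so it survives the automatic process restart that follows the crash.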

  • Configuration validity period

Online technical changes are usually ramped up gradually via a configured sampling rate combined with random numbers. But configuration SDKs default to the last locally cached value, so when an online crash or other failure occurs, even if the configuration is rolled back promptly, the cache means affected users still hit at least one more crash.

To this end, we can add a matching expiration timestamp to each switch configuration, limiting the current volume switch to take effect only before this timestamp. This ensures that the bleeding can be stopped in time when encountering online crashes and other failures. The timestamp design can also avoid crashes caused by the lag in the effectiveness of online configurations.
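The validity window can be sketched as a switch paired with an expiry timestamp; the field names are assumptions about the config payload:

```java
// Sketch of the configuration validity window: a ramp-up switch carries an
// expiry timestamp, and a cached copy is honored only before that instant.
public class ExpiringSwitch {
    public final boolean enabled;
    public final long expiresAtMs; // the switch is trusted only before this time

    public ExpiringSwitch(boolean enabled, long expiresAtMs) {
        this.enabled = enabled;
        this.expiresAtMs = expiresAtMs;
    }

    /** A stale cached switch fails closed: past its expiry it reads as off. */
    public boolean isOn(long nowMs) {
        return enabled && nowMs < expiresAtMs;
    }
}
```

The switch fails closed: if the ramp-up window passes without a freshly delivered config, the optimization simply stays off instead of crash-looping on a stale cache.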

From the user's perspective, compare before and after adding the configuration validity period:

(Figure: before/after comparison from the user's perspective)

5. Conclusion

We have now walked through the more common cold-start cost cases in Android apps. The biggest pain point of startup optimization, however, is usually the app's own business code, and task allocation has to follow business needs. Relying solely on preloading, delayed loading and asynchronous loading cannot fundamentally solve the problem: the time cost does not disappear, it merely moves, and what follows may be startup regressions or broken features on low-end devices.

Performance optimization requires the user's perspective and a global perspective at once. If time-consuming tasks are pushed past the homepage's first frame simply because the indicator ends there, users will inevitably hit jank or even ANR right afterward. When splitting tasks, then, consider not only whether they will contend for resources with concurrent tasks, but also whether functional stability and performance are affected in each startup stage and for a while after startup, and verify on high-, mid- and low-end devices to at least rule out regressions.

1. Anti-degradation

Startup optimization is by no means a one-off job; it needs long-term maintenance and polishing. A single refactor of a basic library can wipe out all the gains overnight, so anti-degradation measures must be put in place as early as possible.

By adding tracking points at key positions, we can quickly narrow a regression down to an approximate location (such as the onCreate of xxActivity) and raise an alarm when online indicators degrade. This not only helps developers locate problems quickly but also covers regressions that appear only in specific online scenarios and cannot be reproduced offline. Since the duration of a single launch can fluctuate by as much as 20%, jumping straight into trace comparison may not even bracket the regression.

For example, when comparing traces of two startups, one of them may encounter IO congestion, causing a file reading operation to be significantly slower, while the other IO is normal. This will mislead developers to analyze these normal codes, while the code that actually causes the degradation may be covered up by the fluctuation.

2. Outlook

For common scenarios started by clicking an icon, the default initialization work will be performed in the Application. However, for some deeper functions, such as the customer service center and editing the delivery address, even if the user enters these pages directly at the fastest speed, it takes at least 1 second to operate. Therefore, the initialization work related to these functions can be postponed to after the Application, or even changed to lazy loading, depending on the importance of the specific function.

The startup scenarios of recalling/attracting new users through delivery and push usually account for a small proportion, but their business value is far greater than ordinary scenarios. Since the current startup time mainly comes from webview initialization and some homepage preloading related tasks, if the startup landing page does not require all basic libraries (such as H5 pages), then we can delay the loading of all unnecessary tasks, so that the startup speed can be greatly increased, and it can be truly opened in seconds.
