Startup performance is the face of the app user experience: a long startup process is likely to dampen users' interest in the app. Douyin has verified its significant impact on business metrics through degradation experiments on startup performance. With hundreds of millions of daily active users, an increase of a few hundred milliseconds in Douyin's startup time can translate into the loss of tens of thousands of retained users. Optimizing startup performance has therefore become the top priority of Douyin's Android basic technology team in the direction of experience optimization. In the previous article on startup performance optimization, covering theory and tools, we introduced Douyin's startup optimization from the perspectives of principles, methodology, and tooling. This article introduces the solutions and ideas behind Douyin's startup optimization from a practical perspective, through concrete case studies.

Preface

Startup refers to the entire process from the moment a user clicks the icon to the moment they see the first frame of the page, and the goal of startup optimization is to shorten this process. The startup process is fairly complex: in terms of processes and threads, it involves multiple cross-process communications and switches between multiple threads; in terms of the causes of time consumption, it includes CPU time, CPU scheduling, IO, lock waiting, and other types of cost. Complex as it is, the startup process can ultimately be abstracted into a linear process on the main thread, so optimizing startup performance amounts to shortening this linear main-thread process. Below, we walk through typical cases the team has encountered in startup optimization practice, organized as direct optimization of the main thread, indirect optimization of background threads, and global optimization. Along the way we also briefly introduce some of the better solutions in the industry.

Optimization case analysis

1. Main thread direct optimization

For the main thread, we introduce the optimizations in lifecycle order.

1.1 MultiDex Optimization

First, let's look at the first stage: Application's attachBaseContext. Apart from work such as assigning the Application Context, this stage generally carries little business code and is not expected to take much time. During actual testing, however, we found that on some models the first launch after installation took a very long time, and preliminary analysis showed that most of the time was spent in MultiDex.install. Detailed analysis confirmed that the issue is concentrated on Android 4.x devices and affects the first startup after installation and after every subsequent update. The root cause is a limitation of the dex instruction format: a single dex file can reference at most 65536 Java methods, so an app that exceeds this limit must be split into multiple dex files. The Dalvik virtual machine can only execute optimized odex files, and to speed up installation on 4.x devices the system only optimizes the app's first dex during installation. The remaining dex files are optimized on the first call to MultiDex.install, and that optimization is very time-consuming, which is what makes the first startup slow on 4.x devices.
This problem has several necessary conditions: the app is split into multiple dex files; only the first dex is optimized during installation; MultiDex.install is called during startup; and the Dalvik virtual machine can only load odex files. We clearly cannot break the first two conditions: an app of Douyin's size cannot realistically be reduced to a single dex, and we cannot change the system's installation process. The third condition, calling MultiDex.install during startup, is also hard to break: as the business expands it is difficult to fit all startup code into a single dex, and even if we managed to, it would be hard to maintain. We therefore chose to break the last condition, that "the Dalvik virtual machine can only load odex files", that is, to bypass Dalvik's restriction and load unoptimized dex directly. The core of this solution is the native function Dalvik_dalvik_system_DexFile_openDexFile_bytearray, which supports loading an unoptimized dex file from a byte array. A sketch of the optimization is shown below.
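The following is a minimal sketch of the idea, assuming the hidden 4.x Dalvik API DexFile.openDexFile(byte[]) (the Java entry point backed by the native function above); attaching the returned cookie to the ClassLoader's pathList is omitted, and all names here are for illustration only:

    import dalvik.system.DexFile;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.lang.reflect.Method;

    public final class RawDexLoader {

        // Open a secondary dex from raw bytes on Dalvik, skipping dexopt.
        public static Object openRawDex(String dexPath) throws Exception {
            byte[] dexBytes = readAll(new FileInputStream(dexPath));
            // Hidden 4.x API: private static native int openDexFile(byte[])
            Method openDexFile =
                    DexFile.class.getDeclaredMethod("openDexFile", byte[].class);
            openDexFile.setAccessible(true);
            // The returned "cookie" must then be attached to the
            // ClassLoader's pathList (omitted here) so classes resolve.
            return openDexFile.invoke(null, (Object) dexBytes);
        }

        private static byte[] readAll(InputStream in) throws IOException {
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } finally {
                in.close();
            }
        }
    }

The real BoostMultiDex implementation does considerably more around this call, such as persisting loaded state and falling back to the regular path on systems where the hidden entry point is absent.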
For more details about MultiDex optimization, please refer to our earlier public account article. The solution is now open source; see the project's GitHub page (https://github.com/bytedance/BoostMultiDex).

1.2 ContentProvider Optimization

Next, let's look at the optimization of ContentProvider. As one of the four major components of Android, ContentProvider is unique in its lifecycle: Activity, Service, and BroadcastReceiver are instantiated and run their lifecycles only when they are invoked, whereas a ContentProvider is instantiated automatically during process startup even if nothing ever calls it. During process initialization, after Application's attachBaseContext has been called, the installContentProviders method installs all ContentProviders of the current process: it instantiates them one by one in a loop, calls their attachInfo and onCreate lifecycle methods, and finally publishes the resulting ContentProviderHolders to the AMS process in one batch.

Because ContentProviders are initialized automatically during process startup, they serve not only as a cross-process communication component but also as an automatic initialization hook for some modules. The most typical example is the official Lifecycle component, which initializes itself through a ContentProvider called ProcessLifecycleOwnerInitializer, roughly as follows (abridged):

    public class ProcessLifecycleOwnerInitializer extends ContentProvider {
        @Override
        public boolean onCreate() {
            // Only registers ActivityLifecycleCallbacks; inexpensive.
            LifecycleDispatcher.init(getContext());
            ProcessLifecycleOwner.init(getContext());
            return true;
        }
        // query/insert/update/delete/getType are empty stubs.
    }

Since Lifecycle's initialization merely registers the Activity LifecycleCallbacks, it does not take much time, and we do not need to optimize it at the logical level. It is worth noting, though, that if many ContentProviders are used for initialization, the creation and lifecycle execution of the providers themselves adds up; for that problem, we can use the Startup library provided by Jetpack to aggregate multiple initialization ContentProviders into one.

Besides this type of near-free ContentProvider, we also found some genuinely time-consuming ones during optimization. For our own ContentProviders, if initialization is expensive, we can refactor automatic initialization into on-demand initialization. For third-party or even official ContentProviders, direct refactoring is not possible; we take the official FileProvider as an example to introduce our optimization ideas.

FileProvider usage

FileProvider is the component for controlled file access that Android 7.0 made effectively mandatory: before it, cross-process file operations such as taking photos passed the file Uri directly. With FileProvider, the flow becomes: declare the FileProvider and its paths XML in the manifest, convert the File into a content Uri via FileProvider.getUriForFile, grant the receiver temporary access, and pass the Uri across processes.
Time consumption analysis

Given the flow above, as long as we do not touch FileProvider during the startup phase, there should be no FileProvider-related cost. In the startup trace, however, we do see FileProvider time, specifically in FileProvider's attachInfo lifecycle method. Besides calling the familiar onCreate, attachInfo also calls getPathStrategy, and that is where the time is concentrated. Implementation-wise, getPathStrategy parses the XML file associated with the FileProvider and stores the result in the mStrategy field. Further analysis shows that mStrategy is only used for file path verification in FileProvider's query, getType, openFile, and other interfaces, none of which are called during startup. The getPathStrategy call in attachInfo is therefore entirely unnecessary at that point: we can just as well run the getPathStrategy logic when query, getType, openFile, and the other interfaces are first called.

Optimization plan

FileProvider is androidx code, so we cannot modify it directly, but it does participate in our compilation, so we can modify its implementation by rewriting the bytecode at compile time. The rewritten attachInfo, sketched below with illustrative field names (the actual change is applied to bytecode), looks like this:
    public void attachInfo(@NonNull Context context, @NonNull ProviderInfo info) {
        super.attachInfo(context, info);

        // Sanity checks kept from the original androidx implementation.
        if (info.exported) {
            throw new SecurityException("Provider must not be exported");
        }
        if (!info.grantUriPermissions) {
            throw new SecurityException("Provider must grant uri permissions");
        }

        // The original implementation eagerly parsed the paths XML here:
        //     mStrategy = getPathStrategy(context, info.authority);
        // The rewritten version only records what the lazy path needs
        // (mContext/mAuthority are illustrative fields).
        mContext = context;
        mAuthority = info.authority;
    }
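Correspondingly, the data-access entry points gain a lazy initializer. A sketch, where ensurePathStrategy is our illustrative name and the surrounding logic follows the original androidx implementation in simplified form:

    private PathStrategy ensurePathStrategy() {
        // attachInfo no longer parses the paths XML, so the first accessor
        // call pays that cost instead (real code should synchronize here).
        if (mStrategy == null) {
            mStrategy = getPathStrategy(mContext, mAuthority);
        }
        return mStrategy;
    }

    @Override
    public String getType(@NonNull Uri uri) {
        final File file = ensurePathStrategy().getFileForUri(uri);
        // Simplified MIME lookup based on the file extension.
        final String ext = MimeTypeMap.getFileExtensionFromUrl(file.getName());
        final String mime = MimeTypeMap.getSingleton().getMimeTypeFromExtension(ext);
        return mime != null ? mime : "application/octet-stream";
    }

query and openFile are adjusted the same way, so the XML parsing happens at most once, on first real use instead of at process startup.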
Although a single FileProvider does not take much time, large apps may declare multiple FileProviders for module decoupling, in which case the benefit of this optimization is considerable. Similar to FileProvider, the WorkManager provided by Google also ships an initialization ContentProvider, and we can optimize it with the same method.

1.3 Startup task reconstruction and task scheduling

The third stage of startup is Application's onCreate, the peak period of startup task execution. Optimization here targets the various startup tasks and is strongly business-specific, so we only outline our general approach. The core idea of Douyin's startup task optimization is to maximize code value and resource utilization. Maximizing code value means deciding which tasks deserve to run during startup, with the goal of removing from the startup phase every task that does not belong there; maximizing resource utilization means, once the startup task set is fixed, using system resources as fully as possible to shorten execution. For a single task, we optimize its internal implementation to reduce its own resource consumption and leave more resources for other tasks; for multiple tasks, we make full use of system resources through reasonable scheduling. In practice we focus on two things: startup task refactoring and task scheduling.

Startup task refactoring

Owing to high business complexity and historically loose control over startup tasks, Douyin had more than 300 tasks in its startup phase. In this situation, scheduling alone can improve startup speed to a degree, but not to a higher level, so reducing startup tasks is a crucial direction. To that end we divide startup tasks into three categories: configuration tasks, preloading tasks, and functional tasks. Configuration tasks mainly initialize various SDKs, which cannot work before they run; preloading tasks warm up functionality that runs later, to make it faster when it does; functional tasks are feature-related tasks executed during the process startup lifecycle. We apply a different refactoring strategy to each category, based on whether its work genuinely has to happen during startup.
Task scheduling

Task scheduling has been covered extensively in the industry, so we will not repeat task dependency analysis and task orchestration here. The common foundation of all these schemes is a scheduler that executes independent tasks concurrently while respecting their dependencies, and Douyin's practice layers its own refinements on top of that foundation. A minimal sketch of the foundation follows.
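The sketch below assumes all tasks can run on a worker pool and ignores priorities, main-thread tasks, and timeout handling, which a production scheduler must add:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    final class StartupTask {
        final String name;
        final Runnable action;
        final List<StartupTask> deps;
        final CountDownLatch done = new CountDownLatch(1);

        StartupTask(String name, Runnable action, StartupTask... deps) {
            this.name = name;
            this.action = action;
            this.deps = Arrays.asList(deps);
        }
    }

    final class StartupScheduler {
        private final ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Submit every task; each one blocks until its dependencies finish,
        // so independent tasks run concurrently and the DAG order holds.
        // Tasks must be submitted in topological order (dependencies before
        // dependents) to avoid starving the fixed-size pool.
        void run(List<StartupTask> tasks) {
            for (StartupTask task : tasks) {
                pool.execute(() -> {
                    try {
                        for (StartupTask dep : task.deps) {
                            dep.done.await();
                        }
                        task.action.run();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        task.done.countDown();
                    }
                });
            }
        }
    }

On top of this skeleton, a real scheduler adds main-thread dispatch, task priorities, and per-task timing instrumentation.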
1.4 Activity Phase Optimization

The previous stages all belong to the Application phase; next, let's look at optimizations in the Activity phase. Here we introduce two typical cases: merging Splash and Main, and deserialization optimization.

1.4.1 Splash and Main merge

First, the merger of SplashActivity and MainActivity. In earlier versions, Douyin's launcher activity was SplashActivity, which mainly carried splash-screen logic such as advertisements and activities. In general, the startup flow was: the user clicks the icon, SplashActivity starts and shows the splash content, and SplashActivity then starts MainActivity, which renders the home feed.
In this flow, startup pays for two Activity launches. If the two Activities are merged, we gain two benefits: first, we save a complete Activity start, including the extra round of communication with AMS and the second Activity's lifecycle and rendering; second, splash logic and home-page preparation can overlap inside a single Activity instead of running strictly one after the other.
To merge Splash and Main, we need to solve two main problems: first, the launcher entry and various external jumps still target SplashActivity, so that entry must be preserved; second, the merged Activity can no longer use the original launch modes, so a suitable launchMode must be found.
The first problem is relatively easy to solve: we can use an activity-alias with targetActivity to point SplashActivity at MainActivity. Now for the second problem.

The launchMode problem

Before the merge, the launchModes of SplashActivity and MainActivity were standard and singleTask respectively. That guaranteed a single MainActivity instance, and pressing Home and re-entering the app returned to the previous page. After merging, the launcher Activity becomes MainActivity. If we kept singleTask, then after leaving a secondary page via Home and clicking the icon again, we would land on the Main page instead of returning to the secondary page. So the merged MainActivity can no longer use singleTask; after investigation, we settled on singleTop as its launchMode.

Multiple instance issues

1. Multiple instances started internally

Although singleTop solves the problem of not returning to the previous page after leaving via Home, it introduces the possibility of multiple MainActivity instances. Some of Douyin's logic is strongly tied to MainActivity's lifecycle and misbehaves if several instances exist, and redundant instances also waste resources, so we wanted to rule them out. Our solution is to add the FLAG_ACTIVITY_NEW_TASK and FLAG_ACTIVITY_CLEAR_TOP flags to every Intent in the app that starts MainActivity, which yields clear-top behavior similar to singleTask.

With FLAG_ACTIVITY_NEW_TASK + FLAG_ACTIVITY_CLEAR_TOP, internally started multiple instances are basically eliminated. During testing, however, we found that on some systems the problem persisted even with clear-top in place. Analysis showed that on those systems, even though SplashActivity is aliased to MainActivity via activity-alias plus targetActivity, the AMS side still records the started Activity as SplashActivity; when MainActivity is started again later, AMS believes no MainActivity existed and starts another one. We solved this by rewriting the Component of any Intent that starts MainActivity, changing it from MainActivity to SplashActivity, which completely removes the internally triggered multiple instances.

To intrude on the business as little as possible, and to keep future iterations from reintroducing the problem, we instrumented calls to Context's startActivity: for calls that start MainActivity, the flags are added and the Component is replaced before the original implementation runs, as sketched below. We chose instrumentation because Douyin's code structure is complex, with multiple base Activity classes, some of which we cannot modify directly; a project without that constraint can simply override the startActivity method of its base Activity and Application classes.
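A simplified version of what the instrumented call sites do (class names here are illustrative; in Douyin the rewrite is applied to bytecode at compile time):

    import android.content.ComponentName;
    import android.content.Context;
    import android.content.Intent;

    public final class MainActivityLauncher {
        // Illustrative names, not Douyin's real component names.
        private static final String MAIN = "com.example.MainActivity";
        private static final String SPLASH_ALIAS = "com.example.SplashActivity";

        public static void startActivity(Context context, Intent intent) {
            ComponentName cn = intent.getComponent();
            if (cn != null && MAIN.equals(cn.getClassName())) {
                intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK
                        | Intent.FLAG_ACTIVITY_CLEAR_TOP);
                // Rewrite the target to the alias so AMS matches the record
                // it created when the launcher started "SplashActivity".
                intent.setComponent(
                        new ComponentName(cn.getPackageName(), SPLASH_ALIAS));
            }
            context.startActivity(intent);
        }
    }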
2. Multiple instances started externally

The solution above relies on modifying the Intent before the Activity is started, so it obviously cannot handle MainActivity being started from outside the app. Is there another way for the external case? Let's return to the starting point: we avoid multiple MainActivity instances to prevent several MainActivity objects from existing at the same time and running lifecycle logic in unexpected ways. So as long as we guarantee there is never more than one MainActivity object at a time, the problem is solved.

To do that, we first need to know whether a MainActivity object currently exists. That part is simple: we monitor the Activity lifecycle and increment and decrement an instance counter in MainActivity's onCreate and onDestroy; a count of zero means no MainActivity exists. With counting solved, we need to keep the number of simultaneous MainActivity objects at no more than one, which requires revisiting the Activity start flow. Starting an Activity first goes through AMS; AMS then calls back into the Activity's process, where the request is posted to the main thread through its Handler, and the Activity object is finally created via Instrumentation before its lifecycle runs. For an externally started MainActivity, we can intervene in the portion after AMS returns to our process, and Instrumentation's newActivity is a suitable entry point. Specifically, our optimization plan is sketched below.
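A minimal sketch of the hook, assuming a counter class and a lightweight stub Activity (MainActivityCounter and RedirectStubActivity are illustrative; the wrapper is installed by replacing ActivityThread's mInstrumentation via reflection, which relies on internal APIs):

    import android.app.Activity;
    import android.app.Instrumentation;
    import android.content.Intent;
    import java.util.concurrent.atomic.AtomicInteger;

    public class MainActivityGuardInstrumentation extends Instrumentation {
        private static final String MAIN = "com.example.MainActivity";

        @Override
        public Activity newActivity(ClassLoader cl, String className, Intent intent)
                throws InstantiationException, IllegalAccessException,
                ClassNotFoundException {
            if (MAIN.equals(className) && MainActivityCounter.count() > 0) {
                // A MainActivity already exists (e.g. an external launch):
                // instantiate a lightweight stub that forwards the Intent to
                // the existing instance and immediately finishes itself.
                return super.newActivity(
                        cl, RedirectStubActivity.class.getName(), intent);
            }
            return super.newActivity(cl, className, intent);
        }
    }

    // Illustrative counter maintained from MainActivity's onCreate/onDestroy.
    final class MainActivityCounter {
        private static final AtomicInteger COUNT = new AtomicInteger();
        static void onCreated() { COUNT.incrementAndGet(); }
        static void onDestroyed() { COUNT.decrementAndGet(); }
        static int count() { return COUNT.get(); }
    }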
It should be noted that on higher versions of Android, the Instrumentation hook here can be replaced with AppComponentFactory's instantiateActivity.

1.4.2 Deserialization optimization

Another typical Activity-stage optimization in Douyin concerns deserialization. Douyin serializes some data to local storage, and that data must be deserialized during startup, which affects startup speed. Previously we optimized the blocking logic case by case at the business level, for example with asynchronous loading and snapshots, with good results; but that approach is hard to maintain and tends to regress as the code iterates. So we tried to attack deserialization itself.

Douyin's startup-phase deserialization problem is specifically the cost of Gson parsing. Gson, Google's JSON parsing library, has low integration cost, is convenient to use, and extends well, but it has one obvious weakness: the first parse of a model is slow, and the cost grows with the model's complexity. This first-parse cost comes from Gson's design. The central role in Gson's parsing is the TypeAdapter: for each class of object to be parsed, Gson first builds a TypeAdapter for it and then parses with that adapter. By default the TypeAdapter comes from ReflectiveTypeAdapterFactory, and both its creation and its parsing involve heavy reflection: roughly, Gson reflects over the class to enumerate its fields, builds a bound-field table that resolves each field's serialized name and type (recursively obtaining adapters for the field types), constructs the instance through a reflective ObjectConstructor, and assigns each matching JSON key's value to its field reflectively.
The core of optimizing Gson's first-parse cost is therefore to reduce reflection. Here are some of the solutions used in Douyin.

Custom TypeAdapter optimization

From Gson's source we know that parsing uses a chain of responsibility: if some TypeAdapterFactory ahead of ReflectiveTypeAdapterFactory can handle a class, ReflectiveTypeAdapterFactory is never reached, and Gson supports injecting custom TypeAdapterFactory instances. So one of our optimizations injects a custom TypeAdapterFactory: at compile time it generates a custom TypeAdapter for every class to be optimized, with parsing code emitted for each field of the class so that reflection is avoided. For the bytecode processing involved in generating these adapters we used ByteX (https://github.com/bytedance/ByteX/blob/master/README_zh.md), an open-source bytecode processing framework developed by the Douyin team. The entry point is the custom factory registered with Gson:
    public class GsonOptTypeAdapterFactory extends BaseAdapterFactory {
        // Sketch: BaseAdapterFactory is our internal base class implementing
        // TypeAdapterFactory; the registry lookup below is illustrative.
        @Override
        public <T> TypeAdapter<T> create(Gson gson, TypeToken<T> type) {
            // Return the compile-time-generated adapter for processed
            // classes; returning null hands the class to the next factory
            // in Gson's chain (eventually ReflectiveTypeAdapterFactory).
            return GeneratedAdapters.lookup(gson, type);
        }
    }

Optimize the ReflectiveTypeAdapterFactory implementation

The custom TypeAdapter approach above cuts Gson's first-parse time by about 70%, but it generates parsing code at compile time, which increases package size and has its limits. So we also tried optimizing the Gson framework itself; to keep integration cost low, we modified the implementation of ReflectiveTypeAdapterFactory through bytecode rewriting. The original ReflectiveTypeAdapterFactory reflects over all of a class's field information before parsing any actual data, yet not every field appears in the data being parsed. Take the Person class below: before parsing a Person, the Person, Hometown, and Job classes are all processed, but the actual input may contain nothing more than a simple name, in which case processing Hometown and Job is pure waste; the more complex those classes are, the more time is wasted.

    class Person {
        String name;        // simple field, parsed eagerly
        Hometown hometown;  // complex field: structure not needed for simple inputs
        Job job;            // complex field: structure not needed for simple inputs
    }

Our answer to this situation is "on-demand parsing". Taking Person as the example: when parsing Person's class structure, the basic-typed name field is processed normally, while for the complex hometown and job fields we only record their class types and return a wrapper TypeAdapter. Only when the actual data really contains a hometown or job node do we parse the class structure of Hometown or Job. This optimization is especially effective when the class structure is complex but many nodes are absent from the actual data; in some Douyin scenarios the improvement approaches 80%.

Other optimization solutions

The two typical solutions above are not the only ones; in the actual optimization process Douyin tried further solutions that achieved good effects in specific scenarios.
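For concreteness, here is a hand-written approximation of the kind of adapter the compile-time generation produces for Person (illustrative only; the real code is emitted by the ByteX plugin):

    import com.google.gson.Gson;
    import com.google.gson.TypeAdapter;
    import com.google.gson.stream.JsonReader;
    import com.google.gson.stream.JsonToken;
    import com.google.gson.stream.JsonWriter;
    import java.io.IOException;

    public final class PersonTypeAdapter extends TypeAdapter<Person> {
        private final Gson gson;

        public PersonTypeAdapter(Gson gson) {
            this.gson = gson;
        }

        @Override
        public Person read(JsonReader in) throws IOException {
            if (in.peek() == JsonToken.NULL) {
                in.nextNull();
                return null;
            }
            Person p = new Person();
            in.beginObject();
            while (in.hasNext()) {
                switch (in.nextName()) {
                    case "name":
                        p.name = in.nextString();   // no reflection
                        break;
                    case "hometown":
                        // Complex field: its adapter is only requested when
                        // the node actually appears in the data, matching
                        // the on-demand idea described above.
                        p.hometown = gson.getAdapter(Hometown.class).read(in);
                        break;
                    case "job":
                        p.job = gson.getAdapter(Job.class).read(in);
                        break;
                    default:
                        in.skipValue();
                }
            }
            in.endObject();
            return p;
        }

        @Override
        public void write(JsonWriter out, Person value) throws IOException {
            if (value == null) {
                out.nullValue();
                return;
            }
            out.beginObject();
            out.name("name").value(value.name);
            out.name("hometown");
            gson.getAdapter(Hometown.class).write(out, value.hometown);
            out.name("job");
            gson.getAdapter(Job.class).write(out, value.job);
            out.endObject();
        }
    }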
1.5 UI rendering optimization

After the Activity stage, let's look at optimizations in the UI rendering stage, focusing on View loading. Generally there are two ways to create a View: building it directly in code, or loading an XML file through LayoutInflater. Here we focus on optimizing LayoutInflater's XML loading, which consists of three steps: reading and parsing the layout resource (IO), walking the parsed element tree, and reflectively instantiating each View and applying its attributes.
Taken together these steps are expensive. At the business level we can reduce the cost by flattening XML hierarchies, loading on demand with ViewStub, and so on; such measures optimize XML loading time to a degree. Here we introduce another fairly general approach: asynchronous preloading. Take the root view of one of our fragments as an example: it is inflated during the measure phase of UI rendering, and there is a window of time between application startup and measure. We can use that window to inflate these views into memory on a background thread, then simply fetch them from memory during measure.

x2c solves the lock problem

androidx already provides AsyncLayoutInflater for inflating XML asynchronously, but in practice using it directly is prone to lock contention and can even make things slower. Analysis shows that LayoutInflater holds an object lock, and even if we sidestep it with separate LayoutInflater objects, further locks exist in the AssetManager layer and the native layer. Our solution is xml2code (x2c): at compile time, code that creates the View is generated for each annotated XML file, and the View is then pre-created asynchronously from that code. x2c not only removes the multi-thread lock problem but also makes pre-creation itself faster. The solution is still being polished and will be described in detail once it is ready.

The LayoutParams problem

Besides locking, asynchronous inflation has a LayoutParams problem. LayoutInflater mainly relies on its root parameter for the view's LayoutParams: when root is non-null, inflate builds a root-appropriate LayoutParams from the XML's layout attributes and sets it on the view. During asynchronous inflation we cannot access the real root layout, and if we pass a null root the inflated view's LayoutParams is null; when the view is later added to its parent, default values are used and the attributes written in XML are lost. The relevant framework logic, abridged:

    public View inflate(XmlPullParser parser, @Nullable ViewGroup root, boolean attachToRoot) {
        // ... framework code (abridged) ...
        ViewGroup.LayoutParams params = null;
        if (root != null) {
            // layout_* attributes become LayoutParams only when a root is given.
            params = root.generateLayoutParams(attrs);
            if (!attachToRoot) {
                temp.setLayoutParams(params);
            }
        }
        // ... framework code (abridged) ...
    }

The fix is to construct a dummy root of the corresponding type during preloading so that the inflated view's attributes are parsed correctly; a sketch follows at the end of this subsection.

Other issues

Beyond the lock and LayoutParams problems, we hit several other issues during preloading, which we handled case by case.
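A minimal sketch of dummy-root preloading, assuming the fragment's real container is a FrameLayout (the cache, executor, and layout ids are illustrative):

    import android.content.Context;
    import android.view.LayoutInflater;
    import android.view.View;
    import android.view.ViewGroup;
    import android.widget.FrameLayout;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executor;

    final class ViewPreloader {
        private final Map<Integer, View> cache = new ConcurrentHashMap<>();

        void preload(final Context appContext, final int layoutId, Executor executor) {
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    // Clone the inflater so no state is shared with the
                    // UI thread's inflater.
                    LayoutInflater inflater =
                            LayoutInflater.from(appContext).cloneInContext(appContext);
                    // Dummy root of the container's type: layout_* attributes
                    // resolve into real LayoutParams instead of being dropped.
                    ViewGroup fakeRoot = new FrameLayout(appContext);
                    cache.put(layoutId, inflater.inflate(layoutId, fakeRoot, false));
                }
            });
        }

        // Returns and removes the pre-created view, or null on cache miss
        // (the caller then inflates synchronously as before).
        View obtain(int layoutId) {
            return cache.remove(layoutId);
        }
    }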
1.6 Time-consuming main-thread message optimization

We have now covered the major lifecycle stages of the main thread. In Douyin's practice we also found that time-consuming main-thread messages posted between these stages, for example between Application and Activity, or between Activity and first rendering, delay the subsequent lifecycle and slow startup, so they need optimizing too.

1.6.1 Main-thread message scheduling

Messages posted by our own code are relatively easy to optimize; others, however, come from the internals of third-party SDKs, which we can hardly touch, and even for our own messages, guarding every optimization against later regressions is expensive. So we approached the problem from another angle: in addition to optimizing the posting of main-thread messages, we adjust the main-thread message queue so that startup-related messages are executed first. The core principle is to determine the critical startup path from the app's startup flow and use queue adjustment to ensure that, in cold-start scenarios, the relevant messages are scheduled with priority, thereby improving startup speed. A sketch of the underlying queue-adjustment primitive follows.
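A hedged sketch of one way to promote a pending startup-critical message to the head of the main-thread queue; MessageQueue.mMessages, Message.next, and Message.when are private framework fields, so this depends on reflection, is version-sensitive, and is subject to hidden-API restrictions on newer Android versions:

    import android.os.Looper;
    import android.os.Message;
    import android.os.MessageQueue;
    import java.lang.reflect.Field;

    public final class MessagePromoter {

        public interface Filter {
            boolean isStartupCritical(Message m);
        }

        public static void promote(Filter filter) throws Exception {
            MessageQueue queue = Looper.getMainLooper().getQueue(); // API 23+
            Field head = MessageQueue.class.getDeclaredField("mMessages");
            Field next = Message.class.getDeclaredField("next");
            Field when = Message.class.getDeclaredField("when");
            head.setAccessible(true);
            next.setAccessible(true);
            when.setAccessible(true);

            // The framework synchronizes queue mutations on the queue itself.
            synchronized (queue) {
                Message prev = null;
                Message cur = (Message) head.get(queue);
                while (cur != null) {
                    if (filter.isStartupCritical(cur)) {
                        if (prev == null) {
                            return; // already at the head
                        }
                        Message first = (Message) head.get(queue);
                        // Unlink and move to the head; also pull the delivery
                        // time forward so head-of-queue ordering stays valid.
                        next.set(prev, next.get(cur));
                        when.setLong(cur, first.getWhen());
                        next.set(cur, first);
                        head.set(queue, cur);
                        return;
                    }
                    prev = cur;
                    cur = (Message) next.get(cur);
                }
            }
        }
    }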
1.6.2 Optimizing time-consuming main-thread messages

Through main-thread message scheduling we can mitigate the impact of main-thread messages on startup speed to a degree, but it has clear limitations: first, scheduling only defers the expensive messages rather than removing them, so they still run and can cause jank right after the first frame; second, it depends on recognizing which pending messages matter, which is not always reliable for messages posted by third-party SDKs or the system.
For these two reasons, we still need to optimize the time-consuming main-thread messages of the startup phase themselves. Most of them are highly business-specific: the main-thread stacks from the trace tool point straight at the problem logic, which can then be optimized directly. Here we focus on a case other products are likely to hit as well: main-thread time caused by WebView initialization.

During optimization we found a time-consuming main-thread message whose top frame was WebViewChromiumAwInit.startChromiumLocked, which is system WebView code. Reading the WebView source shows it is posted to the main thread from WebViewChromiumAwInit's ensureChromiumStartedLocked and executes once per process lifetime. Whether the trigger is called on the main thread or a worker thread, the work ends up posted to the main thread, so changing the calling thread cannot cure the main-thread jank; and since this is system code we cannot change the implementation either, leaving only the business layer's usage to optimize.

    void ensureChromiumStartedLocked(boolean onMainThread) {
        // (system WebView code, abridged) Regardless of onMainThread, the
        // heavy startChromiumLocked() work is posted to the UI thread.
    }

Problem location

Further analysis showed that the first call to various WebViewFactoryProvider methods, such as getStatics, getGeolocationPermission, and createWebView, triggers ensureChromiumStartedLocked and thus the expensive main-thread post; so the problem becomes locating the calls to WebViewFactoryProvider methods. One approach is instrumentation, but since WebViewFactoryProvider is not a class applications can access directly, every call to it goes through other framework code; we would have to analyze all of the framework's call sites into WebViewFactoryProvider and instrument our calls into those call sites with logging, which is costly and easy to get wrong. There is a more convenient route for this case: WebViewFactoryProvider is an interface, obtained reflectively inside the WebView machinery, so we can wrap it in a dynamically generated proxy, swap the proxy into WebViewFactory, filter on method names, and print the call stack for whitelisted methods. This is how we finally located the main-thread trigger: the acquisition of the WebView UA. The probe is sketched below.
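A hedged sketch of the probe (WebViewFactory, its getProvider method, and the sProviderInstance field are framework internals that vary across versions and are restricted by hidden-API rules on Android 9+, so treat this as debug-only tooling):

    import android.util.Log;
    import java.lang.reflect.Field;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public final class WebViewProviderProbe {
        private static final Set<String> WATCHED = new HashSet<>(Arrays.asList(
                "getStatics", "getGeolocationPermission", "createWebView"));

        public static void install() throws Exception {
            Class<?> factory = Class.forName("android.webkit.WebViewFactory");
            Method getProvider = factory.getDeclaredMethod("getProvider");
            getProvider.setAccessible(true);
            final Object real = getProvider.invoke(null);

            Class<?> itf = Class.forName("android.webkit.WebViewFactoryProvider");
            Object proxy = Proxy.newProxyInstance(itf.getClassLoader(),
                    new Class<?>[]{itf}, (p, method, args) -> {
                        if (WATCHED.contains(method.getName())) {
                            // Dump the caller stack to find who triggers init.
                            Log.w("WebViewProbe", "provider." + method.getName(),
                                    new Throwable());
                        }
                        return method.invoke(real, args);
                    });

            Field f = factory.getDeclaredField("sProviderInstance");
            f.setAccessible(true);
            f.set(null, proxy);
        }
    }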
Solution

Having confirmed that the cost comes from obtaining the WebView UA, we can solve it with a local cache. The WebView UA records information such as the WebView version, which rarely changes, so we cache it locally and read the local copy from then on; each time the app goes to the background we fetch the WebView UA again and refresh the cache, so that changes are picked up without causing jank during use. The cache can still be momentarily stale when the WebView UA changes. If real-time accuracy is critical, the UA can instead be obtained from a child process through a child-process ContentProvider: the initialization then blocks the child process's main thread rather than the foreground process's. That approach, however, requires starting a child process and waiting for a complete round trip, so its read speed is distinctly worse than a local cache and it does not suit speed-sensitive scenarios; choose according to actual needs.

2. Backend task optimization

The cases so far mostly concern time spent directly on the main thread. Beyond that, background tasks also affect startup speed, because they compete with foreground tasks for CPU, IO, and other resources and stretch foreground execution. So while optimizing foreground time we must optimize our background tasks too. Background task optimization is highly business-specific, but some common principles can be summarized, such as deferring background work that startup does not need, converging and capping thread pools, and lowering background thread priorities so they do not contend with the critical path.
Beyond these general principles, here are two fairly typical background task cases from Douyin.

2.1 Process startup optimization

During optimization we must watch not only background threads but also background processes. Most applications today have push capability, and to cut background power consumption and avoid being killed for excessive memory, push functionality generally lives in a separate process. If the push process starts during the startup phase, it takes a sizable bite out of startup speed, so we postpone its start as appropriate to keep it out of the startup phase. Offline, we can filter logcat for keywords such as "Start proc" to see whether any child process starts during startup and which component triggered it. In complex projects or third-party SDKs, knowing the component may still not reveal the triggering code, so we instrument the component invocations of Service, BroadcastReceiver, and ContentProvider to log the call stack, and combine that with the component named in "Start proc" to pinpoint the trigger precisely. Besides the processes declared in the manifest, there may also be processes forked from native code; these can be discovered with adb shell ps.

2.2 GC suppression

Another typical case of background work hurting startup is GC. A triggered GC can seize CPU resources and even suspend our threads; with heavy GC activity during startup, startup speed suffers badly. One remedy is to execute less startup code and allocate less memory, which requires reworking our implementations and is the most fundamental fix for GC's impact on startup. In parallel, we can reduce that impact through the general technique of GC suppression: suppressing certain GC types during the startup phase. Recently the company's Client Infrastructure-App Health team investigated a GC suppression solution for the ART virtual machine and trialed it on some of the company's products to speed up startup; the technical details will be shared on the "ByteDance Terminal Technology" official account once polished.

3. Global optimization

The cases above each target the hot spots of a particular stage. There are also operations that are individually cheap but so frequent that they affect the whole startup, such as class loading in high-frequency business code and method execution efficiency. Here are some of Douyin's optimization attempts in these areas.

3.1 Class loading optimization

3.1.1 ClassLoader optimization

First, a Douyin case on class loading. Class loading is inseparable from the parent delegation mechanism, so let's briefly review the loading flow under it: first, look among the already-loaded classes, and return the class directly if it is found.
If it is not found, call the parent ClassLoader's loadClass to search; if the parent can find the class it is returned directly, otherwise findClass is called to load it. The classic implementation:

    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        // First, check whether the class has already been loaded.
        Class<?> c = findLoadedClass(name);
        if (c == null) {
            try {
                if (parent != null) {
                    // Delegate to the parent ClassLoader first.
                    c = parent.loadClass(name, false);
                }
            } catch (ClassNotFoundException e) {
                // Parent could not find the class; fall through.
            }
            if (c == null) {
                // Still not found: load it ourselves.
                c = findClass(name);
            }
        }
        return c;
    }

ClassLoader in Android

In an Android app, classes are normally loaded by PathClassLoader, which loads the application's own dex files and whose parent is BootClassLoader, the loader of the framework classes.

Optimization of class loading by the ART virtual machine

ART still follows parent delegation for class loading but optimizes the implementation. Roughly: loadClass first calls findLoadedClass, which enters the native layer; if the ClassLoader chain consists of nothing but a PathClassLoader delegating to BootClassLoader, ART looks the class up directly from the loader's DexFiles in the native layer and defines it there, instead of returning to the Java layer for the usual delegation.
In other words, when there is only a PathClassLoader on the ClassLoader chain, the Java-layer call to findLoadedClass does more than its name suggests: it loads the class directly through DexFile in the native layer. Compared with returning to the Java layer to call findClass and then crossing back into native code to load through DexFile, this saves an unnecessary JNI round trip and runs more efficiently. This is ART's optimization of class loading efficiency.

The ClassLoader model in Douyin

Having covered Android's class loading mechanism, what did we actually optimize? To answer that, we need to look at Douyin's ClassLoader model. To reduce package size, Douyin delivers some non-core functionality dynamically as plug-ins. With the plug-in framework integrated, Douyin's ClassLoader model is roughly this: a DelegateClassLoader is inserted between PathClassLoader and BootClassLoader, each plug-in is loaded by its own plug-in ClassLoader, and DelegateClassLoader routes lookups for plug-in classes to the corresponding plug-in ClassLoader, so the host can reach plug-in classes through the normal delegation chain.
This ClassLoader model has a very obvious advantage: it easily supports class isolation, class reuse, and switching a module between plug-in and built-in component.
The ART class loading optimization is broken

The model also has a relatively hidden disadvantage: it defeats the ART class loading optimization described above. Once DelegateClassLoader sits between PathClassLoader and BootClassLoader, the chain no longer consists of a single PathClassLoader, so the native fast path is disabled; every class load falls back to the Java-layer delegation with its extra JNI transitions, and startup class loading slows down across the board.

Non-invasive optimization: delayed injection

Once the cause of the degraded class loading is understood, the optimization idea is clear: get DelegateClassLoader out from between PathClassLoader and BootClassLoader. From the earlier analysis, DelegateClassLoader exists so that PathClassLoader's loadClass can reach plug-in classes; in scenarios where no plug-in is used it is entirely unnecessary, so we could inject it only when plug-in functionality is about to be used. In practice, fully on-demand injection is hard, because we cannot reliably anticipate every plug-in load: a plug-in class may be loaded through a compileOnly implicit dependency, or a plug-in view referenced in XML may trigger a plug-in load, and adapting all of these would intrude heavily on business development. So we optimized with a different observation: although we cannot know exactly when plug-ins load, we do know where they do not. No plug-in loads during the Application stage, so we can wait until the Application stage completes before injecting DelegateClassLoader. Since startup class loading is concentrated in the Application stage, injecting after it greatly reduces the plug-in framework's impact on startup speed while avoiding intrusion into the business.

Invasive optimization: transforming the ClassLoader model

The delayed-injection solution requires no business intrusion and is cheap to adopt, but it only repairs class loading in the Application stage; the later stages still cannot enjoy ART's optimization. In pursuit of extreme performance we went one step further. The core idea is to remove DelegateClassLoader from between PathClassLoader and BootClassLoader entirely, and to support host-loads-plug-in-class scenarios by other means. Analysis shows the host loads plug-in classes in four main ways: through Class.forName; through compileOnly implicit dependencies; by starting the plug-in's four major components; and by referencing plug-in classes from XML layouts.
The problem therefore becomes how to support these four categories of host-loads-plug-in without injecting the ClassLoader. First, Class.forName. The most direct fix is to pass DelegateClassLoader explicitly as the ClassLoader argument of Class.forName, but that is unfriendly to business development and impossible in third-party SDK code we cannot modify. Our final solution is to instrument Class.forName call sites in bytecode and retry with DelegateClassLoader when the class load fails, as sketched below. Next, compileOnly implicit dependencies, which are harder to handle generically because there is no suitable point at which to catch the failed class load. Here we chose to rework the business code, converting compileOnly implicit-dependency calls into Class.forName, chiefly because that makes the plug-in boundary explicit and funnels all plug-in class loading through the single, already-instrumented Class.forName entry point.
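A sketch of what the instrumented call sites become (DelegateClassLoader.getInstance() is an illustrative entry point into the plug-in framework):

    public static Class<?> forNameWithPluginFallback(String className)
            throws ClassNotFoundException {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            // Host lookup failed: the class may live in a plug-in, so retry
            // through the plug-in framework's DelegateClassLoader.
            return Class.forName(className, true,
                    DelegateClassLoader.getInstance());
        }
    }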
Loading the plug-in's four major component classes and using plug-in classes in XML can be solved by the same means: replace the ClassLoader held by LoadedApk with DelegateClassLoader, so that both component class loading and the class loading performed when LayoutInflater parses XML go through DelegateClassLoader. For the principles behind this, refer to analyses of plug-in frameworks such as DroidPlugin and RePlugin; we will not repeat them here.

3.1.2 Class verify optimization

The ClassLoader work above optimizes the load stage of class loading; other stages can be optimized too, and a typical case is class verify. The class verify process mainly checks whether a class complies with the Java specification, throwing verification exceptions at the verify stage if it does not. Normally, classes in Android are verified when the app is installed or when a plug-in is loaded, but in some specific situations, such as plug-ins on Android 10 and later, plug-ins compiled with the extract compiler filter, or host-plug-in interdependencies that make static verification fail, verification happens at runtime instead. Besides checking the class itself, runtime verification also triggers loading of the classes it depends on, which costs time. In essence, class verify guards against bytecode delivered over the network; our plug-in code is already checked for legality during compilation, and even if an illegal class did appear, skipping verification merely moves the exception from the verify stage to the point of first use. We can therefore treat runtime class verify as unnecessary and optimize these class loads by turning it off. There are good solutions for this in the industry, for example locating the memory address of the runtime's verify_ flag and setting it to skip-verification mode; the flag is documented in the ART source as:

    // If kNone, verification is disabled. kEnable by default.

Of course, turning off class verify is not valuable for every application. Before optimizing, you can dump the classes that verify at runtime in the host and plug-ins with the oatdump command; only when a large number of classes verify at runtime is the optimization above worthwhile.

    oatdump --oat-file=xxx.odex > dump.txt

3.2 Other global optimizations

There are further, more general global optimization solutions beyond class loading; we mention them here only in passing for reference.
Summary and Outlook

So far we have walked through the typical and general cases of Douyin's startup optimization, and we hope they offer useful reference points for your own startup work. Looking back at all of Douyin's startup optimizations, general-purpose work accounts for only a small share; most optimizations are business-specific and cannot be migrated directly to other products. To close, we summarize and look ahead from the practice perspective.

Continuous iteration

Startup optimization is a process of continuous iteration and polishing. The first stage is usually rapid, coarse-grained optimization: the space is large and good returns come without much manpower. The second stage is the hard-problem stage: it demands more investment than the first, and the final improvement largely depends on it. The third stage is regression prevention plus continuous fine-grained optimization; it is the longest-lasting stage, matters greatly for fast-iterating products, and is the only road to extreme startup performance.

Scenario generalization

Startup optimization also needs to expand and generalize. We usually focus on the time from icon click to the home page's first frame, but as commercial splash and push-click scenarios grow, we must extend coverage to those scenarios too. Moreover, the first frame is often not what the user is waiting for: the user cares about when effective content appears, not when the page first renders. In Douyin's case, alongside startup speed we track the time to the first video frame, and in AB experiments that metric matters even more than startup speed. Other products can likewise define business-appropriate metrics, verify their impact on user experience, and decide whether optimization is warranted.

Global awareness

We usually measure startup performance by startup speed, and to improve it we may delay tasks or make them on-demand. This works, but it can damage the subsequent experience: for example, a background task pushed out of the startup stage may cause jank if its first use later happens on the main thread. So while focusing on startup performance we must watch the other metrics it may affect. On the performance side, we need a macro metric reflecting global performance, to avoid settling into local optima; on the business side, we need to connect startup performance to business outcomes. Concretely, during optimization we attach AB experiment support to every larger startup optimization.
On the one hand, experiments give us a qualitative check that prevents changes with local performance gains but global experience harm from shipping; on the other hand, they let us quantify each optimization's effect on the business and guide subsequent optimization directions. They also give us rollback capability to stop losses quickly when a change threatens stability or functionality. Volcano Engine, ByteDance's enterprise-level technical service platform, has opened its AB experiment capabilities to the public; interested readers can learn more on the Volcano Engine official website.

Full coverage and refined operations

Douyin's startup optimization has two major goals going forward. The first is to maximize coverage: in architecture, startup-stage code should have simple, clear dependencies and fine module granularity, keeping later optimization and iteration cheap; in experience, performance work should go hand in hand with improvements to interaction and content quality that raise the reach efficiency and quality of features; in scenarios, we should fully cover the various startup modes and landing pages, including cold, warm, and hot start; in optimization directions, we should span CPU, IO, memory, locks, and UI rendering. The second goal is refined operation of startup optimization, personalized to the point of "a thousand strategies for a thousand users": different startup strategies for different users, device capabilities and conditions, and startup scenarios, to maximize the experience.