TikTok Android Performance Optimization Series: Startup Optimization Practice

Startup performance is the face of an app's user experience: a slow startup is likely to dampen users' interest in the app. Douyin has verified its significant impact on business metrics through degradation experiments on startup performance. With hundreds of millions of daily active users, an increase of a few hundred milliseconds in Douyin's startup time can translate into the loss of tens of thousands of retained users. Optimizing startup performance has therefore become the top priority of the Douyin Android foundation team's experience optimization work.

In the previous article, Startup Performance Optimization: Theory and Tools, we introduced Douyin's startup performance optimization from the perspectives of principles, methodology, and tooling. This article introduces the solutions and ideas behind Douyin's startup optimization from a practical perspective, through concrete case studies.

Preface

Startup refers to the entire process from when a user clicks an icon to when they see the first frame of the page. The goal of startup optimization is to reduce the time spent in this process.

The startup process is relatively complex. In terms of processes and threads, it involves multiple cross-process communications and switches between multiple threads. In terms of where the time goes, it includes CPU work, CPU scheduling, I/O, lock waiting, and other kinds of cost. Complex as it is, the startup process can ultimately be abstracted into a linear sequence on the main thread, so optimizing startup performance comes down to shortening this linear main-thread sequence.

Next, we will walk through some typical cases the team encountered in startup optimization practice, organized as direct optimization of the main thread, indirect optimization of background threads, and global optimization. Along the way, we will also briefly introduce some of the better solutions in the industry.

Optimization case analysis

1. Main thread direct optimization

For the optimization of the main thread, we will introduce it in the order of the life cycle.

1.1 MultiDex Optimization

First, let's look at the first stage: Application's attachBaseContext. Because the Application Context is still being assigned at this point, this stage generally contains little business code and is not expected to take much time. However, during actual testing we found that on some models, the first launch after installation took a very long time. Preliminary investigation showed that the bulk of the time was spent in MultiDex.install.

Detailed analysis showed that the issue is concentrated on Android 4.x devices, and affects the first startup after installation as well as after each subsequent update.

The root cause lies in a limitation of the dex instruction format: a single dex file can reference at most 65536 Java methods. When the method count exceeds 65536, the app is split into multiple dex files. The Dalvik virtual machine can generally only execute optimized odex files, and to speed up installation on 4.x devices, the system optimizes only the first dex during installation. The remaining dex files are optimized on the first call to MultiDex.install, and this optimization is very time-consuming, which is what makes the first startup on 4.x devices so slow.

Several conditions must all hold for this problem to occur: the app is split into multiple dex files, only the first dex is optimized during installation, MultiDex.install is called during startup, and the Dalvik virtual machine requires odex files to be loaded.

Obviously, we cannot break the first two conditions: for an app the size of TikTok it is impractical to shrink down to a single dex, and we cannot change the system installation process. The condition of calling MultiDex.install during startup is also hard to break: first, as the business grows, a single dex cannot carry all the startup code; second, even if it could, keeping it that way would be hard to maintain.

Therefore, we chose to break the constraint that "the Dalvik virtual machine needs to load odex", that is, to bypass Dalvik's restriction and load unoptimized dex directly. The core of this solution is the native function Dalvik_dalvik_system_DexFile_openDexFile_bytearray, which supports loading unoptimized dex files. The optimization plan is as follows:

  • First, decompress the APK to get the bytecode of the original non-first dex files;
  • Call Dalvik_dalvik_system_DexFile_openDexFile_bytearray, passing in the dex bytecodes obtained from the APK one by one, to complete dex loading and obtain valid DexFile objects;
  • Add all the DexFiles to the DexPathList of the app's PathClassLoader;
  • Defer odex optimization of the non-first dex files to an asynchronous task (a simplified sketch follows this list).
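The following is a minimal sketch of this flow on a Dalvik (4.x) device. The hidden DexFile.openDexFile(byte[]) method (backed by Dalvik_dalvik_system_DexFile_openDexFile_bytearray), the mCookie field, and the DexPathList injection are all non-public internals; the helpers readEntryFromApk, allocateInstance, and appendToDexPathList are hypothetical placeholders, and all error handling is omitted.

 // Sketch only: relies on hidden Dalvik 4.x internals; helper methods are hypothetical.
 byte[] dexBytes = readEntryFromApk(apkFile, "classes2.dex");

 // 1. Load the raw, unoptimized dex bytes; returns an int cookie on Dalvik.
 Method openDexFile = DexFile.class.getDeclaredMethod("openDexFile", byte[].class);
 openDexFile.setAccessible(true);
 int cookie = (Integer) openDexFile.invoke(null, (Object) dexBytes);

 // 2. Build a usable DexFile object and attach the cookie to it.
 DexFile dexFile = allocateInstance(DexFile.class); // bypasses the file-based constructor
 Field cookieField = DexFile.class.getDeclaredField("mCookie");
 cookieField.setAccessible(true);
 cookieField.setInt(dexFile, cookie);

 // 3. Wrap the DexFile in an Element and append it to the PathClassLoader's
 //    DexPathList.dexElements via reflection.
 appendToDexPathList(getClassLoader(), dexFile);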

For more details on MultiDex optimization, please refer to our earlier public-account article. The solution is now open source; for details, see the project on GitHub (https://github.com/bytedance/BoostMultiDex).

1.2 ContentProvider Optimization

Next, we will introduce the optimization of ContentProvider. Among the four major Android components, ContentProvider is unique in its lifecycle: Activity, Service, and BroadcastReceiver are instantiated and run their lifecycles only when invoked, whereas a ContentProvider is automatically instantiated and runs its lifecycle during startup even if it is never called. During process initialization, after Application's attachBaseContext is called, the installContentProviders method is executed to install all ContentProviders of the current process.

This process instantiates all of the current process's ContentProviders one by one in a for loop, calls their attachInfo and onCreate lifecycle methods, and finally publishes the ContentProviderHolders associated with them to the AMS process in one batch.

Because ContentProviders are initialized automatically during process startup, they are used not only as a cross-process communication component but also by some modules as an automatic initialization hook. The most typical example is the official Lifecycle component, which is initialized via a ContentProvider called ProcessLifecycleOwnerInitializer.

The initialization of Lifecycle only registers the Activity's LifecycleCallbacks, which takes little time, so we do not need much optimization at the logical level. It is worth noting, though, that if many ContentProviders are used for initialization, the creation and lifecycle execution of the providers themselves becomes expensive. To address this, we can use the App Startup library provided by Jetpack to aggregate multiple initialization ContentProviders into a single one.
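As a sketch of the aggregation approach, each SDK exposes an androidx.startup Initializer and is created inside the library's single InitializationProvider; the CrashSdk below is a hypothetical stand-in for any SDK that would otherwise register its own provider.

 import android.content.Context;
 import androidx.annotation.NonNull;
 import androidx.startup.Initializer;
 import java.util.Collections;
 import java.util.List;

 public final class CrashSdkInitializer implements Initializer<CrashSdk> {
     @NonNull
     @Override
     public CrashSdk create(@NonNull Context context) {
         // Runs inside androidx.startup's single InitializationProvider, so this
         // SDK no longer needs a ContentProvider of its own in the manifest.
         return CrashSdk.init(context);
     }

     @NonNull
     @Override
     public List<Class<? extends Initializer<?>>> dependencies() {
         return Collections.emptyList(); // declare other Initializers here to order them
     }
 }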

ProcessLifecycleOwnerInitializer, shown below, illustrates how little such a provider actually does. Besides this kind of near-free ContentProvider, during actual optimization we also found some ContentProviders that take a long time. The following is a brief introduction to our optimization ideas.

 public class ProcessLifecycleOwnerInitializer extends ContentProvider {
     @Override
     public boolean onCreate() {
         LifecycleDispatcher.init(getContext());
         ProcessLifecycleOwner.init(getContext());
         return true;
     }
 }

For our own ContentProviders, if initialization is time-consuming, we can refactor automatic initialization into on-demand initialization. For third-party or even official ContentProviders, direct refactoring is not possible. Here we take the official FileProvider as an example to introduce our optimization ideas.

FileProvider Usage

FileProvider is the component used for file access control since Android 7.0. Before FileProvider, cross-process file operations such as taking photos passed the file Uri directly. With FileProvider, the overall flow becomes:

  • First, subclass FileProvider to implement a custom FileProvider, register this provider in the manifest, and associate a file-path XML file with it through the FILE_PROVIDER_PATHS meta-data;
  • Usage: call FileProvider's getUriForFile method to convert a file path into a content Uri, then call the ContentProvider methods such as query and openFile;
  • When the FileProvider is called, it first checks whether the file path falls within the XML defined in step 1; only if the check passes does the subsequent logic execute.

Time consumption analysis

Going by the flow above, as long as we do not call FileProvider during startup, there should be no FileProvider-related cost. In reality, the startup trace shows FileProvider-related time in our startup phase, specifically in FileProvider's attachInfo lifecycle method. Besides calling the familiar onCreate, attachInfo also calls getPathStrategy, and that is where our time is concentrated.

Implementation-wise, getPathStrategy mainly parses the XML file associated with the FileProvider and stores the result in the mStrategy field. Further analysis shows that mStrategy is used for file path verification in FileProvider's query, getType, openFile, and other interfaces, none of which are called during startup. The getPathStrategy call in FileProvider's attachInfo is therefore entirely unnecessary at that point; we can defer the getPathStrategy logic until query, getType, openFile, and the like are actually called.

Optimization plan

FileProvider is androidx code: we cannot modify it directly, but it does participate in our compilation, so we can alter its implementation by rewriting bytecode at compile time. The specific plan is:

  • Instrument the attachInfo method of the ContentProvider. Before invoking the original implementation, set grantUriPermissions of the ProviderInfo parameter to false; then call the original implementation and catch the resulting exception; after the call completes, set grantUriPermissions back to true. The grantUriPermissions check is what we exploit to bypass getPathStrategy. (We do not exploit the exported check of ProviderInfo for this purpose because the exported attribute is cached in the super implementation of attachInfo.)
 public void attachInfo(@NonNull Context context, @NonNull ProviderInfo info) {
     super.attachInfo(context, info);

     // Sanity check our security
     if (info.exported) {
         throw new SecurityException("Provider must not be exported");
     }
     if (!info.grantUriPermissions) {
         throw new SecurityException("Provider must grant uri permissions");
     }

     mStrategy = getPathStrategy(context, info.authority);
 }
  • Instrument FileProvider's query, getType, openFile, and other methods so that getPathStrategy is initialized before the original method runs, and then the original implementation is invoked.
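The source-level equivalent of what the bytecode transform produces might look like the sketch below; originalAttachInfo, originalQuery, and ensurePathStrategy are hypothetical stand-ins for the renamed originals and the deferred getPathStrategy logic.

 public void attachInfo(Context context, ProviderInfo info) {
     boolean saved = info.grantUriPermissions;
     info.grantUriPermissions = false; // trips the check before getPathStrategy runs
     try {
         originalAttachInfo(context, info); // the untouched FileProvider logic
     } catch (SecurityException expected) {
         // "Provider must grant uri permissions" is thrown by design here.
     } finally {
         info.grantUriPermissions = saved;
     }
 }

 public Cursor query(Uri uri, String[] projection, String selection,
         String[] selectionArgs, String sortOrder) {
     ensurePathStrategy(); // lazily run the deferred getPathStrategy logic
     return originalQuery(uri, projection, selection, selectionArgs, sortOrder);
 }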

Although a single FileProvider does not take much time, large apps may register multiple FileProviders for module decoupling, in which case the benefit of this optimization is considerable. Similar to FileProvider, the WorkManager provided by Google also initializes through a ContentProvider, and it can be optimized in the same way.

1.3 Startup task reconstruction and task scheduling

The third stage of startup is Application's onCreate, the peak period of startup task execution. Optimization here targets the various startup tasks and is strongly business-specific; below is a brief introduction to our general approach.

The core idea of Douyin's startup task optimization is to maximize code value and resource utilization. Maximizing code value means deciding which tasks deserve to run during startup, with the goal of removing from the startup phase any task that should not be there. Maximizing resource utilization means, once the set of startup tasks is fixed, making the fullest possible use of system resources to shorten their execution. For a single task, we optimize its internal implementation to reduce its own resource consumption and leave more resources for other tasks; across multiple tasks, we make full use of system resources through reasonable scheduling.

From the implementation perspective, we mainly focus on two things: startup task refactoring and task scheduling.

Startup Task Refactoring

Because the business is highly complex and control over startup tasks was relatively loose in the early days, Douyin had more than 300 tasks in its startup phase. At that scale, scheduling the startup tasks can improve startup speed to a degree, but it is hard to reach the next level that way. A very important direction in startup optimization is therefore reducing the startup tasks themselves.

To this end, we divide startup tasks into three categories: configuration tasks, preloading tasks, and functional tasks. Configuration tasks mainly initialize various SDKs, which cannot work until they have run; preloading tasks warm up certain later features to speed up their execution; functional tasks are feature-related tasks executed during the process startup lifecycle. We transform each category differently:

  • Configuration tasks: our ultimate goal is to remove them from the startup phase, for two reasons. First, some configuration tasks are genuinely time-consuming, and removing them speeds up startup directly. Second, the associated SDK cannot be used until its configuration task has run, which constrains functional availability, stability, and scheduling during optimization. To remove them, we made configuration tasks atomic and switched to injecting context, callbacks, and other parameters into the SDK via SPI (service discovery). For Douyin's own code, when context, callbacks, or other parameters are needed, they are requested from the application layer through SPI. For third-party SDKs whose code we cannot modify, we wrap them in an intermediate layer; all subsequent use goes through this layer, and the SDK's configuration task runs when the layer's interfaces are first called. This removes configuration tasks from the startup phase and executes them on demand (a sketch of this middle-layer pattern follows this list).
  • Preloading tasks: we standardized preloading tasks to guarantee functional correctness when they are downgraded, and removed expired preloading tasks and redundant logic inside them to raise their value.
  • Functional tasks: we decompose functional startup tasks into finer granularity and slim them down, remove logic that is non-essential during startup, and add scheduling and downgrade capabilities to them for later use.
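A minimal sketch of the middle-layer pattern for a third-party SDK: FooSdk, FooSdkProxy, and the Spi lookup below are hypothetical names used only to illustrate on-demand configuration.

 // Hypothetical middle layer: the SDK is configured on first use, not at startup.
 public final class FooSdkProxy {
     private static volatile boolean sInitialized;

     private static void ensureInit() {
         if (!sInitialized) {
             synchronized (FooSdkProxy.class) {
                 if (!sInitialized) {
                     // Pull context/callback from the app layer via SPI instead of
                     // pushing them in from a startup task.
                     Context context = Spi.load(AppContextService.class).getContext();
                     FooSdk.init(context);
                     sInitialized = true;
                 }
             }
         }
     }

     public static void report(String event) {
         ensureInit(); // the configuration task runs here, on demand
         FooSdk.report(event);
     }
 }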

Task Scheduling

Task scheduling has been covered extensively in the industry, so we will skip task dependency analysis, task orchestration, and the like, and focus on some of Douyin's arguably novel practices:

  • Scheduling based on landing pages: besides entering the home page, Douyin has other landing pages such as authorization login and push activation, and these differ considerably in which tasks they need. In the Application stage, we can discover the target page about to be launched by reflectively inspecting the messages in the main thread's message queue, and schedule tasks specifically for that landing page;
  • Scheduling based on device performance: collect various device performance data, score and normalize devices on the server, and deliver the normalized result to the client, which schedules tasks according to the performance tier;
  • Scheduling based on feature activity: collect statistics on each user's feature usage, compute per-feature activity for the user, and deliver the data to the client, which schedules accordingly;
  • Scheduling based on on-device intelligence: use on-device machine intelligence to predict the user's next actions and warm up the corresponding features;
  • Startup feature downgrade: on low-end devices or for certain users, downgrade startup tasks and features by postponing them until after startup, or even skipping them entirely, to protect the overall experience.

1.4 Activity Phase Optimization

The previous stages all belong to the Application stage. Next, let's look at the related optimizations in the Activity stage. In this stage, we will introduce two typical examples: merging Splash and Main and deserialization optimization.

1.4.1 Splash and Main merge

First, let's look at merging SplashActivity and MainActivity. In earlier versions, Douyin's launcher activity was SplashActivity, which mainly carried splash-screen logic such as ads and campaigns. In general, the startup flow was:

  • Enter SplashActivity and determine whether there is a splash screen to be displayed in SplashActivity;
  • If there is a splash screen to be displayed, display it, wait for the splash screen display to end and then jump to MainActivity. If there is no splash screen, jump directly to MainActivity.

In this process, our startup needs to go through the startup of two Activities. If these two Activities are merged, we can gain two benefits:

  • One fewer Activity startup process;
  • Use the time spent reading splash information for concurrent, Activity-related work such as asynchronous View preloading.

To merge Splash and Main, we need to solve two main problems:

  • How to keep external redirection via the old Activity name working after the merge;
  • How to solve the launchMode and multiple-instance problems.

The first problem is relatively easy to solve: we can use activity-alias + targetActivity to point SplashActivity at MainActivity. Now for the second problem.

The launchMode Problem

Before the merge, the launchModes of SplashActivity and MainActivity were standard and singleTask respectively. This guaranteed a single MainActivity instance, and when the user left the app via the Home key and re-entered, they returned to the previous page.

After merging SplashActivity into MainActivity, the launcher Activity becomes MainActivity. If we kept singleTask as its launchMode, then after leaving a secondary page via Home and tapping the icon again, the user would land on the Main page instead of returning to the secondary page. So after the merge, MainActivity can no longer use singleTask; after investigation we settled on singleTop as the launchMode.

Multiple instance issues

1. Issues with starting multiple instances internally

Although singleTop solves the problem of not returning to the previous page after Home and re-entry, it introduces multiple MainActivity instances. Some of Douyin's logic is strongly tied to MainActivity's lifecycle and breaks when several instances exist; multiple MainActivities also incur unnecessary resource overhead. Neither is acceptable, so we set out to solve this.

Our solution is to add the FLAG_ACTIVITY_NEW_TASK and FLAG_ACTIVITY_CLEAR_TOP flags to every Intent in the app that starts MainActivity, achieving a clear-top behavior similar to singleTask.

With FLAG_ACTIVITY_NEW_TASK + FLAG_ACTIVITY_CLEAR_TOP, internal launches of duplicate MainActivity instances are basically solved. During testing, however, we found that on some systems the multiple-instance problem persisted even with clear-top in effect.

Analysis showed that on those systems, even though SplashActivity points to MainActivity via activity-alias + targetActivity, the AMS side still records the started Activity as SplashActivity. When MainActivity is started again later, AMS believes no MainActivity existed before and starts another one.

Our fix is to modify the Component of the Intent that starts MainActivity, changing it from MainActivity to SplashActivity. This completely solves the multiple-instance problem for internal launches of MainActivity.

To intrude on the business as little as possible and to keep future iterations from reintroducing the problem, we instrumented calls to Context#startActivity: for calls that start MainActivity, the flags are added and the Component is replaced before the original implementation is invoked. We chose instrumentation because Douyin's code structure is complex, with multiple base Activity classes, some of whose code cannot be modified directly. Projects without that constraint can simply override startActivity in their base Activity and Application classes.
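The injected call site is roughly equivalent to the sketch below; MAIN_ACTIVITY and SPLASH_ALIAS are hypothetical constants holding the two class names.

 // What the instrumented Context#startActivity call site effectively does.
 public static void startActivityCompat(Context context, Intent intent) {
     ComponentName cn = intent.getComponent();
     if (cn != null && MAIN_ACTIVITY.equals(cn.getClassName())) {
         // Route through the alias so AMS matches the task record it created at launch,
         // and clear-top so an existing instance is reused instead of duplicated.
         intent.setComponent(new ComponentName(context, SPLASH_ALIAS));
         intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK | Intent.FLAG_ACTIVITY_CLEAR_TOP);
     }
     context.startActivity(intent);
 }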

2. Issues with external multi-instance startup

The solutions above rely on modifying the Intent before the Activity is started, which obviously cannot help when MainActivity is started from outside the app. Is there another way to handle externally triggered multiple instances?

Let's return to why we want to avoid multiple MainActivity instances in the first place: to prevent several MainActivity objects from existing at the same time and running lifecycle logic in unexpected ways. So as long as we guarantee that at most one MainActivity object exists at any time, the problem is solved.

To avoid several MainActivity objects coexisting, we first need to know whether one currently exists. That is simple enough: monitor the Activity lifecycle and increment or decrement an instance counter in MainActivity's onCreate and onDestroy; when the count is 0, no MainActivity object exists.

Having solved the counting problem, we need to keep the number of simultaneous MainActivity objects at no more than one. For that, let's review the Activity startup flow: starting an Activity first goes through AMS; AMS then calls into the Activity's process; the process posts to the main thread via the main thread's Handler, creates the Activity object through Instrumentation, and runs the subsequent lifecycle. For external launches of MainActivity, we can intervene in the part after control returns from AMS to the process; we chose Instrumentation's newActivity as the entry point.

Specifically, our optimization plan is as follows:

  • Subclass Instrumentation to implement a custom Instrumentation class, overriding all of its methods in proxy-forwarding style;
  • Obtain the Instrumentation object of ActivityThread via reflection, use it as the constructor parameter of the custom Instrumentation, and replace ActivityThread's original Instrumentation with the custom one, again via reflection;
  • In the custom Instrumentation's newActivity, check whether the Activity about to be created is MainActivity. If it is not, or no MainActivity object currently exists, call the original implementation; otherwise, replace the className parameter to point at an empty placeholder Activity;
  • In the placeholder Activity's onCreate, finish it and start SplashActivity with an Intent carrying the FLAG_ACTIVITY_NEW_TASK and FLAG_ACTIVITY_CLEAR_TOP flags.

Note that the Instrumentation hook here can be replaced with AppComponentFactory#instantiateActivity on newer Android versions.
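A condensed sketch of the hook (currentActivityThread and mInstrumentation are hidden ActivityThread members whose names come from AOSP and may vary across versions; MAIN_ACTIVITY, MainActivityCounter, and StubActivity are hypothetical):

 public class MainActivityGuardInstrumentation extends Instrumentation {
     private final Instrumentation base;

     public MainActivityGuardInstrumentation(Instrumentation base) {
         this.base = base;
     }

     @Override
     public Activity newActivity(ClassLoader cl, String className, Intent intent)
             throws InstantiationException, IllegalAccessException, ClassNotFoundException {
         if (MAIN_ACTIVITY.equals(className) && MainActivityCounter.exists()) {
             // Redirect the duplicate external launch to a placeholder Activity,
             // which finishes itself and clear-top-starts SplashActivity.
             className = StubActivity.class.getName();
         }
         return base.newActivity(cl, className, intent);
     }

     public static void install() throws Exception {
         Class<?> atClass = Class.forName("android.app.ActivityThread");
         Object thread = atClass.getMethod("currentActivityThread").invoke(null);
         Field field = atClass.getDeclaredField("mInstrumentation");
         field.setAccessible(true);
         Instrumentation origin = (Instrumentation) field.get(thread);
         field.set(thread, new MainActivityGuardInstrumentation(origin));
     }
 }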

1.4.2 Deserialization Optimization

Another typical optimization in Douyin's Activity stage is deserialization. As Douyin runs, some data is serialized to disk, and during startup it must be deserialized, which slows startup. Previously we optimized the blocking logic case by case at the business level, for example by making it asynchronous or using snapshots, with good results; but that approach is hard to maintain, and regressions kept creeping in during iteration. So we turned to optimizing deserialization itself.

Douyin's startup-phase deserialization problem is specifically the cost of Gson parsing. Gson is a JSON library from Google with low integration cost, convenient usage, and good extensibility, but it has one conspicuous weakness: the first parse of a given model is slow, and the cost grows with the model's complexity.

Gson's first-parse cost stems from its design. A central role in Gson parsing is played by the TypeAdapter: for each Class to be parsed, Gson first creates a TypeAdapter for it and then parses with that adapter. Gson's default path uses the TypeAdapter created by ReflectiveTypeAdapterFactory, whose creation and parsing involve heavy reflection. The process is:

  • First, obtain all Fields of the target class via reflection and read their annotations one by one to build a map from serialized name to Field;
  • During parsing, look up the Field in that map by the serialized name just read, parse the value according to the Field's type, and assign it via reflection.

The core of optimizing Gson's parse time is therefore reducing reflection. Below are some of the solutions used in Douyin.

Custom TypeAdapter Optimization

Reading Gson's source, we know that parsing uses a chain of responsibility: if a TypeAdapterFactory earlier in the chain can handle a Class, ReflectiveTypeAdapterFactory is never reached, and the Gson framework supports injecting custom TypeAdapterFactories. So one of our solutions is to inject a custom TypeAdapterFactory that optimizes the parsing process.

This custom TypeAdapterFactory generates, at compile time, a custom TypeAdapter for each Class to be optimized. The TypeAdapter contains generated parsing code for each field of the Class, avoiding reflection.

For the bytecode processing involved in generating the custom TypeAdapters, we used ByteX (https://github.com/bytedance/ByteX/blob/master/README_zh.md), an open-source bytecode framework from the Douyin team. The implementation steps are:

  • Configure the Classes to be optimized: during development, whitelist the Classes to optimize via annotations and configuration files;
  • Collect information about those Classes: once compilation starts, read the Classes listed in the configuration file; in the traverse phase over all project classes, use ASM's ClassVisitor to pick up the annotation-configured Classes, and collect every field of each Class to optimize through ClassVisitor's visitField;
  • Generate the custom TypeAdapters and TypeAdapterFactory: in the transform phase, use the collected Class and Field information to generate the custom TypeAdapter classes, plus a custom TypeAdapterFactory that creates them:
 public class GsonOptTypeAdapterFactory extends BaseAdapterFactory {

     protected BaseTypeAdapter createTypeAdapter(String var1) {
         switch (var1.hashCode()) {
             case -1939156288:
                 if (var1.equals("xxx/xxx/gsonopt/model/Model1")) {
                     return new TypeAdapterForModel1(this.gson);
                 }
                 break;
             case -1914731121:
                 if (var1.equals("xxx/xxx/gsonopt/model/Model2")) {
                     return new TypeAdapterForModel2(this.gson);
                 }
                 break;
         }
         return null;
     }
 }

 public class TypeAdapterForModel1 extends BaseTypeAdapter {

     protected boolean setFieldValue(String var1, Object var2, JsonReader var3) {
         Object var4;
         switch (var1.hashCode()) {
             case 110371416:
                 if (var1.equals("field1")) {
                     var4 = this.gson.getAdapter(String.class).read(var3);
                     ((Model1) var2).field1 = (String) var4;
                     return true;
                 }
                 break;
             case 1223751172:
                 if (var1.equals("field2")) {
                     var4 = this.gson.getAdapter(String.class).read(var3);
                     ((Model1) var2).field2 = (String) var4;
                     return true;
                 }
                 break;
         }
         return false;
     }
 }
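Wiring the generated factory in is then a one-liner; factories registered through registerTypeAdapterFactory take precedence over Gson's built-in reflective factory, so the generated adapters are consulted first:

 Gson gson = new GsonBuilder()
         .registerTypeAdapterFactory(new GsonOptTypeAdapterFactory())
         .create();
 // The first parse of Model1 now runs the generated code instead of reflection.
 Model1 model = gson.fromJson(json, Model1.class);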

Optimize ReflectiveTypeAdapterFactory implementation

The custom TypeAdapter approach above cuts Gson's first-parse time by about 70%, but it generates parsing code at compile time, which increases package size, so it has its limits. We therefore also tried optimizing the Gson framework itself; to keep integration cost low, we modified the implementation of ReflectiveTypeAdapterFactory through bytecode rewriting.

The original ReflectiveTypeAdapterFactory reflects over all of a Class's field information before parsing any actual data, yet not every field appears in the actual input. Take the Person class below: before parsing Person, the structures of Person, Hometown, and Job are all resolved, but the actual input may contain nothing but a name, in which case resolving Hometown and Job is pure waste. If the Hometown and Job classes are complex, the unnecessary overhead grows further.

 class Person {
     @SerializedName(value = "name", alternate = {"nickname"})
     private String name;
     private Hometown hometown;
     private Job job;
 }

 class Hometown {
     private String name;
     private int code;
 }

 class Job {
     private String company;
     private int type;
 }

 // Actual input
 {
     "name": "Zhang San"
 }

Our answer to this situation is "on-demand parsing". Taking Person as an example: when resolving Person's Class structure, the basic-type field name is handled normally, while for the complex-type fields hometown and job we only record their Class types and return a wrapper TypeAdapter. When actual data is parsed, only if hometown or job nodes are really present do we resolve the Class structures of Hometown and Job. This is especially effective when the Class structure is complex but many nodes are absent from the actual data; in some Douyin scenarios the improvement approaches 80%.

Other optimization solutions

The above are two typical solutions. In Douyin's optimization practice we tried others as well, which delivered good results in specific scenarios. For reference:

  • Unify Gson objects: Gson caches the TypeAdapter of each parsed Class, but the cache is per Gson object and is not shared between instances. Unifying the Gson objects lets TypeAdapters be reused;
  • Pre-create TypeAdapters: in scenarios with spare concurrency, create the TypeAdapters of the relevant Classes ahead of time on an async thread, so parsing can use them directly;
  • Use other protocols: for local data serialization and deserialization, we tried binary sequential storage, cutting deserialization time by 95%. Concretely we used Android's native Parcel mechanism, and where data is incompatible across versions we fall back to the version-tolerant Gson path through version control.
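A sketch of the version-gated Parcel scheme as described above; the Person fields and the CACHE_VERSION constant are illustrative, and the caller is assumed to fall back to the Gson path when null is returned:

 static final int CACHE_VERSION = 1; // bump whenever the field layout changes

 public static byte[] serialize(Person p) {
     Parcel parcel = Parcel.obtain();
     try {
         parcel.writeInt(CACHE_VERSION);
         parcel.writeString(p.name); // fields are written/read in a fixed order
         return parcel.marshall();
     } finally {
         parcel.recycle();
     }
 }

 public static Person deserialize(byte[] bytes) {
     Parcel parcel = Parcel.obtain();
     try {
         parcel.unmarshall(bytes, 0, bytes.length);
         parcel.setDataPosition(0);
         if (parcel.readInt() != CACHE_VERSION) {
             return null; // caller falls back to the version-tolerant Gson path
         }
         Person p = new Person();
         p.name = parcel.readString();
         return p;
     } finally {
         parcel.recycle();
     }
 }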

1.5 UI rendering optimization

After introducing the optimization of the Activity stage, let's take a look at the related optimization of the UI rendering stage. In this stage, we will introduce the related optimization of View loading.

Generally speaking, there are two ways to create a View: build it directly in code, or load an XML file with LayoutInflater. Here we focus on optimizing LayoutInflater's XML loading, which involves three steps:

  • Parse the XML file into an in-memory XmlResourceParser (an I/O-bound step);
  • Look up the Class by the XmlResourceParser tag name (a Java reflection step);
  • Create the View instances and assemble the final View tree.

All three steps add up. At the business level we can flatten the XML hierarchy, load on demand with ViewStub, and so on; these measures reduce XML loading time to a degree.

Here we introduce another fairly general solution: asynchronous preloading. Take a Fragment's root view as an example: it is inflated during the measure phase of UI rendering, and there is a window of time between application startup and measure. We can use that window to load these views into memory ahead of time on a background thread, then fetch them straight from memory during measure.
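A minimal sketch of such a preload cache (illustrative only; a production version must also deal with the lock and LayoutParams issues discussed below):

 public final class ViewPreloader {
     private final SparseArray<View> cache = new SparseArray<>();

     public void preload(final Context appContext, final int layoutId) {
         new Thread(new Runnable() {
             @Override
             public void run() {
                 View v = LayoutInflater.from(appContext).inflate(layoutId, null);
                 synchronized (cache) {
                     cache.put(layoutId, v);
                 }
             }
         }, "view-preload").start();
     }

     public View obtain(LayoutInflater inflater, int layoutId) {
         synchronized (cache) {
             View v = cache.get(layoutId);
             if (v != null) {
                 cache.remove(layoutId);
                 return v; // hit: skip inflation on the main thread
             }
         }
         return inflater.inflate(layoutId, null); // miss: inflate on demand
     }
 }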

x2c solves the lock problem

androidx already provides AsyncLayoutInflater for asynchronous XML loading, but in practice using it directly is prone to lock contention and can even make things slower.

Analysis shows that LayoutInflater holds an object lock, and even if we sidestep it by using separate LayoutInflater instances, further locks remain in AssetManager and the native layer. Our answer is xml2code: at compile time, generate View-construction code for annotated XML files, then pre-create the Views asynchronously. The x2c solution not only removes the multi-threaded lock problem but also makes View pre-creation faster. It is still being polished and will be covered in detail once that is done.

LayoutParams problem

Besides multi-threaded locking, asynchronous inflate has a second problem: LayoutParams.

LayoutInflater depends on the root parameter to produce the inflated view's LayoutParams: when root is non-null, inflate builds a root-appropriate LayoutParams for the view and sets it. During asynchronous inflate, however, the real root layout is not available. If we pass root as null, the inflated view's LayoutParams is null, and when the view is later added to its parent a default is used, losing the layout_* attributes declared in XML. The fix is to construct a new root of the corresponding type at preload time so the inflated view's attributes are parsed correctly.

 public View inflate(XmlPullParser parser, @Nullable ViewGroup root, boolean attachToRoot) {
     // Omit other logic
     if (root != null) {
         // Create layout params that match root, if supplied
         params = root.generateLayoutParams(attrs);
         if (!attachToRoot) {
             // Set the layout params for temp if we are not
             // attaching. (If we are, we use addView, below)
             temp.setLayoutParams(params);
         }
     }
 }

 public void addView(View child, int index) {
     LayoutParams params = child.getLayoutParams();
     if (params == null) {
         params = generateDefaultLayoutParams();
         if (params == null) {
             throw new IllegalArgumentException("generateDefaultLayoutParams() cannot return null");
         }
     }
     addView(child, index, params);
 }
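Concretely, the preload call passes a throwaway root of the same type as the eventual parent (assumed here to be a FrameLayout) with attachToRoot set to false, so the XML's layout_* attributes land in a real LayoutParams:

 FrameLayout fakeRoot = new FrameLayout(appContext); // same type as the real parent
 View preloaded = inflater.inflate(R.layout.fragment_root, fakeRoot, false);
 // preloaded.getLayoutParams() is now a FrameLayout.LayoutParams built from the XML,
 // so nothing is lost when the view is later added to the real parent.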

Other issues

Besides the multi-threaded lock and LayoutParams problems above, we hit a few more issues during preloading:

  • Inflate thread priority: background threads usually run at low priority, so an asynchronous inflate may not finish in time and can even end up slower than no preloading at all. It is advisable to raise the priority of the async inflate thread appropriately;
  • Handler problem: some custom Views create Handlers when constructed. We need to modify that code to explicitly pass the main thread's Looper;
  • Thread requirements: animations in custom Views typically verify on start that they are running on the UI thread. We need to modify the business code and move such logic to the moment the view is actually added to the view tree;
  • Scenarios requiring an Activity context: one option is to preload asynchronously only after the Activity has started, which avoids special context handling but shrinks the concurrency window; the other is to preload with the Application context during the Application stage and, before adding the view to the tree, swap its context for the Activity's, satisfying Activity-context requirements in scenarios such as showing Dialogs or using LiveData (see the sketch below).
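For the context swap, one common technique (not necessarily Douyin's exact implementation) is to inflate against a MutableContextWrapper and rebind it later:

 // Preload on a background thread with the Application context wrapped.
 MutableContextWrapper wrapper = new MutableContextWrapper(appContext);
 View preloaded = LayoutInflater.from(wrapper).inflate(R.layout.fragment_root, fakeRoot, false);

 // Later, on the main thread, once the Activity exists:
 ((MutableContextWrapper) preloaded.getContext()).setBaseContext(activity);
 parent.addView(preloaded);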

1.6 Time-consuming message optimization for main thread

Above we have covered the major main-thread lifecycle phases. In Douyin's actual optimization we found that time-consuming main-thread messages posted between these phases also hurt startup speed, for example between Application and Activity, or between Activity and UI rendering: they delay the subsequent lifecycle and slow startup, so they need optimizing too.

1.6.1 Main thread message scheduling

Messages posted by our own code are relatively easy to optimize; some, however, come from the internals of third-party SDKs and are hard to change, and even for easily optimized messages, guarding against later regressions is expensive. We therefore attacked the problem from another angle: while still optimizing the messages our code posts, we adjust the main thread's message queue so that startup-related messages execute first.

The core principle is to determine the critical startup path from the app's startup flow and use queue adjustment to make sure the relevant messages are scheduled first during cold start, improving startup speed. Specifically:

  • Create a custom Printer, install it via Looper's setMessageLogging interface, and forward to the original Printer;
  • Update the next expected message in Application's onCreate and MainActivity's onResume: after Application onCreate, the expected message is the Activity launch; after MainActivity onResume, it is the doFrame message that drives rendering. To narrow the scope of impact, message scheduling is disabled once startup completes or an abnormal path is taken;
  • The actual scheduling happens in the custom Printer's println: traverse the main thread's message queue, identify the target message by message.what and message.getTarget(), and if present, move it to the head of the queue so it executes first.
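A heavily simplified sketch of the promotion step (mMessages and Message.next are hidden fields; isExpectedStartupMessage is a hypothetical matcher; production code needs version guards and a switch-off once startup completes):

 static void promoteTargetMessage() throws Exception {
     MessageQueue queue = Looper.getMainLooper().getQueue();
     Field messagesField = MessageQueue.class.getDeclaredField("mMessages");
     Field nextField = Message.class.getDeclaredField("next");
     messagesField.setAccessible(true);
     nextField.setAccessible(true);
     synchronized (queue) {
         Message prev = null;
         Message m = (Message) messagesField.get(queue);
         while (m != null) {
             if (isExpectedStartupMessage(m)) { // match on what + getTarget()
                 if (prev != null) {            // unlink, then relink at the head
                     nextField.set(prev, nextField.get(m));
                     nextField.set(m, messagesField.get(queue));
                     messagesField.set(queue, m);
                 }
                 return;
             }
             prev = m;
             m = (Message) nextField.get(m);
         }
     }
 }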

1.6.2 Main thread time-consuming message optimization

Main thread message scheduling mitigates the impact of main-thread messages on startup speed, but it has limitations:

  • Only messages already in the queue can be promoted. Suppose a time-consuming main-thread message sits after MainActivity's onResume while the doFrame message has not yet entered the queue: the time-consuming message still runs before doFrame, and startup speed still suffers;
  • It treats the symptom, not the cause: the time-consuming message is merely pushed out of the startup window, and it will still cause jank right after startup.

Based on these two reasons, we need to optimize the time-consuming messages of the main thread in the startup stage.

Generally, most time-consuming main-thread messages are highly business-specific; the problem logic can be found directly from the main-thread stacks output by the trace tool and optimized case by case. Here we focus on a case other products are likely to hit as well: main-thread time consumed by WebView initialization.

During optimization we found a time-consuming main-thread message whose top frame is WebViewChromiumAwInit.startChromiumLocked, inside the system WebView. Reading the WebView code shows it is posted to the main thread from WebViewChromiumAwInit's ensureChromiumStartedLocked and executes once per process lifetime. Whether the trigger is called on the main thread or a child thread, it ultimately posts to the main thread, so changing the calling thread cannot fix the jank; and since it is system code, we cannot change its implementation either. All that remains is to see whether the business layer's usage can be optimized.

 void ensureChromiumStartedLocked(boolean onMainThread) {
     // Omit other logic
     // We must post to the UI thread to cover the case that the user has invoked Chromium
     // startup by using the (thread-safe) CookieManager rather than creating a WebView.
     PostTask.postTask(UiThreadTaskTraits.DEFAULT, new Runnable() {
         @Override
         public void run() {
             synchronized (mLock) {
                 startChromiumLocked();
             }
         }
     });
     while (!mStarted) {
         try {
             // Important: wait() releases |mLock| so the UI thread can take it :-)
             mLock.wait();
         } catch (InterruptedException e) {
         }
     }
 }

Problem location


Subsequent analysis showed that the first call to any of several WebViewFactoryProvider interface methods, such as getStatics, getGeolocationPermissions, and createWebView, triggers WebViewChromiumAwInit's ensureChromiumStartedLocked to post the time-consuming message to the main thread. So the problem becomes locating the callers of the WebViewFactoryProvider methods.

One way to locate them is instrumentation. Since WebViewFactoryProvider is not a class the application can access directly, every call to it must go through other framework code. We would have to analyze all the framework call sites of WebViewFactoryProvider and then instrument every call into those entry points in the app to log stacks. That is obviously costly and easy to get wrong by omission.

In fact there is a more convenient route for WebViewFactoryProvider. From the earlier analysis we know it is an interface whose instance the framework obtains reflectively, so we can generate a WebViewFactoryProvider object via dynamic proxy, swap it into WebViewFactory, filter by method name inside the proxy, and dump the call stack for our whitelisted methods. This finally pinpointed the main-thread time-consuming trigger: obtaining the WebView UA.
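A sketch of the proxy replacement (sProviderInstance and getProvider are hidden WebViewFactory members whose names come from AOSP and may differ across versions; WATCHED is a hypothetical whitelist of method names):

 Class<?> factoryClass = Class.forName("android.webkit.WebViewFactory");
 Method getProvider = factoryClass.getDeclaredMethod("getProvider");
 getProvider.setAccessible(true);
 final Object provider = getProvider.invoke(null);

 Class<?> providerItf = Class.forName("android.webkit.WebViewFactoryProvider");
 Object proxy = Proxy.newProxyInstance(providerItf.getClassLoader(),
         new Class<?>[]{providerItf}, new InvocationHandler() {
             @Override
             public Object invoke(Object p, Method method, Object[] args) throws Throwable {
                 if (WATCHED.contains(method.getName())) { // e.g. getStatics, createWebView
                     Log.d("WebViewTrace", "caller of " + method.getName(), new Throwable());
                 }
                 return method.invoke(provider, args);
             }
         });

 Field instance = factoryClass.getDeclaredField("sProviderInstance");
 instance.setAccessible(true);
 instance.set(null, proxy);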

Solution

Having confirmed that the cost comes from obtaining the WebView UA, we can fix it with a local cache. The UA mainly records information such as the WebView version, which rarely changes, so we cache it locally and read it from disk thereafter. Each time the app goes to the background we fetch the WebView UA again and refresh the cache, avoiding jank during foreground use.
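A sketch of the cache (SharedPreferences is used here purely for illustration; WebSettings.getDefaultUserAgent is the public API that triggers provider initialization on first use):

 public static String getWebViewUserAgent(Context context) {
     SharedPreferences sp = context.getSharedPreferences("webview_ua", Context.MODE_PRIVATE);
     String cached = sp.getString("ua", null);
     if (cached != null) {
         return cached; // cheap path: no Chromium startup on the main thread
     }
     String ua = WebSettings.getDefaultUserAgent(context); // triggers provider init once
     sp.edit().putString("ua", ua).apply();
     return ua;
 }

 // Refresh the cached value when the app goes to background so WebView updates
 // are picked up on a later launch.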

The cache may lag behind when the WebView UA changes. If UA freshness really matters, we can instead obtain the UA from a child process through a ContentProvider call: that stalls the child process's main thread, not the foreground process. Of course, this requires starting a child process and completing WebView initialization before the UA can be read, so it is clearly slower than the local cache and unsuited to speed-sensitive scenarios; choose according to actual needs.

2. Background task optimization

The cases so far mostly concern time spent directly on the main thread. In fact, background tasks also affect startup speed: they compete with foreground tasks for CPU, I/O, and other resources, stretching out foreground execution. So while optimizing foreground time, we must optimize background tasks too. Background task optimization is usually highly business-specific, but some general principles can be distilled:

  • Reduce unnecessary background-thread work, especially tasks that are heavily CPU- or I/O-bound;
  • Converge the number of threads during startup, both to keep excessive concurrency from starving the main thread and to avoid frequent cross-thread scheduling that erodes concurrency efficiency.

In addition to these general principles, here are two fairly typical background task optimization cases from Douyin.

2.1 Process startup optimization

During optimization we must watch not only background threads but also background processes. Most apps today have push functionality, and to cut background power consumption and avoid being killed for excessive memory, push features are usually placed in a separate process. If the push process starts during the startup phase, it also has a sizable impact on startup speed, so we delay its startup as appropriate to keep it out of the startup phase.

Offline, we can filter logcat for keywords such as "Start proc" to find child processes started during the startup phase, along with the component that triggered them. In complex projects or third-party SDKs, knowing the component is often not enough to locate the triggering logic; we can instrument the Service, Receiver, and ContentProvider component calls to log the call stack, and combine that with the component named in "Start proc" to pinpoint the trigger. Besides the processes declared in the manifest, there may also be processes forked from native code, which we can discover via adb shell ps.

2.2 GC suppression

Another typical case of background work hurting startup is GC. Once triggered, GC can seize CPU resources and even suspend our threads; a large amount of GC during startup slows it significantly.

One remedy is to run less code during startup and allocate less memory. This requires reworking our code, and it is the most fundamental fix for GC's impact on startup speed. In parallel, we can use the general technique of GC suppression: suppressing certain types of GC during the startup phase to reduce collections.

Recently the company's Client Infrastructure - App Health team has investigated a GC suppression solution for the ART virtual machine and trialed it to speed up startup in some of the company's products. The technical details will be shared on the "ByteDance Terminal Technology" official account once polished.

3. Global optimization

The cases above target individual time-consuming points within a particular phase. There are also operations that are individually cheap but so frequent that they affect startup globally, such as hot paths in our business, class loading, and method execution efficiency. Here we introduce some of Douyin's attempts in these areas.

3.1 Class loading optimization

3.1.1 ClassLoader optimization

First, a Douyin class loading optimization case. Class loading is inseparable from the parent delegation mechanism; let's briefly review how a class is loaded under it:

First, look for the class among those already loaded; if found, return it directly. Otherwise, call the parent classloader's loadClass to search.

If the parent classloader can find the class, return it directly; otherwise, findClass is called to load it:

 protected Class<?> loadClass(String name, boolean resolve)
         throws ClassNotFoundException {
     Class<?> c = findLoadedClass(name);
     if (c == null) {
         try {
             if (parent != null) {
                 c = parent.loadClass(name, false);
             } else {
                 c = findBootstrapClassOrNull(name);
             }
         } catch (ClassNotFoundException e) {
         }

         if (c == null) {
             c = findClass(name);
         }
     }
     return c;
 }

ClassLoader in Android

[Figure: Android's ClassLoader structure, omitted.]

Optimization of class loading by ART virtual machines

The ART virtual machine still follows parent delegation for class loading but optimizes the implementation. Roughly, the flow is:

  • First, PathClassLoader's findLoadedClass is called to look up the already-loaded class; it reaches ClassLinker's LookupClass method through JNI, and returns directly on a hit;
  • On a miss, control does not immediately return to the Java layer; instead, ClassLinker's FindClassInBaseDexClassLoader is called in the native layer to search for the class;
  • Inside FindClassInBaseDexClassLoader, it first checks whether the current ClassLoader is a BootClassLoader. If so, it tries to find the class among the loader's loaded classes, returning it on a hit; on a miss it tries to load with the current ClassLoader and returns either way;
  • If the current ClassLoader is not a BootClassLoader, it checks whether it is a PathClassLoader; if not, it returns directly;
  • If it is a PathClassLoader, it checks for a parent. If there is one, FindClassInBaseDexClassLoader recurses with the parent, returning on a hit; if nothing is found, or the PathClassLoader has no parent, the class is loaded directly in the native layer through DexFile.

So when the chain above a PathClassLoader contains only PathClassLoaders, a call to the Java layer's findLoadedClass does more than its name suggests: beyond searching the loaded classes, it loads the class directly through DexFile in the native layer. Compared with returning to the Java layer to call findClass and then dropping back into native code to load via DexFile, this saves an unnecessary JNI round trip and runs faster. That is ART's class loading optimization.

ClassLoader model in TikTok

We introduced Android's class loading mechanism above; so what did we optimize? To answer that, we need to understand Douyin's ClassLoader model. To reduce package size, Douyin delivers some non-core features dynamically as plug-ins. With the plug-in framework in place, Douyin's ClassLoader model looks like this:

  • Besides the original BootClassLoader and PathClassLoader, a DelegateClassLoader and PluginClassLoaders are introduced;
  • There is a single global DelegateClassLoader; it is the parent of the PathClassLoader, and its own parent is the BootClassLoader;
  • There is one PluginClassLoader per plug-in, each with BootClassLoader as its parent;
  • The DelegateClassLoader holds references to the PluginClassLoaders, and each PluginClassLoader holds a reference to the PathClassLoader.

This ClassLoader model has an obvious advantage: it readily supports class isolation, class reuse, and free switching between plug-in and componentized modes:

  • Class isolation: if a class with the same name exists in both the host and one or more plug-ins, using the class in the host loads it from the host APK, while using it in a plug-in loads it from that plug-in's APK. Plug-in frameworks built on a single-ClassLoader model cannot support this loading mechanism;
  • Class reuse: when the host uses a class that exists only in a plug-in, the load failure is detected in the DelegateClassLoader and the class is then loaded with the PluginClassLoader, letting the host reuse plug-in classes. Conversely, when a plug-in uses a class that exists only in the host, the load failure is detected in the PluginClassLoader and the class is then loaded with the PathClassLoader, letting the plug-in reuse host classes. Other multi-ClassLoader plug-in frameworks cannot support this reuse mechanism;
  • Free switching between plug-in and componentized modes: under this model, no explicit ClassLoader needs to be specified when loading classes in the host or a plug-in, so we can easily switch between directly-dependent componentization and the compileOnly + plug-in approach.
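Here is a minimal sketch of the two introduced loaders. All class names, the PluginRecord bookkeeping, and the way the delegate decides that a class belongs to a plug-in (a package-prefix predicate here) are assumptions for illustration; Douyin's actual plug-in framework is not public.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sits between PathClassLoader (its child) and BootClassLoader (its parent).
final class DelegateClassLoader extends ClassLoader {
    private final List<PluginRecord> plugins = new CopyOnWriteArrayList<>();

    DelegateClassLoader(ClassLoader bootClassLoader) {
        super(bootClassLoader); // parent is BootClassLoader
    }

    void registerPlugin(PluginRecord record) {
        plugins.add(record);
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // Reached after BootClassLoader has failed. Route classes that belong
        // to a plug-in to that plug-in's loader; for everything else, throw so
        // the child PathClassLoader searches the host APK (class isolation).
        for (PluginRecord p : plugins) {
            if (p.ownsClass(name)) {
                return p.loader.loadClass(name);
            }
        }
        throw new ClassNotFoundException(name);
    }
}

// One per plug-in; its parent is BootClassLoader, and it keeps a reference
// back to the host PathClassLoader so plug-ins can reuse host classes.
final class PluginClassLoader extends ClassLoader {
    private final ClassLoader hostLoader; // the host PathClassLoader

    PluginClassLoader(ClassLoader bootClassLoader, ClassLoader hostLoader) {
        super(bootClassLoader);
        this.hostLoader = hostLoader;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        Class<?> c = findInPluginDex(name);
        if (c != null) {
            return c;
        }
        // Class reuse: fall back to the host for host-only classes.
        return hostLoader.loadClass(name);
    }

    private Class<?> findInPluginDex(String name) {
        return null; // placeholder: a real loader searches the plug-in APK's dex
    }
}

// Hypothetical bookkeeping for which classes a plug-in provides.
final class PluginRecord {
    final ClassLoader loader;
    private final String packagePrefix;

    PluginRecord(ClassLoader loader, String packagePrefix) {
        this.loader = loader;
        this.packagePrefix = packagePrefix;
    }

    boolean ownsClass(String name) {
        return name.startsWith(packagePrefix);
    }
}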

The ART class loading optimization mechanism is broken

The above introduces the advantages of Douyin's ClassLoader model, but the model also has a fairly hidden drawback: it breaks the ART virtual machine's class loading optimization. Recall the native fast path described earlier: FindClassInBaseDexClassLoader returns immediately once it meets a ClassLoader that is neither a BootClassLoader nor a PathClassLoader. With the DelegateClassLoader inserted between the PathClassLoader and the BootClassLoader, every lookup hits this bail-out, falls back to the Java-layer loadClass path, and pays an extra JNI round trip for each class loaded during startup.


Non-invasive optimization solution: Delayed injection

Once we understand why plug-in support degrades class loading, the optimization idea becomes clear: remove the DelegateClassLoader from between the PathClassLoader and the BootClassLoader.

From the previous analysis, we know the DelegateClassLoader is introduced so that, during PathClassLoader.loadClass, classes can be loaded from plug-ins with the PluginClassLoader. Therefore, in scenarios where no plug-in is used, the DelegateClassLoader is completely unnecessary; we only need to inject it when plug-in functionality is about to be used.

In practice, however, fully on-demand injection is difficult, because we cannot precisely capture every moment at which a plug-in is loaded. For example, a plug-in class may be loaded through an implicit compileOnly dependency, or a plug-in view referenced in XML may trigger plug-in loading. Adapting every such site would be quite intrusive to business development.

Here we tried another idea: although we cannot know exactly when plug-ins are loaded, we do know where plug-ins are definitely not loaded. For example, no plug-in is loaded during the Application stage, so we can wait until the Application stage completes before injecting the DelegateClassLoader. In fact, class loading during startup is concentrated in the Application stage, so injecting the DelegateClassLoader after the Application stage greatly reduces the plug-in solution's impact on startup speed while avoiding any intrusion into business code. A minimal sketch of the injection follows.
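This sketch assumes the delegate is created lazily once the Application stage ends. Rewriting the private parent field of java.lang.ClassLoader is a common plug-in framework trick, but it relies on hidden internals that vary across Android versions and may be limited by non-SDK-interface restrictions; treat it as an illustration, not Douyin's exact implementation.

import android.content.Context;
import java.lang.reflect.Field;

final class DelegateInjector {
    // Call after the Application stage completes (for example, posted to the
    // main looper at the end of Application.onCreate).
    static void injectDelegate(Context context, ClassLoader delegate) {
        try {
            ClassLoader pathClassLoader = context.getClassLoader();
            Field parentField = ClassLoader.class.getDeclaredField("parent");
            parentField.setAccessible(true);
            // The delegate's own parent should already be BootClassLoader, so
            // the chain becomes PathClassLoader -> Delegate -> Boot.
            parentField.set(pathClassLoader, delegate);
        } catch (ReflectiveOperationException e) {
            // If injection fails, keep the original chain; plug-in loading can
            // still go through explicit calls on the delegate loader.
            throw new IllegalStateException("delegate injection failed", e);
        }
    }
}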

Invasive optimization solution: Transforming the ClassLoader model

The above solution requires no intrusion into the business and has a very low transformation cost, but it only optimizes class loading in the Application stage; ART's class loading optimization is still lost in all later stages. In pursuit of extreme performance, we went further. The core idea is to remove the DelegateClassLoader from between the PathClassLoader and the BootClassLoader entirely, and solve the problem of the host loading plug-in classes by other means. Analysis shows the host loads plug-in classes mainly in the following ways:

  • Loading a plug-in class reflectively through Class.forName;
  • Depending on a plug-in class implicitly through compileOnly and loading it directly at runtime;
  • Loading a plug-in's component class when starting one of the plug-in's four major components;
  • Using plug-in classes in XML.

Therefore, the problem becomes how to support these four ways for the host to load plug-in classes without injecting the DelegateClassLoader.

First, the Class.forName path. The most direct fix is to specify the DelegateClassLoader explicitly when calling Class.forName, but this is unfriendly to business development and impossible for third-party SDKs we cannot modify. Our final solution is to instrument Class.forName call sites with bytecode transformation and retry with the DelegateClassLoader when the class load fails, as sketched below.
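A sketch of the instrumented call path: a bytecode transform (for example, an ASM-based plug-in in a framework like ByteX) can redirect Class.forName call sites to a helper like this. PluginManager.delegateClassLoader() is a hypothetical accessor standing in for the plug-in framework's registry.

final class ClassForNameHelper {
    // Bytecode instrumentation rewrites `Class.forName(name)` call sites in
    // host and third-party code into `ClassForNameHelper.forName(name)`.
    static Class<?> forName(String className) throws ClassNotFoundException {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            // Host lookup failed: retry with the delegate, which can route
            // the request to the owning plug-in's ClassLoader.
            return Class.forName(className, true, PluginManager.delegateClassLoader());
        }
    }
}

// Hypothetical accessor; stands in for the plug-in framework's registry.
final class PluginManager {
    static ClassLoader delegateClassLoader() {
        throw new UnsupportedOperationException("provided by the plug-in framework");
    }
}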

Next is the implicit compileOnly dependency, which is harder to handle generically because there is no suitable point at which we can intercept the class load failure. Our solution is to transform the business code, changing implicit compileOnly call sites to use Class.forName. This transformation is based on several considerations:

  • First, there are very few implicit-dependency call sites in Douyin, so the modification cost is controllable;
  • Second, although the compileOnly approach is convenient for using plug-ins, its entry points are not centralized, and it causes problems for plug-in load control, troubleshooting, and host-plug-in version compatibility. The Class.forName + interface approach solves these problems better.

The remaining two problems, loading the plug-in's four major component classes and using plug-in classes in XML, can be solved the same way: replace the ClassLoader held by LoadedApk with the DelegateClassLoader, so that both component class loading and the class loading performed when LayoutInflater inflates XML go through the DelegateClassLoader. For the underlying principles, refer to analyses of plug-in frameworks such as DroidPlugin and RePlugin; we will not expand on them here. A hedged sketch of the replacement follows.
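This reflection sketch illustrates the idea. ActivityThread and LoadedApk are hidden AOSP classes; the field names used here (mPackages, mClassLoader) match common AOSP versions but are internals that can change and fall under non-SDK-interface restrictions, so this is an illustration rather than a production-ready implementation.

import java.lang.ref.WeakReference;
import java.lang.reflect.Field;
import java.util.Map;

final class LoadedApkHook {
    @SuppressWarnings("unchecked")
    static void replaceClassLoader(String packageName, ClassLoader delegate)
            throws ReflectiveOperationException {
        // Grab the process's ActivityThread instance.
        Class<?> atClass = Class.forName("android.app.ActivityThread");
        Object activityThread = atClass.getMethod("currentActivityThread").invoke(null);

        // mPackages maps package name -> WeakReference<LoadedApk>.
        Field packagesField = atClass.getDeclaredField("mPackages");
        packagesField.setAccessible(true);
        Map<String, WeakReference<?>> packages =
                (Map<String, WeakReference<?>>) packagesField.get(activityThread);

        Object loadedApk = packages.get(packageName).get();
        if (loadedApk == null) {
            return; // LoadedApk already collected; nothing to hook
        }

        // Swap in the delegate so component instantiation and LayoutInflater's
        // class loading both use it.
        Field classLoaderField = loadedApk.getClass().getDeclaredField("mClassLoader");
        classLoaderField.setAccessible(true);
        classLoaderField.set(loadedApk, delegate);
    }
}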

3.1.2 Class verify optimization

The ClassLoader work above optimizes the load stage of class loading; other stages can be optimized too. A typical case is class verify. The class verify process mainly checks whether a class complies with the Java specification; if it does not, a verification exception is thrown during the verify stage.

Generally, classes in Android are verified when the application is installed or when a plug-in is loaded. In some specific cases, however, verification happens at runtime instead: for example, for plug-ins on Android 10 and later, plug-in compilation adopts the extract compiler filter, and host-plug-in interdependencies can cause static verification to fail. Besides checking the class itself, runtime verification also triggers loading of the classes it depends on, which adds to the time cost.

In fact, class verify mainly guards against bytecode delivered over the network. Our plug-in code is already checked for legality during compilation, and even if an illegal class did slip through, the worst case is that the exception that would have been thrown at verify time is deferred to the point where the class is first used.

Therefore, we can treat runtime class verify as unnecessary and optimize the loading of these classes by turning it off. There are some excellent industry solutions for disabling class verify, such as locating the address of the runtime's verify_ field in memory and setting it to the skip-verification mode, thereby skipping class verify:

// If kNone, verification is disabled. kEnable by default.
verifier::VerifyMode verify_;

// If true, the runtime may use dex files directly with the interpreter if an oat file is not available/usable.
bool allow_dex_file_fallback_;

// List of supported cpu abis.
std::vector<std::string> cpu_abilist_;

// Specifies target SDK version to allow workarounds for certain API levels.
int32_t target_sdk_version_;

Of course, disabling class verify is not necessarily worthwhile for every application. Before optimizing, you can dump the classes that verify at runtime in the host and plug-ins with the oatdump command; if a large number of classes are verified at runtime, the solution above is worth applying.

oatdump --oat-file=xxx.odex > dump.txt
cat dump.txt | grep -i "verified at runtime" | wc -l

3.2 Other global optimizations

In terms of global optimization, there are some other, more general solutions. Here is a brief introduction for reference:

  • High-frequency method optimization: optimize frequently called methods such as service discovery (SPI) and experiment-switch reads by moving annotation reading, reflection, and similar runtime work to the compilation stage, where generated target code replaces the original calls and improves execution speed;
  • IO optimization: improve IO efficiency by removing unnecessary IO from the startup phase, pre-reading IO on critical paths, and other general IO optimizations;
  • Binder optimization: cache the results of binder calls that are made multiple times during startup to reduce the number of IPCs, such as fetching the application's PackageInfo or the network status (see the sketch after this list);
  • Lock optimization: reduce the impact of locks on startup by removing unnecessary locks, reducing lock granularity, shortening lock hold times, and other common techniques;
  • Bytecode execution optimization: reduce the execution of unnecessary bytecode such as redundant method calls; this has been integrated as a plug-in into Douyin's open-source bytecode framework ByteX (see ByteX for details);
  • Preload optimization: make full use of the system's concurrency by accurately preloading resources on asynchronous threads, guided by user profiles and on-device intelligent prediction, to eliminate or reduce the cost of key nodes. Preloadable content includes sp, resources, views, classes, and so on;
  • Thread scheduling optimization: reduce time spent in the Sleeping and Uninterruptible Sleeping states through dynamic task priority adjustment and load balancing across CPU cores, improving CPU time-slice utilization without raising CPU frequency (solution provided by the Client Infrastructure - App Health team);
  • Vendor cooperation: work with device manufacturers to obtain more system resources through CPU core binding, frequency boosting, and the like, to improve startup speed.
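As a concrete example of the binder optimization above, here is a minimal sketch of caching PackageInfo so that repeated lookups during startup hit memory instead of making an IPC each time. The class and its API are illustrative, not Douyin's actual implementation.

import android.content.Context;
import android.content.pm.PackageInfo;
import android.content.pm.PackageManager;

public final class PackageInfoCache {
    private static volatile PackageInfo sPackageInfo;

    private PackageInfoCache() {}

    // The first call pays the binder IPC; later startup-path callers hit the cache.
    public static PackageInfo get(Context context) {
        PackageInfo cached = sPackageInfo;
        if (cached == null) {
            synchronized (PackageInfoCache.class) {
                cached = sPackageInfo;
                if (cached == null) {
                    try {
                        cached = context.getPackageManager()
                                .getPackageInfo(context.getPackageName(), 0);
                    } catch (PackageManager.NameNotFoundException e) {
                        throw new IllegalStateException(e); // our own package must exist
                    }
                    sPackageInfo = cached;
                }
            }
        }
        return cached;
    }
}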

Summary and Outlook

So far, we have introduced the typical and general cases in Douyin's startup optimization, and we hope they provide a useful reference for your own work. Looking back at all of Douyin's past startup optimizations, general-purpose optimizations account for only a small portion; most are business-specific optimizations that are strongly tied to Douyin's business and cannot be migrated directly to other products. Finally, we summarize our startup optimization and look ahead from a practical perspective, hoping it will be helpful to everyone.

Continuous iteration

Startup optimization is a process of continuous iteration and polishing. The first stage is usually a "fast and fierce" phase of rapid optimization: the remaining headroom is large, the optimization granularity is relatively coarse, and good returns can be achieved with little manpower. The second stage tackles the hard problems; it requires more investment than the first, and the final result largely depends on it. The third stage is about preventing regressions and continuing fine-grained optimization; it lasts the longest, is especially important for rapidly iterating products, and is the only way to reach extreme startup performance.

Scenario generalization

Startup optimization also needs to be extended and generalized. Generally we focus on the span from the user clicking the icon to the first frame of the home page, but as launches from splash ads and push clicks grow, we need to cover those scenarios too. Moreover, even after the first frame of a page appears, users often still cannot see what they came for: what matters to them may not be the first frame but the moment effective content is loaded. Taking Douyin as an example, alongside startup speed we also track the time to the first video frame, and AB experiments show this metric matters even more than startup speed. Other products can likewise define corresponding metrics based on their own business, verify the impact on user experience, and decide whether optimization is necessary.

Global awareness

Generally speaking, we measure startup performance by startup speed. To improve it, we may defer startup-stage tasks or run them on demand. This effectively optimizes startup speed but may harm the subsequent user experience. For example, if a background task is postponed from the startup stage to later use, and its first use happens on the main thread, it may cause jank. So while we focus on startup performance, we also need to watch the other metrics it may affect.

On the performance side, we need a macro metric that reflects global performance, to avoid settling for local optima. On the business side, we need to connect startup performance with business metrics. Concretely, during optimization we attach AB experiment capability to larger startup optimizations wherever possible. On the one hand, this enables qualitative analysis that keeps out negative optimizations which benefit local performance but hurt the global experience; on the other hand, the experiments let us quantify each optimization's effect on the business, guiding subsequent optimization directions. At the same time, for changes that might cause stability or functional problems, the experiment switch also provides a rollback capability to stop losses in time.

At present, Volcano Engine, ByteDance's enterprise-grade technical service platform, has opened its AB experiment capabilities to the public; interested readers can visit the Volcano Engine official website to learn more.

Full coverage and refined operations

Going forward, Douyin's startup optimization has two major goals. The first is to maximize the coverage of startup optimization: in architecture, we want startup-stage code with simple, clear dependencies and module granularity as small as possible, keeping later optimization and iteration cheap; in experience, while optimizing performance we also need to improve interaction and content quality, raising the reach efficiency and quality of features; in scenarios, we need to fully cover the various startup modes and landing pages, such as cold, warm, and hot starts; in optimization directions, we need to cover CPU, IO, memory, locks, UI rendering, and more. The second goal is refined operation of startup optimization: applying different startup strategies to different users, different device performance and conditions, and different startup scenarios, to maximize the experience improvement for each user.
