Meituan Waimai iOS App Cold Start Management

1. Background

Cold start duration is an important indicator of App performance. As the first "door" to the user experience, it directly shapes the user's first impression of the App. Since November 2013, the Meituan Takeaway iOS client has gone through dozens of versions of iterative development; the product has been steadily improved and its business functions have become increasingly complex. At the same time, the takeaway App has evolved from a standalone business App into a platform App and has gradually taken on other new businesses such as flash sales and errands. As a result, more and more complex work has to be completed during a cold start, which challenges the App's cold start performance. In response, our team has carried out continuous, targeted cold start optimization based on the changes in business form and the characteristics of the takeaway App, with the aim of giving users a smoother experience.

2. Cold Start Definition

Generally speaking, the iOS cold start process is defined as the period from the moment the user taps the App icon to the completion of the AppDelegate's application:didFinishLaunchingWithOptions: method. This process is divided into two stages:

  • T1: Before main(). The operating system loads the App's executable file into memory, performs a series of loading and linking tasks, and finally reaches the App's main() function.
  • T2: After main(). This covers the time from main() to the completion of the AppDelegate's didFinishLaunchingWithOptions method.

However, when didFinishLaunchingWithOptions finishes, the user has not yet seen the main interface of the App and cannot start using it. In the food delivery App, for example, the App still has to do some initialization work and then go through location, the home page request, home page rendering, and so on before the user can actually see content and start using the App. Only at that point do we consider the cold start complete. We define this process as T3.

In summary, the food delivery App defines the cold start process as the period from when the user taps the App icon to when the user can see the content of the App's main interface, that is, T1+T2+T3. Each of these three stages offers many opportunities for optimization.

3. Current Problem

Existing performance problems

After dozens of versions of iterative development, the Meituan Takeaway iOS client has accumulated several performance issues during the cold start process. Solving these performance bottlenecks is the primary goal of cold start optimization. These issues mainly include:

Note: The definition of a startup item is a task that needs to be completed during the App startup process. For example, initializing a certain SDK, preloading a certain function, etc.

Incremental performance problems

Generally speaking, an App has no obvious cold start performance problems in its early stages. Such problems do not suddenly appear in one particular version. Instead, as versions iterate, the App's functions become more and more complex, the startup tasks multiply, and the cold start time grows little by little. By the time we notice and want to optimize, the problem has become very hard to untangle. The incremental performance problems of the food delivery App come mainly from the growing number of startup items: with each version, new startup tasks are simply piled into the startup process. If the cold start time increases by 0.1s per version, then after a few versions the cold start time will have increased significantly.

4. Governance ideas

There are three main goals for managing cold start performance issues:

  • Solve existing problems: optimize current performance bottlenecks, optimize the startup process, and shorten the cold start time.
  • Manage incremental issues: Standardize the cold start process, guide the maintenance of subsequent cold start code through code paradigms and documentation, and control time increments.
  • Improve monitoring: Improve cold start performance indicator monitoring, collect more detailed data, and discover performance problems in a timely manner.

5. Standardize the startup process

As of the end of 2017, Meituan Takeaway had 250 million users, and the Meituan Takeaway App had completed its evolution from an App supporting a single business to a platform App supporting multiple businesses (see "The promotion, support and thinking of Meituan Takeaway iOS multi-terminal reuse"); some of the company's emerging businesses have also been integrated into the Takeaway App. The following is the architecture diagram of the Takeaway App. The architecture is divided into three layers. The bottom layer is the basic component layer. The middle layer is the takeaway platform layer, which manages the basic components below it and provides a unified adaptation interface for the business components above it. The top layer is the business component layer, which includes the sub-business components split from the takeaway business (reusable by both the Takeaway App and the takeaway channel in the Meituan App) and the other, non-takeaway businesses that have been connected.

The platformization of apps provides businesses with an efficient, standardized, and unified platform. However, at the same time, the platformization and rapid iteration of businesses also bring problems to cold starts:

  • Existing startup items have piled up badly, slowing down startup.
  • There is no paradigm for adding new startup items, so they are disorganized, risky to modify, and hard to read and maintain.

To address this, we first sorted out all the startup items in the current startup process and then designed a new startup item management scheme suited to a platform App: phased startup and startup item self-registration.

Phased startup

In the early days, because the business was simple, startup items were not differentiated at all and were simply piled into the didFinishLaunchingWithOptions method. As the business grew, more and more startup item code accumulated there, resulting in poor performance and bloated, confusing code.

After sorting out and analyzing the SDKs involved, we found that startup items need to be classified according to the task they complete. Some startup items must run immediately after launch, such as crash monitoring and statistics reporting; otherwise information will be missed. Some must be completed at a fairly early point, such as SDKs that provide user information, location initialization, and network initialization. Others can be delayed, such as some custom configuration, some business service calls, payment SDKs, and map SDKs. For phased startup we first divided the startup process into several reasonable stages and then assigned each startup item to a stage according to the priority of what it does: high-priority items go into the early stages and low-priority items into the later ones.

The following is our redefinition of the startup phase of the Meituan Takeaway App, sorting out and reclassifying all startup items, and matching them to reasonable startup phases. This can postpone the execution of startup items that do not need to be executed too early, shortening the startup time; on the other hand, classifying startup items makes it easier to read and maintain them later. Then these rules are implemented as maintenance documents for startup items to guide the addition and maintenance of subsequent startup items.

Through the above work, we have sorted out more than a dozen startup items that can be postponed, accounting for about 30% of all startup items, effectively optimizing the cold start time occupied by the startup items.
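To make the idea of phased startup concrete, here is a minimal sketch in Python rather than Objective-C. The stage names and startup items are hypothetical and are not the stages actually defined for the Meituan Takeaway App; the sketch only shows items being tagged with a stage and each stage running its own items.

```python
from enum import IntEnum
from typing import Callable, List, Tuple


class Stage(IntEnum):
    """Hypothetical startup stages, ordered from earliest to latest."""
    WILL_FINISH_LAUNCH = 0   # must run immediately (crash monitoring, statistics)
    DID_FINISH_LAUNCH = 1    # needed early (user info, location, network)
    AFTER_HOME_LOADED = 2    # can be postponed (payment SDK, map SDK)


# Each startup item is a (name, stage, task) triple.
StartupItem = Tuple[str, Stage, Callable[[], None]]


def run_stage(items: List[StartupItem], stage: Stage, log: List[str]) -> None:
    """Run only the items assigned to `stage`, in declaration order."""
    for name, item_stage, task in items:
        if item_stage == stage:
            task()
            log.append(name)


executed: List[str] = []
items: List[StartupItem] = [
    ("payment_sdk", Stage.AFTER_HOME_LOADED, lambda: None),
    ("crash_monitor", Stage.WILL_FINISH_LAUNCH, lambda: None),
    ("network", Stage.DID_FINISH_LAUNCH, lambda: None),
    ("map_sdk", Stage.AFTER_HOME_LOADED, lambda: None),
]

# Drive the stages in order; in a real App each call would sit at a
# different point of the launch sequence rather than in one loop.
for stage in Stage:
    run_stage(items, stage, executed)

print(executed)
```

Whatever order the items are declared in, the postponable items only run once the last stage is triggered, which is what keeps them off the critical path.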

Startup item self-registration

After determining the phased startup plan, the next question we face is how to execute these startup items. The easiest solution is to create a startup manager at startup, read all startup items, and then trigger the startup item execution when the time node arrives. This method has two problems:

  1. All startup items must be written into a file in advance (imported in a .m file, or organized in a .plist file). This centralized writing method will lead to bloated code that is difficult to read and maintain.
  2. The startup item code cannot be reused: The startup item cannot be integrated into the sub-business library and needs to be implemented repeatedly in the Food Delivery App and the Meituan App, which is inconsistent with the direction of the Food Delivery App platformization.

What we hope for is that the startup item maintenance method is pluggable, the startup items and business modules are not coupled, and a single implementation can be reused on both ends. The figure below shows the startup item management method we use, which we call self-registration of startup items: a startup item is defined inside a sub-business module, encapsulated into a method, and self-declares the startup phase (for example, a startup item A can be declared to be executed in the willFinishLaunch phase in an independent App, and in the resignActive phase in the Meituan App). In this way, the startup items are reused on both ends, unrelated startup items are isolated from each other, and adding/deleting startup items is more convenient.
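The self-registration idea can be illustrated with a small Python analogy. This is not how Kylin works internally: a decorator filling a registry at definition time stands in for compile-time registration, and every name below (startup_item, execute_stage, the host names) is hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

# Registry filled as a side effect of defining each startup item.
# Keyed by (host app, stage name); all names here are hypothetical.
_registry: DefaultDict[Tuple[str, str], List[Callable[[], str]]] = defaultdict(list)


def startup_item(**stage_by_host: str):
    """Decorator: declare, per host app, the stage in which the item runs."""
    def register(func):
        for host, stage in stage_by_host.items():
            _registry[(host, stage)].append(func)
        return func
    return register


def execute_stage(host: str, stage: str) -> List[str]:
    """Called by the host app at each stage; runs whatever registered itself."""
    return [func() for func in _registry[(host, stage)]]


# --- code that would live inside a sub-business module ---
@startup_item(waimai="willFinishLaunch", meituan="resignActive")
def startup_item_a() -> str:
    return "A"


@startup_item(waimai="willFinishLaunch", meituan="willFinishLaunch")
def startup_item_b() -> str:
    return "B"


print(execute_stage("waimai", "willFinishLaunch"))   # items A and B
print(execute_stage("meituan", "resignActive"))      # item A only
```

The host App never lists the startup items; it only triggers stages. Adding or removing an item touches the sub-business module alone, which is the pluggable property described above.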

So how do you declare the startup phase for a startup item? And how do you trigger the execution of the startup item at the right time? In code, a startup item ultimately corresponds to the execution of a function, so as long as you can get the pointer to that function at runtime, you can trigger the startup item. Kylin, a startup governance infrastructure component developed by the Meituan platform, does exactly this: the core idea of Kylin is to write data (such as function pointers) into the __DATA segment of the executable file during compilation, and then retrieve the data from the __DATA segment at runtime to perform the corresponding operations (call the functions).

Why do we need to borrow the __DATA segment? The reason is to be able to cover all startup stages, such as the stage before main().

Kylin implementation principle: Clang provides many compiler attributes that serve different purposes. One of them is the section() attribute, which places a variable in a named section of the binary, so that constants known at compile time can be written into the data segment. The implementation has two parts: compile time and runtime. At compile time, the compiler writes the data marked with __attribute__((section())) to the specified section, for example a {key, *pointer} pair in which the key identifies a startup stage. At runtime, at the appropriate point, the function pointers are read back by key and the functions are called.

The above method can be encapsulated into a macro to simplify the code. Taking the macro call KLN_STRINGS_EXPORT("Key", "Value") as an example, it eventually expands to:

    __attribute__((used, section("__DATA" "," "__kylin__")))
    static const KLN_DATA __kylin__0 = (KLN_DATA){(KLN_DATA_HEADER){"Key", KLN_STRING, KLN_IS_ARRAY}, "Value"};

In the following example, two startup functions are registered to startup stage A at compile time:

    // In a.m: the registration macro declares startup item A to run in the STAGE_KEY_A stage
    KLN_FUNCTIONS_EXPORT(STAGE_KEY_A)() {
        // Startup item code A
    }

    // In b.m: declare startup item B to run in the STAGE_KEY_A stage as well
    KLN_FUNCTIONS_EXPORT(STAGE_KEY_A)() {
        // Startup item code B
    }

During startup, when the STAGE_KEY_A point is reached, all startup items registered to STAGE_KEY_A are triggered. In this way, with almost no additional auxiliary code, the self-registration of startup items is achieved very concisely.

    - (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions {
        // Other logic
        [[KLNKylin sharedInstance] executeArrayForKey:STAGE_KEY_A]; // Trigger all startup items registered to the STAGE_KEY_A time node here
        // Other logic
        return YES;
    }

After sorting out and optimizing the existing startup items, we also published specifications for adding and maintaining startup items, which standardize the classification principles, priorities, and startup stages of future startup items. The purpose is to control incremental performance problems and preserve the optimization results.

6. Optimization before main()

Before the main() function is called, almost all the work is done by the operating system, and there is not much that developers can do. So to optimize this period, you must first understand what the operating system does before main(). Before main(), the operating system loads the executable file (Mach-O format) into memory, then loads the dynamic linker dyld, and then performs a series of dynamic linking and initialization operations (loading, binding, and running initializers). There is a lot of material on this topic online, but much of it is repetitive. For details, see the WWDC session: Optimizing App Startup Time.

Loading process - from exec() to main()

The actual loading process starts with the exec() function, which is a system call. The operating system first allocates a memory space for the process and then performs the following operations:

  1. Load the executable file corresponding to the App into memory.
  2. Load Dyld into memory.
  3. Dyld performs dynamic linking.

Let's briefly analyze what Dyld does in each stage:

Finally, dyld will call the main() function, main() will call UIApplicationMain(), and the before main() process is completed.

After understanding the loading process before main(), we can analyze some factors that affect T1 time:

  • The more dynamic libraries are loaded, the slower the startup.
  • The more ObjC classes and methods there are, the slower the startup.
  • The more ObjC +load methods there are, the slower the startup.
  • The more C constructor functions there are, the slower the startup.
  • The more C++ static objects there are, the slower the startup.

In view of the above points, we have done the following optimization work:

Code Slimming

As the business iterates, new code is constantly added and old code and resource files fall out of use. However, this useless code and these useless files are often left in a corner of the project and not cleaned up in time. They increase the package size of the App and also slow down its cold start, so it is necessary to clean them up promptly.

From the structure of Mach-O files we know that __TEXT:__objc_methname contains all the method names in the code, and __DATA:__objc_selrefs contains references to all the methods that are used. Taking the difference between the two sets gives all the unused code. The core method is as follows; for details, please refer to objc_cover:

    import os
    import re

    def referenced_selectors(path):
        # Each line of the __objc_selrefs dump points at a selector name stored in __TEXT:__objc_methname
        re_sel = re.compile("__TEXT:__objc_methname:(.+)")
        refs = set()
        # Dump the selector references, i.e. the methods that are actually used (iOS & macOS)
        lines = os.popen("/usr/bin/otool -v -s __DATA __objc_selrefs %s" % path).readlines()
        for line in lines:
            results = re_sel.findall(line)
            if results:
                refs.add(results[0])
        return refs
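The set difference itself can be sketched as follows. The sample text only approximates the layout of otool output, and the selector names are made up for illustration; how the set of implemented selectors is collected is left out here and handled by objc_cover in practice.

```python
import re

# Same pattern as in the snippet above: pulls the selector name out of a
# line of `otool -v -s __DATA __objc_selrefs` output.
RE_SELREF = re.compile(r"__TEXT:__objc_methname:(.+)")


def referenced_from_text(selrefs_output: str) -> set:
    """Collect every selector name that appears in a selrefs dump."""
    refs = set()
    for line in selrefs_output.splitlines():
        refs.update(RE_SELREF.findall(line))
    return refs


def unused_selectors(implemented: set, selrefs_output: str) -> set:
    """Implemented selectors that no selector reference points at."""
    return implemented - referenced_from_text(selrefs_output)


# Hypothetical sample data, only approximating otool's layout.
sample_selrefs = """\
Contents of (__DATA,__objc_selrefs) section
0000000100008000  __TEXT:__objc_methname:viewDidLoad
0000000100008008  __TEXT:__objc_methname:setupNetwork
"""
sample_implemented = {"viewDidLoad", "setupNetwork", "legacyBannerTapped:"}

print(sorted(unused_selectors(sample_implemented, sample_selrefs)))
```

A selector that is only reached through a runtime string lookup does not show up among the references, so the result is a list of candidates to review by hand rather than code that is safe to delete automatically.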

Using this method, we identified more than a dozen useless classes and 250+ useless methods.

+load optimization

Most iOS Apps contain at least a few +load methods that perform some work when the App starts. +load methods are executed in the Initializers stage, and too many of them slow down startup, especially in medium and large Apps. Analyzing the +load methods in our App, we found that although much of this code does need to run early during startup, it does not need to run as early as +load. It can be deferred to a point after the App's cold start, as with some routing operations. In fact, a +load method can itself be treated as a startup item, so to replace +load methods we again use the Kylin approach described above.

Example of use:

    // Replace the +load declaration with WMAPP_BUSINESS_INIT_AFTER_HOMELOADING; no other changes are needed
    WMAPP_BUSINESS_INIT_AFTER_HOMELOADING() {
        // Code from the original +load method
    }

    // Trigger all methods registered to this stage at an appropriate time, such as after the cold start is completed
    [[KLNKylin sharedInstance] executeArrayForKey:@kWMAPP_BUSINESS_INITIALIZATION_AFTER_HOMELOADING_KEY];

7. Optimize time-consuming operations

The main work after main() is to execute the various startup items (described above), build the main interface (TabBarVC, HomeVC, and so on), and load resources (image I/O, image decoding, unarchiving documents, and so on). Some of these operations are time-consuming, and they are very hard to find just by reading the code. How can we find these time-consuming points? The right tools make this much easier.

Time Profiler

Time Profiler is a time performance analysis tool that comes with Xcode. It tracks the stack information of each thread at fixed time intervals, and calculates how long a method has been executed by comparing the stack status between time intervals, and obtains an approximate value. There are many tutorials on how to use Time Profiler online, so we will not introduce them in detail here. Here is a usage document: Instruments Tutorial with Swift: Getting Started.

Flame graph

In addition to Time Profiler, the flame graph is also a powerful tool for analyzing CPU time consumption, and it is easier to read than Time Profiler's output. The product of flame graph analysis is a picture of call stack time consumption; it is called a flame graph because the whole picture looks like a dancing flame. The tip of a flame is the top of the call stack and the base is the bottom of the stack. The vertical direction represents the depth of the call stack, and the horizontal direction represents the time consumed. The wider a box is, the more likely it is to be a bottleneck. The main way to analyze a flame graph is to look at the wider flames, paying special attention to flames with a flat top (plateaus), which indicate a function that itself occupies the CPU for a long time.

(Figure: analysis output of Caesium, a performance analysis tool developed by the Meituan platform.)

By analyzing the flame graph, we found many problems in the cold start process and reduced the cold start time by more than 0.3s. The optimizations are summarized as follows:
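To show where the widths in a flame graph come from, the following sketch aggregates sampled call stacks into per-frame sample counts. It is a simplification and not how Caesium works: frames are merged by name rather than by call path, and the stacks and function names are invented.

```python
from collections import Counter
from typing import Sequence, Tuple


def frame_widths(samples: Sequence[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Return (inclusive, self) sample counts per frame.

    Each sample is a call stack listed from the bottom (entry point)
    to the top (the function running when the sample was taken).
    """
    inclusive: Counter = Counter()
    self_time: Counter = Counter()
    for stack in samples:
        for frame in set(stack):        # count a frame once per sample
            inclusive[frame] += 1
        if stack:
            self_time[stack[-1]] += 1   # only the top frame is on the CPU
    return inclusive, self_time


# Hypothetical samples taken at a fixed interval during launch.
samples = [
    ["main", "didFinishLaunching", "setupHomeVC", "decodeImage"],
    ["main", "didFinishLaunching", "setupHomeVC", "decodeImage"],
    ["main", "didFinishLaunching", "setupHomeVC", "decodeImage"],
    ["main", "didFinishLaunching", "setupHomeVC"],
    ["main", "didFinishLaunching", "initSDK"],
]

inclusive, self_time = frame_widths(samples)
print(inclusive.most_common())
print(self_time.most_common(1))
```

The inclusive count corresponds to the width of a box, and a large self count corresponds to a flat top: the function is wide without any children above it.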

8. Optimizing Serial Operations

During a cold start, many operations are executed serially, and a chain of serial tasks takes a long time. If serial operations can be made parallel, the cold start time can be shortened considerably.

Using the splash page

Now many apps do not directly enter the home page when they start up, but instead display a splash screen page that lasts for a short period of time. If used properly, this splash screen page can help us save some startup time. Because when an app is more complex, it is a time-consuming process to build the UI of the app for the first time when it starts up. Assuming this time is 0.2 seconds, if we build the home page UI first and then add this splash screen page to the Window, then during a cold start, the app will actually get stuck for 0.2 seconds. However, if we first use the splash screen page as the RootViewController of the app, then the construction process will be very fast. Because the splash screen page only has a simple ImageView, and this ImageView will be displayed to the user for a short period of time, we can use this time to build the home page UI, killing two birds with one stone.

Cached location & home page pre-request

An important serial process in the cold start of the Meituan Takeaway App is: home page location --> home page request --> home page rendering. These three operations account for about 77% of the entire home page loading time. Therefore, to shorten the cold start time, these three steps must be optimized.

The previous serial operation process is as follows:

The optimized design uses the location cached on the client to pre-request the home page data at the same time as location is started, so that locating and requesting run in parallel. When the user's real location fix succeeds, we check whether it hits the cached location. If it does, the pre-requested data is valid, which saves about 40% of the home page loading time, a very noticeable improvement. If it does not, the pre-requested data is discarded and the request is sent again.
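The parallel flow can be sketched as below. This is an illustration under assumptions, not the App's implementation: the hit test (a simple coordinate tolerance), the function names, and the fake request are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

Location = Tuple[float, float]  # (latitude, longitude)


def hits_cache(cached: Location, real: Location, tolerance: float = 0.01) -> bool:
    """Hypothetical hit test: the real fix is close enough to the cached one."""
    return (abs(cached[0] - real[0]) <= tolerance
            and abs(cached[1] - real[1]) <= tolerance)


def load_home(cached: Location,
              locate: Callable[[], Location],
              request_home: Callable[[Location], dict]) -> Tuple[dict, bool]:
    """Return (home page data, True if the pre-requested data was reused)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        real_future = pool.submit(locate)               # real location fix
        pre_future = pool.submit(request_home, cached)  # pre-request, in parallel
        real = real_future.result()
        if hits_cache(cached, real):
            return pre_future.result(), True
    # Miss: the pre-requested data is discarded and the request is repeated.
    return request_home(real), False


def fake_request(location: Location) -> dict:
    """Stand-in for the home page request."""
    return {"requested_for": location}


cached_fix = (39.90, 116.40)
print(load_home(cached_fix, lambda: (39.905, 116.401), fake_request))
print(load_home(cached_fix, lambda: (31.23, 121.47), fake_request))
```

On a hit, the total time is roughly the longer of the two parallel tasks instead of their sum; on a miss, the cost falls back to the original serial path.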

9. Data Monitoring

Both Time Profiler and Caesium flame graph can only analyze the time-consuming operations of the App on a single device offline, which has great limitations and cannot monitor the performance of the App on the user's device online. The food delivery app uses the company's self-developed Metrics performance monitoring system to monitor the performance indicators of the App over a long period of time, helping us understand the real performance of the App in various online environments and provide reliable data support for technical optimization projects. One of the core indicators monitored by Metrics is the cold start time.

Cold start start and end points

End point: The end point is relatively easy to determine. We can treat the display of certain view elements on the home page as the sign that the home page has finished loading.

Start point: Our own code generally takes control of the App only after main(), but using main() as the start of the cold start is clearly inappropriate, because the T1 period would then not be counted. So how should the start point be determined? There are currently two common methods in the industry. One is to use the execution time of the +load method of any class in the executable file as the starting point. The other is to analyze the dylib dependencies, find a leaf-node dylib, and use the execution time of the +load method of one of its classes as the starting point. Given the order in which dyld loads dylibs, the latter is earlier. However, the starting points obtained by both methods fall within the Initializers stage, so the time before Initializers is not counted. Metrics takes a different approach and uses the App's process creation time (that is, the time the exec function is executed) as the start of the cold start. This is possible because the system lets us obtain information about the process, including the timestamp of its creation, through the sysctl function.

    #import <sys/sysctl.h>
    #import <mach/mach.h>

    + (BOOL)processInfoForPID:(int)pid procInfo:(struct kinfo_proc *)procInfo
    {
        int cmd[4] = {CTL_KERN, KERN_PROC, KERN_PROC_PID, pid};
        size_t size = sizeof(*procInfo);
        return sysctl(cmd, sizeof(cmd) / sizeof(*cmd), procInfo, &size, NULL, 0) == 0;
    }

    // Process creation time in milliseconds
    + (NSTimeInterval)processStartTime
    {
        struct kinfo_proc kProcInfo;
        if ([self processInfoForPID:[[NSProcessInfo processInfo] processIdentifier] procInfo:&kProcInfo]) {
            return kProcInfo.kp_proc.p_un.__p_starttime.tv_sec * 1000.0 + kProcInfo.kp_proc.p_un.__p_starttime.tv_usec / 1000.0;
        } else {
            NSAssert(NO, @"Unable to obtain process information");
            return 0;
        }
    }

Process creation happens very early. In our experiments with a newly created blank App, the process creation time is 12ms earlier than the execution of the +load method in the leaf-node dylib and 13ms earlier than the execution of the main function (test setup: iPhone 7 Plus (iOS 12.0), Xcode 10.0, Release mode). The gap is far larger for the online food delivery App. For the same model (iPhone 7 Plus) and system version (iOS 12.0), the process creation time is 688ms earlier than the execution of the +load method in the leaf-node dylib. Across all models and system versions, the figure is 878ms.

Time points within the cold start process

We also set a series of timing points at all the key nodes of the App's cold start process. Metrics records the name of each timing point and the time elapsed since process creation. We did not use automatic instrumentation, because the cold start process of the food delivery App is very complicated, and automatic instrumentation cannot be that fine-grained and would be impractical. In addition, Metrics records a set of ordered time points on a timeline whose origin is the process creation time, rather than a set of time periods. The distance between any two time points can always be calculated, so time points can be turned into time periods. A set of time periods, however, cannot necessarily be turned back into ordered time points, because the periods may not connect end to end, especially with asynchronous or multi-threaded execution.
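The advantage of recording time points rather than periods can be seen in a few lines. The point names and values below are hypothetical and do not come from Metrics; each value is the number of milliseconds since process creation.

```python
from typing import Dict


def elapsed_ms(points: Dict[str, float], start: str, end: str) -> float:
    """Duration between any two recorded time points."""
    return points[end] - points[start]


# Hypothetical time points, in ms since process creation (the origin).
points_ms = {
    "process_created": 0.0,
    "main": 700.0,
    "did_finish_launching": 950.0,
    "location_started": 960.0,
    "home_prerequest_started": 965.0,   # overlaps with location
    "location_done": 1300.0,
    "home_request_done": 1500.0,
    "home_rendered": 1800.0,
}

print(elapsed_ms(points_ms, "process_created", "main"))             # T1
print(elapsed_ms(points_ms, "main", "did_finish_launching"))        # T2
print(elapsed_ms(points_ms, "did_finish_launching", "home_rendered"))  # T3
print(elapsed_ms(points_ms, "home_prerequest_started", "home_request_done"))
```

Because location and the pre-request overlap, their two periods alone would not say when each one started relative to the other, whereas the time points keep that information.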

After the speed test is completed, Metrics will uniformly report all speed test points to the backend. The following figure is a screenshot of some process node monitoring data of Meituan Takeaway App version 6.10:

Metrics also aggregates the data on the backend to obtain the 50th, 90th, and 95th percentiles of the total cold start duration and of the duration at each timing point, so that we can understand the distribution of cold start duration from a macro perspective. In the figure below, the horizontal axis is the duration and the vertical axis is the number of reported samples.
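As a small illustration of this kind of aggregation, the sketch below computes percentiles with the nearest-rank method. The method is an assumption made here for simplicity (the way Metrics computes percentiles is not described), and the sample durations are invented.

```python
from typing import Sequence


def percentile(samples: Sequence[int], p: int) -> int:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100) with integers
    return ordered[rank - 1]


# Hypothetical cold start durations in ms reported by ten devices.
durations_ms = [1200, 1350, 1400, 1500, 1550, 1600, 1700, 1800, 2100, 3900]

for p in (50, 90, 95):
    print(p, percentile(durations_ms, p))
```

The high percentiles matter because one slow device barely moves an average but shows up clearly in the 95th percentile.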

10. Conclusion

For fast-iterating Apps, the cold start time will inevitably grow as business complexity increases, and the cold start process itself is relatively complex. When we hit a cold start performance bottleneck, we can optimize from multiple angles based on the characteristics of the App and with the help of tools. At the same time, fixing the existing cold start problems is only the first step in cold start governance, because cold start performance problems do not arise in a day and cannot be solved by a single round of optimization. We need to control the growth of performance problems through reasonable design and standardized constraints, and discover and correct performance problems promptly through continuous online monitoring, so as to ensure a good App cold start experience in the long run.

About the Author

Guo Sai, a senior engineer at Meituan-Dianping, joined Meituan in 2015 and is currently the main developer of the food delivery iOS team, responsible for mobile business development and the construction and maintenance of business infrastructure.

Xu Hong, a senior engineer at Meituan-Dianping, joined Meituan in 2016 and is currently the main developer of the food delivery iOS team, responsible for mobile APM performance monitoring and high-availability infrastructure support related promotion work.
