Background: The in-vehicle systems on which Amap Car Edition runs are mostly customized Android systems, and the underlying code of Amap Car Edition is entirely Native C/C++. A general-purpose Native memory performance analysis solution for Android is therefore needed. MemTower is such a solution: it is based on the open source project memory-profiler, ported to Android, and then optimized and improved. It resolves the pain points of our previous solution and meets general Native memory performance analysis needs. The project is written in Rust and uses several Rust features to hook Native memory allocation requests.

1. Android Native memory analysis pain points and demands

This section explains why we built MemTower and what goals we expected to achieve.

1.1 Existing tool defects

Android has very complete performance analysis tools at the Java level, but no complete solution at the Native level. This is mainly reflected in:
Native memory performance analysis therefore cannot be done with the Android system's built-in facilities alone. Our team had made some progress on this before, but the following problems remained:
1.2 Create a complete Native memory performance analysis solution

Given the pain points above, we wanted a complete Native memory performance analysis solution. The specific demands are as follows:
2. MemTower Solution

This section introduces how the memory-profiler project is implemented, how we ported it to the Android platform as MemTower, and how we improved on the original. It also explains how we met the demands above.

2.1 Choosing Rust & Memory-profiler

In response to these demands, we looked for a new solution. I was studying Rust at the time, and found the memory-profiler (hereinafter referred to as mp) project by searching for keywords on GitHub. Its author, koute, is a former Nokia engineer. That is how Memory Tower came into being. This section explains how mp uses Rust to implement memory profiling.

2.1.1 Hook Implementation

Native memory performance analysis usually works by hooking memory calls such as malloc and free, and mp follows the same principle. It uses LD_PRELOAD to preload a custom library that hooks the memory functions. The biggest problem with this approach is that it easily causes recursive malloc calls. As shown in the figure below, once the program's memory requests are hooked, the hook's own bookkeeping also requests memory, which re-enters the hook. The result is a malloc call loop and a stack overflow crash.

mp's approach takes advantage of Rust's customizable memory allocator (Allocator). It uses jemalloc, formerly Rust's default allocator, as the custom allocator, and in the C code of jemalloc-sys it replaces the final mmap call with a custom function entry (thereby distinguishing the application's mmap calls from its own), which in turn invokes the mmap system call.

With Rust's own memory requests forwarded to the system call, the application's memory requests still need to be passed on to the system libc. mp does this through a Rust feature switch, which lets you choose between two ways of handling application memory requests.
Both methods are implemented by specifying the link_name attribute in Rust:
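As a minimal sketch of what the link_name attribute does: it lets a Rust declaration bind to a linker symbol whose name differs from the Rust-side name, so the hook library can keep a private handle on the "real" allocator. The symbols mp actually binds are not listed in this article; the sketch below simply binds the platform's malloc and free, and the names real_malloc, real_free and roundtrip are illustrative assumptions.

```rust
use std::os::raw::c_void;

// `link_name` selects the symbol the linker resolves; the Rust-side names
// (`real_malloc`, `real_free`) are hypothetical and chosen for illustration.
// Binding plain `malloc`/`free` is an assumption; a profiler would point
// these at whichever allocator entry points its feature switch selects.
unsafe extern "C" {
    #[link_name = "malloc"]
    fn real_malloc(size: usize) -> *mut c_void;
    #[link_name = "free"]
    fn real_free(ptr: *mut c_void);
}

/// Allocates `len` bytes through the bound symbol, fills them with `byte`,
/// reads the last byte back, and frees the buffer.
/// Returns `None` for a zero length or a failed allocation.
pub fn roundtrip(len: usize, byte: u8) -> Option<u8> {
    if len == 0 {
        return None;
    }
    unsafe {
        let p = real_malloc(len) as *mut u8;
        if p.is_null() {
            return None;
        }
        std::ptr::write_bytes(p, byte, len);
        let last = *p.add(len - 1);
        real_free(p as *mut c_void);
        Some(last)
    }
}
```

Because the binding is resolved at link time, switching between two allocator backends only requires placing each extern block behind a different cfg(feature = "...") gate; the calling code stays unchanged.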
2.1.2 High-performance stack unwinding

Besides using Rust's systems programming features to avoid recursive memory calls, the author also used Rust's performance to implement several high-performance stack unwinders. One uses the unwind information provided by the .eh_frame section of ELF (the C++ exception handling mechanism). Another unwinds based on .ARM.exidx + .ARM.extab, the unwind tables provided by ARM. For the specific implementation, please refer to the author's crate not-perf.

Here we use the second method as the illustration. As shown in the figure below, a stack frame cache is maintained for each thread's stack using thread local storage. The cache is populated from the unwind table information in the ELF files. When a stack frame misses the cache, the unwind table of the corresponding binary is loaded into memory; when it hits, no file read is needed. Because the address space of a binary usually does not change after it is loaded, the cache is very efficient. The disadvantage is that every thread holds a complete set of caches, so the memory overhead across the whole system is very large.

2.1.3 Powerful data analysis capabilities

From the mp project page, we can see that besides the memory profiler there is also a data analysis server, built on the actix-web framework, with very powerful analysis functions. The main features are as follows:
Detailed usage instructions are not covered here.

2.2 Migration

With the basic principles of mp understood, this section explains the various problems (pitfalls) we encountered while porting it to the Android platform.

2.2.1 Custom Allocator

mp's Hook solution has many problems on the Android platform, mainly reflected in the following points:
Therefore, we use the most basic approach, dlsym, to obtain the entry points of the memory functions, and then wrap them in a Rust Allocator. The application's memory requests use the same function addresses. As shown in the following figure, all memory requests are eventually passed to libc, so Rust's business code is transparent to libc.

2.2.2 Stack Backtrace

Stack backtracing also needed some porting changes. As mentioned above, the author provides an unwinding method based on the C++ exception handling mechanism, but it depends on the C++ library, which became a default dependency only after Android 8.0. On versions before 8.0, the application itself would also have to depend on the C++ library. We therefore removed this unwinding method and dropped the dependency.

2.2.3 Address Space Reload

When the program starts, or calls dlopen/dlclose, the linker loads (or unloads) ELF files and the program's address space changes accordingly. The address space information held in the unwinding cache may then be stale and must be reloaded. A reload scans the entire address space for changes, which is very costly, so a low-cost way of detecting address space changes is also needed.

mp has two main ways of doing this, and each has a problem on older Android. The first is the dl_iterate_phdr interface provided by libc, which does not exist at Android API levels below 21 (i.e. before 5.0). In addition, the structure used by this function from 5.0 onward differs from the implementation in later Android versions, so the single C structure layout defined on the Rust side causes dirty data to be read as the basis for reloading, which triggers very frequent reloads. The second is perf's PERF_RECORD_MMAP2 event, which requires a kernel version later than 3.16 and so is not available on Android 4.x either.

In actual operation, after all dependent ELFs are loaded, the address space rarely changes.
Therefore, we changed it to reload the address space only when a new ELF is loaded. The flame graph results show that this greatly reduces the computational cost of the Hook.

2.3 Improvements

At this point, Memory Tower runs correctly on Android versions that support LD_PRELOAD (including 4.x). However, one of the demands above is still unmet: long-running memory leak stress testing. We also wanted more dimensions of information during data analysis. This section therefore introduces our improvements to Memory Tower.

2.3.1 Memory Leak Stress Test

As its name indicates, mp was originally positioned as a memory performance analysis tool that records complete memory information, and that determines its data volume. Across multiple business scenarios, a one-hour stress test generated sampling data files ranging from 1 GB to 7 GB depending on memory usage. That volume does not meet business needs. We therefore added a memory leak detection mode (ONLY_LEAKED). The principle of this mode is as follows:
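As a rough illustration only, and not MemTower's actual implementation, a leak-only mode can be pictured like this: backtraces are interned in memory, each live allocation is kept in a table keyed by address, a free removes the entry, and only the entries that remain are reported. Every name here (LeakTracker, on_alloc, on_free, leaked) is a hypothetical choice for the sketch, and addresses and stack frames are plain integers to keep it self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical leak-only tracker: no record is written per event;
/// only allocations that are never freed are reported.
#[derive(Default)]
pub struct LeakTracker {
    stacks: Vec<Vec<usize>>,               // interned backtraces, indexed by stack id
    stack_ids: HashMap<Vec<usize>, usize>, // backtrace -> stack id
    live: HashMap<usize, (usize, usize)>,  // address -> (size, stack id)
}

impl LeakTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the id of `stack`, storing it only the first time it is seen.
    fn intern(&mut self, stack: &[usize]) -> usize {
        if let Some(&id) = self.stack_ids.get(stack) {
            return id;
        }
        let id = self.stacks.len();
        self.stacks.push(stack.to_vec());
        self.stack_ids.insert(stack.to_vec(), id);
        id
    }

    /// Records a live allocation together with its (interned) backtrace.
    pub fn on_alloc(&mut self, addr: usize, size: usize, stack: &[usize]) {
        let id = self.intern(stack);
        self.live.insert(addr, (size, id));
    }

    /// Forgets an allocation once it is freed; unknown addresses are ignored.
    pub fn on_free(&mut self, addr: usize) {
        self.live.remove(&addr);
    }

    /// Allocations still live, as (address, size, backtrace), sorted by address.
    pub fn leaked(&self) -> Vec<(usize, usize, Vec<usize>)> {
        let mut out: Vec<_> = self
            .live
            .iter()
            .map(|(&addr, &(size, id))| (addr, size, self.stacks[id].clone()))
            .collect();
        out.sort_by_key(|entry| entry.0);
        out
    }

    /// Number of distinct backtraces held in memory.
    pub fn unique_stacks(&self) -> usize {
        self.stacks.len()
    }
}
```

In this sketch the output size depends on how many allocations are still live at the end of the run rather than on how many events occurred, while the interned backtraces stay resident for the whole run.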
The advantage of this mode is that the final data volume is very small: the data file for a one-hour stress test is between 100 and 200 MB, and after compression with the postprocess subcommand provided by mp it is under 100 MB. The disadvantage is that Memory Tower has to cache the full history of stack data in memory, although memory growth levels off once no new stack frame records appear.

2.3.2 Enhanced Analysis Filters

The navigation app is divided into many business modules and threads, so we added filter options for regular-expression filtering by thread and by library.

2.3.3 Improvement of memory flame graph

The original mp solution used allocated memory size (allocated) as the flame graph dimension. When analyzing memory performance, the number of allocations (allocations) is also a very important indicator, so we added a flame graph of allocation counts. This was the earliest improvement we made, and because the resulting flame graph was shaped like a tower, the project was renamed Memory Tower (MemTower). Finally, the flame graph in the original solution is not split by thread; splitting the stack information by thread makes it more intuitive.

Figure: Allocation count flame graph
Figure: Allocation size flame graph

3. Memory Tower Capabilities and More Possibilities

This final section describes the capabilities, benefits, and possibilities that Memory Tower provides.

3.1 Capabilities

On Android 8.0 and below, MemTower relies on setprop wrap.com.xxx.xxx and root permissions. On versions above 8.0 without root permissions, the MemTower library can also be loaded by configuring wrap.sh in the Android project. In addition, because mp natively supports Linux, we have also successfully adapted MemTower to embedded Linux head-unit projects such as Mercedes-Benz (Daimler).
Previously, from discovering a memory leak, through repackaging, a second round of stress testing and analysis, to inferring the likely leak point, the process was measured in days. With Memory Tower (MemTower), a single test run yields refined data that can be analyzed in a few minutes, greatly reducing the cost of analyzing memory performance problems. The Hook approach and high-performance stack unwinding provided by mp are not limited to memory analysis; they could also be applied to IO performance analysis or other problems.