Android Native memory analysis solution based on Rust

Background: Amap Car Edition mostly runs on in-vehicle systems built on customized Android, and its underlying code is entirely native C/C++. We therefore need a general-purpose native memory profiling solution for Android. MemTower is based on the open source project memory-profiler, which we ported to Android, optimized, and extended. It resolves the pain points of our previous solution and covers general native memory profiling needs. The project is written in Rust and relies on several Rust features to hook native memory allocation.

1. Android Native memory analysis pain points and demands

This section explains why we built this solution and what goals we set for it.

1.1 Existing tool defects

Android has mature performance analysis tools at the Java level, but no complete solution at the native level. The main gaps are:

  • The existing tools do not support Android 4.x. Online statistics show that head units running 4.x still account for a large share, so this cannot be ignored.
  • The built-in malloc_debug facility behaves differently across Android versions, and most in-vehicle Android systems have been customized by vendors, so it cannot be guaranteed to be available.

Therefore, we cannot rely on the Android system's own facilities for native memory profiling.

Our team built an earlier solution in this area, but it still had the following problems:

  • It hooked the entry and exit of native functions by changing compiler flags, which caused severe performance degradation.
  • Because the analysis was intrusive, investigating a memory problem required building a separate instrumented package, which greatly slowed down troubleshooting. The cost of tracking down one memory leak was measured in days.
  • It lacked accurate memory usage data.

1.2 Create a complete Native memory performance analysis solution

Given these pain points, we wanted a complete native memory profiling solution. The specific requirements are:

  • Support for most Android systems, including Android 4.x.
  • Non-intrusive analysis: memory issues are detected and accurately located in the same run.
  • Good performance with low overhead.
  • Support for long-term memory leak stress testing. R&D teams, including those at car manufacturers, stress test navigation for long periods, so the tool must sustain long runs and still locate memory leaks.
  • Function-level memory usage data. The original solution focused on memory leaks, and the memory usage data it produced was not accurate enough. We wanted the new solution to provide detailed memory usage data to support memory performance optimization.

2. MemTower Solution

This section introduces how the memory-profiler project is implemented, how we ported it to Android as MemTower, and how we improved on the original. It also explains how we met the requirements above.

2.1 Choosing Rust & Memory-profiler

With these requirements in mind, we looked for a new solution. I was studying Rust at the time and found the memory-profiler project (hereinafter mp) by searching keywords on GitHub. Its author, koute, is a former Nokia engineer. That is how MemTower came into being. This section explains how mp uses Rust to implement memory profiling.

2.1.1 Hook Implementation

Native memory profiling usually works by hooking memory calls such as malloc and free, and mp follows the same principle. It uses LD_PRELOAD to preload a custom library that hooks the memory functions. The biggest problem with this approach is that it easily causes recursive malloc calls. As shown in the figure below, once the program's memory requests are hooked, the hook's own logic also allocates memory, which triggers the hook again and leads to endless malloc recursion and a stack crash.

mp takes advantage of Rust's customizable memory allocator. It uses jemalloc, formerly Rust's default allocator, as the custom allocator, and in the C code of jemalloc-sys it replaces the final mmap request with a custom function entry (which distinguishes mp's own mmap calls from the application's) that ultimately issues the mmap system call.

With Rust's own memory requests forwarded to the system call, the application's memory requests still need to be passed to the system libc. mp handles this through Rust feature switches, which select one of two ways to serve application memory requests. Both are implemented with Rust's link_name attribute:

  • Forward application memory requests directly to libc by using __libc_malloc as the link_name.
  • Use jemallocator's function entry _rjem_malloc as the link_name, so that the application and Rust share jemalloc.

Either way, the hook logic can use the full Rust language without worrying about recursive-call crashes caused by Rust's own code.
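The customizable allocator mentioned above is exposed in Rust through the GlobalAlloc trait and the #[global_allocator] attribute. The following is a minimal sketch of that mechanism only, not mp's code: it routes the Rust side's allocations through a wrapper around std::alloc::System instead of a patched jemalloc, and the names ProfilerAlloc and PROFILER_BYTES are hypothetical.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Total bytes ever requested by the Rust side (hypothetical counter).
pub static PROFILER_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical allocator that serves every Rust allocation in this binary.
pub struct ProfilerAlloc;

unsafe impl GlobalAlloc for ProfilerAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        PROFILER_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        // Assumption: forward to the system allocator.
        // mp forwards to its own jemalloc instance instead.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Every Box, Vec, String, and so on in this binary now goes through ProfilerAlloc.
#[global_allocator]
static GLOBAL: ProfilerAlloc = ProfilerAlloc;
```

Because the Rust code has its own allocation path, the hook logic can allocate freely without re-entering the hooked malloc of the application.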

2.1.2 High-performance stack unwinding

Besides using Rust's systems programming features to avoid recursive memory calls, the author also used Rust's performance to implement two high-performance stack unwinding methods:

  • Unwinding based on the information in the ELF .eh_frame section (the C++ exception handling mechanism).
  • Unwinding based on .ARM.exidx + .ARM.extab, the unwind tables defined by ARM.

For the implementation, see the author's crate not-perf. We use the second method as an illustration. As shown in the figure below, each thread keeps a stack frame cache in thread-local storage. The cache is filled from the unwind table information in the ELF files. When a stack frame misses the cache, the unwind table of the corresponding binary is loaded into memory; on a hit, no file needs to be read. Because the address space of a binary usually does not change after it is loaded, the cache is very effective. The drawback is that every thread holds a complete copy of the cache, so the system-wide memory overhead is large.
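The per-thread cache described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the not-perf implementation: UnwindTable, UnwindCache, and lookup_on_this_thread are hypothetical names, the table holds only an address range and a path instead of real unwind data, and the loader is passed in as a closure.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

/// Hypothetical stand-in for the parsed unwind table of one loaded binary.
#[derive(Clone, Debug, PartialEq)]
pub struct UnwindTable {
    pub start: usize,
    pub end: usize,
    pub path: String,
}

/// Cache of unwind tables, keyed by the start address of each binary.
#[derive(Default)]
pub struct UnwindCache {
    tables: BTreeMap<usize, UnwindTable>,
    pub hits: u64,
    pub misses: u64,
}

impl UnwindCache {
    /// Returns the table covering `addr`, calling `load` only on a cache miss.
    pub fn lookup(
        &mut self,
        addr: usize,
        load: impl FnOnce(usize) -> Option<UnwindTable>,
    ) -> Option<UnwindTable> {
        // The candidate is the table with the greatest start address <= addr.
        if let Some((_, table)) = self.tables.range(..=addr).next_back() {
            if addr < table.end {
                self.hits += 1;
                return Some(table.clone());
            }
        }
        self.misses += 1;
        let table = load(addr)?; // e.g. read the unwind section from the ELF file
        self.tables.insert(table.start, table.clone());
        Some(table)
    }
}

thread_local! {
    // One complete cache per thread: no locking, but memory cost grows with thread count.
    static CACHE: RefCell<UnwindCache> = RefCell::new(UnwindCache::default());
}

/// Looks up `addr` in the calling thread's cache.
pub fn lookup_on_this_thread(
    addr: usize,
    load: impl FnOnce(usize) -> Option<UnwindTable>,
) -> Option<UnwindTable> {
    CACHE.with(|cache| cache.borrow_mut().lookup(addr, load))
}
```

Keeping the cache thread-local avoids synchronization on the hot path, at the price of the duplicated memory noted above.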

2.1.3 Powerful data analysis capabilities

As mp's project page shows, besides the profiler itself there is a matching data analysis server, built on the actix-web framework, with very capable analysis features. The main ones are:

  • Time-series curves for both memory usage and leaks, which are very intuitive.
  • A powerful filter that supports queries across multiple dimensions such as memory lifetime, function, and time, together with a memory flame graph for the filtered result.
  • RESTful APIs for all features, which makes customization easy.

Detailed usage instructions are not covered here.


2.2 Migration

With the basic principles of mp covered, this section explains the problems (pitfalls) we ran into while porting it to the Android platform.

2.2.1 Custom Allocator

mp's hook solution has several problems on the Android platform:

  • jemalloc was only introduced to Android in Android 5.0. The jemalloc-sys bundled with mp can put two jemallocs into one application, which shows up as assorted crashes on different versions and makes troubleshooting difficult.
  • __libc_malloc is an alias for the malloc entry point provided by glibc, and has no counterpart on the Android platform.

We therefore fall back to the most basic approach: obtain the entry points of the memory functions with dlsym and wrap them in a Rust allocator. The application's memory requests use the same function addresses. As shown in the following figure, all memory requests are ultimately passed to libc, so Rust's hook logic is transparent to libc.
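The dlsym lookup described above might look like the following sketch. It is an assumption-laden illustration rather than MemTower's code: LibcAlloc is a hypothetical name, only malloc and free are resolved, the RTLD_NEXT value of -1 is the one used by glibc and 64-bit bionic (other platforms may differ), and the `unsafe extern` syntax assumes Rust 1.82 or later.

```rust
use std::ffi::{c_char, c_void};

// Declared by hand to avoid an extra crate; dlsym comes from libc/libdl.
// On toolchains older than Rust 1.82, write `extern "C"` instead.
unsafe extern "C" {
    fn dlsym(handle: *mut c_void, symbol: *const c_char) -> *mut c_void;
}

// Assumption: glibc / 64-bit bionic value. Means "the next definition after this object".
const RTLD_NEXT: *mut c_void = -1isize as *mut c_void;

type MallocFn = unsafe extern "C" fn(usize) -> *mut c_void;
type FreeFn = unsafe extern "C" fn(*mut c_void);

/// Hypothetical holder for the real libc entry points.
pub struct LibcAlloc {
    malloc: MallocFn,
    free: FreeFn,
}

impl LibcAlloc {
    /// Resolves the real malloc/free; returns None if either symbol is missing.
    pub fn resolve() -> Option<Self> {
        unsafe {
            let m = dlsym(RTLD_NEXT, b"malloc\0".as_ptr() as *const c_char);
            let f = dlsym(RTLD_NEXT, b"free\0".as_ptr() as *const c_char);
            if m.is_null() || f.is_null() {
                return None;
            }
            Some(LibcAlloc {
                // Data pointer to function pointer: same size on the targets assumed here.
                malloc: std::mem::transmute::<*mut c_void, MallocFn>(m),
                free: std::mem::transmute::<*mut c_void, FreeFn>(f),
            })
        }
    }

    /// Allocates `size` bytes straight from libc, bypassing any hook.
    pub fn alloc(&self, size: usize) -> *mut u8 {
        unsafe { (self.malloc)(size) as *mut u8 }
    }

    /// Returns memory obtained from `alloc` to libc.
    pub fn dealloc(&self, ptr: *mut u8) {
        unsafe { (self.free)(ptr as *mut c_void) }
    }
}
```

A full allocator would also resolve calloc, realloc, and the aligned allocation functions, and would handle alignments larger than malloc guarantees before being exposed through GlobalAlloc.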

2.2.2 Stack Backtrace

Stack backtracing also needed porting changes. As mentioned above, the author provides an unwinding method based on the C++ exception handling mechanism, but it depends on the C++ library, which becomes a default dependency only from Android 8.0 onward. On versions before 8.0, the application itself would have to depend on the C++ library. We therefore removed this unwinding method and dropped the dependency.

2.2.3 Address Space Reload

When the program starts or calls dlopen/dlclose, the linker loads (or unloads) ELF files, and the program's address space changes accordingly. The address space recorded in the unwinding cache may then be stale and must be reloaded. A reload scans the entire address space for changes, which is very expensive, so a cheap way to detect address space changes is also needed. mp mainly offers two ways to do this:

  • The dl_iterate_phdr interface provided by libc. It does not exist below Android API level 21 (that is, before 5.0), and the structure it uses on 5.0 differs from the one in later Android versions. The single C structure layout defined on the Rust side therefore reads dirty data as the basis for reloading, which results in very frequent reloads.
  • perf's PERF_RECORD_MMAP2 event. It requires a kernel version greater than 3.16, so it is not available on Android 4.x either.

In practice, once all dependent ELFs have been loaded, the address space rarely changes. We therefore changed the logic to reload the address space only when a new ELF is loaded. Flame graphs show that this greatly reduces the computational cost of the hook.
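One possible way to express "reload only when a new ELF is loaded" is a global generation counter that each thread compares against its own last-seen value. The sketch below assumes this design for illustration; it is not taken from MemTower, and note_elf_loaded and reload_if_stale are hypothetical names (the former would be called from wherever ELF loading is observed, for example a dlopen hook).

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

/// Incremented each time a new ELF is mapped into the process.
static MAP_GENERATION: AtomicU64 = AtomicU64::new(0);

/// Hypothetical notification point, e.g. called after a hooked dlopen returns.
pub fn note_elf_loaded() {
    MAP_GENERATION.fetch_add(1, Ordering::Release);
}

thread_local! {
    // Generation this thread's address space snapshot was built against.
    // u64::MAX forces the first call on every thread to reload.
    static SEEN_GENERATION: Cell<u64> = Cell::new(u64::MAX);
}

/// Runs the expensive `reload` only if an ELF was loaded since this thread
/// last reloaded. Returns true when a reload happened.
pub fn reload_if_stale(reload: impl FnOnce()) -> bool {
    let current = MAP_GENERATION.load(Ordering::Acquire);
    SEEN_GENERATION.with(|seen| {
        if seen.get() == current {
            return false; // Common case: one atomic load and one comparison.
        }
        reload(); // e.g. rescan the process mappings and rebuild the unwind cache
        seen.set(current);
        true
    })
}
```

With this shape, the steady-state cost per hooked allocation is a single atomic load, and the full address space scan runs only after the generation changes.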

2.3 Improvements

At this point, MemTower runs correctly on Android versions that support LD_PRELOAD (including 4.x). However, one of the requirements above is still unmet: long-term memory leak stress testing. We also wanted more dimensions of information during data analysis. This section introduces our improvements to MemTower.

2.3.1 Memory Leak Stress Test

As its name suggests, mp was designed as a memory profiler that records complete memory information, and that determines its data volume. In several business scenarios with a one-hour stress test, the generated sample files ranged from 1 GB to 7 GB depending on memory usage. Data at this scale does not meet our needs.

We therefore added a memory leak detection mode (ONLY_LEAKED), which works as follows:

  • Each stack frame captured at allocation time is recorded in a trie, together with the size of the allocated memory.
  • When memory is released, the corresponding trie node is updated. If the current leak reaches a threshold (for example 100 MB), sampling stops.
  • When sampling ends, the unreleased memory records stored in the trie are written to a file.

The advantage of this mode is that the final data volume is very small. The data file for a one-hour stress test is between 100 and 200 MB, and less than 100 MB after compression with the postprocess subcommand provided by mp. The drawback is that MemTower has to keep the full history of stacks in memory. Once no new stack frames appear, this memory growth levels off.
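The three steps of the leak detection mode can be sketched as a small trie keyed by frame address. This is a simplified illustration, not MemTower's data structure: LeakTrie and its methods are hypothetical names, stacks are given outermost frame first, and pointers and frames are plain usize values.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    frame: usize,
    parent: Option<usize>,
    children: HashMap<usize, usize>, // frame address -> node index
    live_bytes: u64,                 // unreleased bytes allocated at this exact stack
}

/// Trie of call stacks that tracks only memory that has not been freed yet.
pub struct LeakTrie {
    nodes: Vec<Node>,                   // index 0 is the root
    live: HashMap<usize, (usize, u64)>, // pointer -> (node index, size)
    total_live: u64,
}

impl LeakTrie {
    pub fn new() -> Self {
        LeakTrie { nodes: vec![Node::default()], live: HashMap::new(), total_live: 0 }
    }

    /// Step 1: walk or extend the trie along `stack` and record the size.
    pub fn on_alloc(&mut self, ptr: usize, size: u64, stack: &[usize]) {
        let mut cur = 0;
        for &frame in stack {
            cur = match self.nodes[cur].children.get(&frame).copied() {
                Some(id) => id,
                None => {
                    let id = self.nodes.len();
                    self.nodes.push(Node { frame, parent: Some(cur), ..Node::default() });
                    self.nodes[cur].children.insert(frame, id);
                    id
                }
            };
        }
        self.nodes[cur].live_bytes += size;
        self.live.insert(ptr, (cur, size));
        self.total_live += size;
    }

    /// Step 2: on free, subtract the size from the node that allocated it.
    pub fn on_free(&mut self, ptr: usize) {
        if let Some((node, size)) = self.live.remove(&ptr) {
            self.nodes[node].live_bytes -= size;
            self.total_live -= size;
        }
    }

    /// Sampling stops once the unreleased total reaches `limit` bytes.
    pub fn over_threshold(&self, limit: u64) -> bool {
        self.total_live >= limit
    }

    /// Step 3: collect every stack that still owns unreleased memory.
    pub fn leaks(&self) -> Vec<(Vec<usize>, u64)> {
        let mut out = Vec::new();
        for (id, node) in self.nodes.iter().enumerate() {
            if node.live_bytes == 0 {
                continue;
            }
            let mut stack = Vec::new();
            let mut cur = Some(id);
            while let Some(i) = cur {
                if i != 0 {
                    stack.push(self.nodes[i].frame);
                }
                cur = self.nodes[i].parent;
            }
            stack.reverse();
            out.push((stack, node.live_bytes));
        }
        out
    }
}
```

Trie nodes are never removed in this sketch, which mirrors the drawback described above: memory grows with the number of distinct stacks and levels off once no new stacks appear.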

2.3.2 Enhanced Analysis Filters

Navigation is divided into many business modules and threads, so we added filter options for regular-expression filtering by thread and by library.

2.3.3 Improvement of memory flame graph

The original mp used allocated memory size (allocated) as the only flame graph dimension. For memory performance analysis, the number of allocations (allocations) is an equally important indicator, so we added a flame graph of allocation counts. This was our earliest improvement, and because the resulting flame graph looked like a tower, the project was renamed Memory Tower (MemTower).

Finally, the flame graph in the original solution is not split by thread. Splitting the stack information by thread makes it more intuitive.

Allocation count flame graph

Allocation size flame graph
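The two flame graph dimensions, split by thread, can be illustrated by aggregating samples into the folded-stack text format ("frame;frame;... value") that common flame graph renderers accept. This is a sketch under assumptions rather than MemTower's server code: Sample, Metric, and fold are hypothetical names, and placing the thread name as the root frame is one possible way to split by thread.

```rust
use std::collections::BTreeMap;

/// One recorded allocation (hypothetical shape).
pub struct Sample {
    pub thread: String,
    pub stack: Vec<String>, // outermost frame first
    pub bytes: u64,
}

/// Which quantity the flame graph width represents.
#[derive(Clone, Copy)]
pub enum Metric {
    Bytes, // allocated size
    Count, // number of allocations
}

/// Aggregates samples into sorted folded-stack lines, with the thread as root frame.
pub fn fold(samples: &[Sample], metric: Metric) -> Vec<String> {
    let mut totals: BTreeMap<String, u64> = BTreeMap::new();
    for sample in samples {
        let mut key = sample.thread.clone();
        for frame in &sample.stack {
            key.push(';');
            key.push_str(frame);
        }
        let weight = match metric {
            Metric::Bytes => sample.bytes,
            Metric::Count => 1,
        };
        *totals.entry(key).or_insert(0) += weight;
    }
    totals.into_iter().map(|(stack, value)| format!("{stack} {value}")).collect()
}
```

The same samples produce both views: a stack that allocates many small blocks is narrow under Metric::Bytes but wide under Metric::Count, which is exactly the case the allocation count flame graph is meant to expose.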

3. MemTower Capabilities and More Possibilities

This last section describes the capabilities, benefits, and further possibilities that MemTower provides.

3.1 Capabilities

On Android 8.0 and below, MemTower relies on setprop wrap.com.xxx.xxx and root permissions. On versions above 8.0 without root permissions, the MemTower library can instead be loaded by configuring wrap.sh in the Android project. In addition, because mp natively supports Linux, we have also successfully adapted MemTower to head units running embedded Linux, such as the Mercedes-Benz (Daimler) project.

  • Supported platforms: Android 4.x, 5.1.1, and 7 or higher (5.0 and 6 have a bug that prevents setprop from working); Linux on x86_64, AArch64, and Arm.
  • Sampling method: non-intrusive. Non-root devices can use the intrusive method.
  • Sampling modes: general performance analysis mode and memory leak stress testing mode.
  • Features: high-performance stack unwinding and a complete memory analysis experience (multi-dimensional filtering, memory flame graphs, and more).

Previously, from discovering a memory leak through repackaging, a second round of stress testing, and analysis, to inferring the likely leak point, the process was measured in days. With MemTower, a single test run yields fine-grained data that can be analyzed in a few minutes, which greatly reduces the cost of analyzing memory performance problems. The hooking approach and high-performance stack unwinding provided by mp are not limited to memory analysis; they could also be applied to I/O performance analysis and other problems.