MemoryThrashing: TikTok Live Broadcasting Solution to Memory Thrashing

Author: Wang Haichao

Background

OOM (out-of-memory) problems in live broadcast are hard to locate, mainly because many business modules are involved and it takes a long time to get from locating a problem to fixing it. To reach the root of a problem earlier, make locating it more efficient, and complement the existing tools, we propose MemoryThrashing, a memory jitter monitoring solution for live broadcast.

Why propose this plan?

The existing "MemoryGraph" tool can analyze the cause of an OOM, such as a memory leak or excessive memory usage, from a captured "MemoryGraph" file. However, because of its large performance overhead, it is either sampled and reported at a low rate or enabled only for known users, which makes problems hard to catch. We want a tool that can detect problems while memory is growing, can also be used for analysis after an OOM occurs, has low performance overhead, and can run on all users rather than on a sample;
A "MemoryGraph" is not necessarily captured at peak memory. For example, on a device with 4 GB of memory, the captured "MemoryGraph" may reflect only about 1 GB of usage, which limits its value for OOM analysis;

What is Thrashing?

Wikipedia defines thrashing as:

In computer science, thrashing occurs when a computer's virtual memory resources are overused, leading to a constant state of paging and page faults, inhibiting most application-level processing.([1]) This causes the performance of the computer to degrade or collapse.

Define memory thrashing from a business perspective:

In plain terms, thrashing means that performance data fluctuates sharply. For example, when memory grows from 600 MB to 800 MB in a short period of time, we call it a jitter, and we want our self-developed tool to show where the extra 200 MB came from. In real OOM cases, a sudden memory increase is a fairly common cause. It shows up in two ways:

Memory does not fall back: The surge usually happens within one or two minutes, for example from 1 GB to 3 GB. This memory is not released, or cannot be released, and stays resident, which leads to OOM. It also raises the memory water level, which makes later OOMs more likely.
Memory falls back: Memory rises suddenly to a certain level and then drops without causing OOM. This usually means that the problem has not yet deteriorated far enough, or that the device has enough memory and is not prone to OOM. Although it does not cause OOM, it is still a potential problem.

This article uses temporary objects and memory accumulation as examples of how to locate this type of problem. The "AllocTime Summary" describes how many times temporary objects are allocated, and the "Memory Summary" describes memory accumulation.

Temporary Objects

Temporary objects: A large number of objects are allocated in a short period of time, which makes live broadcast less stable and can increase memory and CPU load. This type of problem usually shows up as a memory surge that either leads directly to OOM or quickly falls back to a normal level. Such objects do not stay in memory for long. Monitoring temporary objects allows such problems to be discovered in advance.

The above shows the top temporary objects ranked by allocation count (AllocTime Summary). "AllocTime Summary 1" holds the per-Class allocation counts from the first sample, and so on for the others. For example, diffing "AllocTime Summary 2" against "AllocTime Summary 1" shows that "LivexxxA" was allocated 7803 times during the sampling period. Since no "Memory Summary" information was collected for it, these objects can be considered not resident in memory.

Memory Accumulation

Memory accumulation: A large number of objects stay resident in memory and are not released within a short period of time. This keeps the memory level high and can easily trigger OOM.

The above shows the top classes ranked by resident instance count. "Memory Summary 1" holds the number of resident instances per Class from the first sample, and so on for the others. For example, diffing "Memory Summary 2" against "Memory Summary 1" shows that "LivexxxA" grew by 56791 instances during the sampling period, and the last sample shows a total of 69904 instances resident in memory. Across the samples, "LivexxxA" grows every time.
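
The difference between the two patterns can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not the tool's code: the function `classify`, its inputs, the `min_allocs` threshold, and the class "LivexxxOther" are assumptions, while the other class names and counts echo the examples in this article.

```python
def classify(alloc_delta, resident_delta, min_allocs=1000):
    """Label each class by how it behaved between two samples.

    alloc_delta:    class name -> allocations during the sampling period
    resident_delta: class name -> change in resident instances in that period
    min_allocs:     hypothetical noise threshold, not taken from the tool
    """
    result = {}
    for cls, allocs in alloc_delta.items():
        if allocs < min_allocs:
            continue  # too few allocations to matter
        if resident_delta.get(cls, 0) > 0:
            result[cls] = "memory accumulation"  # allocated and still resident
        else:
            result[cls] = "temporary object"     # allocated, then released
    return result


# Illustrative deltas; "LivexxxOther" is a made-up class below the threshold.
alloc_delta = {"LivexxxA": 7803, "LivexxxBigDataRead": 234024, "LivexxxOther": 12}
resident_delta = {"LivexxxBigDataRead": 234024}

print(classify(alloc_delta, resident_delta))
```

A class with many allocations but no growth in resident instances is reported as a temporary object, while one whose resident count grows is reported as memory accumulation.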

MemoryThrashing Solution

Solution Research

The idea is to find what grew by diffing memory. Memory information (currently mainly the number of instances per Class) is sampled at several points in time, and the samples are diffed to find the TOP growth and attribute it.
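
As a rough sketch of this diff step, the Python below compares the first and last samples and ranks classes by growth. The function name `top_growth`, the sample layout, and the classes "LivexxxB" and "LivexxxC" are assumptions made for illustration. The "LivexxxA" counts are chosen so that the growth matches the 56791 figure quoted earlier.

```python
from collections import Counter


def top_growth(samples, top_n=3):
    """Diff the first and last samples and return the fastest-growing classes.

    samples: list of dicts mapping class name -> live instance count,
             ordered by sampling time.
    """
    first, last = Counter(samples[0]), Counter(samples[-1])
    # A Counter returns 0 for classes that are missing from a sample.
    growth = {cls: last[cls] - first[cls] for cls in set(first) | set(last)}
    grown = [(cls, delta) for cls, delta in growth.items() if delta > 0]
    # Largest growth first; ties are broken by class name for stable output.
    grown.sort(key=lambda item: (-item[1], item[0]))
    return grown[:top_n]


# Hypothetical samples taken at two points in time.
samples = [
    {"LivexxxA": 13113, "LivexxxB": 200, "LivexxxC": 50},
    {"LivexxxA": 69904, "LivexxxB": 260, "LivexxxC": 40},
]

print(top_growth(samples))  # [('LivexxxA', 56791), ('LivexxxB', 60)]
```

Classes whose count shrank, such as "LivexxxC" here, are dropped, so the report contains only candidates for attribution.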

Two ways of obtaining the instance counts were investigated:

Memory area: traverse the memory nodes and count the instances of each Class;
Runtime: count surviving instances by counting alloc and dealloc calls;

Memory Area

This approach traverses the memory nodes and matches them against the registered Classes to count instances. Its advantage is that it can monitor the number of OC object instances across the entire app. For the live broadcast scenario, however, monitoring the objects of the entire app is not needed at present. The requirement is to monitor the live broadcast scene under certain conditions, for example when memory fluctuates sharply after the user has watched a live broadcast for a while, so the scope is more focused. Another consideration is that traversing the zone takes longer when memory usage is high, and if threads are not suspended during the traversal, there is a risk of crashes and inaccurate data.

Runtime

This approach hooks object allocation and release and counts both per Class, which gives the number of surviving instances. It can monitor the growth of OC instances in a fixed scenario, such as a sudden memory increase in the live broadcast room. The scope is relatively small, so there is no need to count large numbers of irrelevant objects. It takes less time than traversing the memory area and has no wild pointer problem. However, the performance impact of monitoring objects needs attention. The Runtime approach is the one currently used. In offline tests in the live broadcast room, its impact on the main thread was negligible.
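
The counting itself can be modelled as follows. This is a language-neutral sketch in Python, not the hook code: in the real tool the two callbacks would be driven by hooked OC allocation and release, and the class `InstanceCounter` and its method names are assumptions. The mapping of the two counters onto the "AllocTime Summary" and "Memory Summary" is likewise an interpretation of the description above.

```python
from collections import Counter


class InstanceCounter:
    """Tracks allocation counts and surviving instances per class name."""

    def __init__(self):
        self.alloc_times = Counter()  # total allocations (cf. AllocTime Summary)
        self.live = Counter()         # surviving instances (cf. Memory Summary)

    def on_alloc(self, cls_name):
        self.alloc_times[cls_name] += 1
        self.live[cls_name] += 1

    def on_dealloc(self, cls_name):
        self.live[cls_name] -= 1
        if self.live[cls_name] <= 0:
            # Also covers objects allocated before monitoring started.
            del self.live[cls_name]


counter = InstanceCounter()
for _ in range(5):
    counter.on_alloc("LivexxxA")
for _ in range(3):
    counter.on_dealloc("LivexxxA")

print(counter.alloc_times["LivexxxA"], counter.live["LivexxxA"])  # 5 2
```

Because only counters are updated, the cost per event is constant and no memory is traversed, which is in line with the lower time cost described above.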

Solution Design

During development, we found that objects are created and released in a complex multi-threaded environment. Handling this badly can affect the business, either by slowing down business code or by causing stability problems:

Containers that hold the counts have thread-safety issues when accessed from multiple threads;
Excessive use of locks blocks business code and may also trigger the Watchdog mechanism, causing the app to be killed;

After optimization, a multi-level cache is used to solve the performance overhead on the main thread, bringing that overhead close to zero.
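
The article does not describe how the multi-level cache is built. The Python sketch below shows one plausible shape under stated assumptions: each thread writes to its own buffer without locking (level 1), and the buffer is merged into shared totals under a lock only once per batch (level 2). The class `TwoLevelCounter`, the `flush_every` batch size, and the flush policy are all hypothetical.

```python
import threading
from collections import Counter


class TwoLevelCounter:
    """Level 1: per-thread buffer, no lock. Level 2: shared totals, locked."""

    def __init__(self, flush_every=64):
        self._local = threading.local()
        self._lock = threading.Lock()
        self._totals = Counter()
        self._flush_every = flush_every  # hypothetical batch size

    def record(self, cls_name, delta):
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = self._local.buf = Counter()
            self._local.pending = 0
        buf[cls_name] += delta  # hot path: thread-local, lock-free
        self._local.pending += 1
        if self._local.pending >= self._flush_every:
            self.flush()

    def flush(self):
        buf = getattr(self._local, "buf", None)
        if not buf:
            return
        with self._lock:  # taken once per batch, not once per event
            self._totals.update(buf)
        buf.clear()
        self._local.pending = 0

    def snapshot(self):
        with self._lock:
            return dict(self._totals)


def worker(counter, n):
    for _ in range(n):
        counter.record("LivexxxA", +1)
    counter.flush()  # push the remainder before the thread exits


counter = TwoLevelCounter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.snapshot())  # {'LivexxxA': 4000}
```

This sketch reduces lock traffic rather than eliminating it. A design in which the main thread never takes a lock would need its buffer to be drained by another thread, which is not shown here.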

Monitoring Process

Monitoring is turned on some time after the user enters the live broadcast room. Changes in the memory value determine whether sampling is turned on. Once sampling is on, several samples are taken in succession, and the data is reported when they are complete. After reporting, memory monitoring continues.
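
This process can be read as a small state machine. The Python sketch below is a hedged illustration: the class `ThrashingMonitor`, the 200 MB trigger, the three samples per round, and the callback names are assumptions, and the initial delay after entering the live broadcast room is left out.

```python
class ThrashingMonitor:
    """Watches memory, switches to sampling on a jump, then reports."""

    def __init__(self, take_sample, report, jump_mb=200, samples_per_round=3):
        self.take_sample = take_sample              # returns one snapshot
        self.report = report                        # receives a list of snapshots
        self.jump_mb = jump_mb                      # hypothetical trigger threshold
        self.samples_per_round = samples_per_round  # hypothetical sample count
        self.state = "monitoring"
        self.last_mb = None
        self.samples = []

    def tick(self, memory_mb):
        """Call periodically with the current memory footprint in MB."""
        if self.state == "monitoring":
            jumped = (self.last_mb is not None
                      and memory_mb - self.last_mb >= self.jump_mb)
            self.last_mb = memory_mb
            if jumped:
                self.state = "sampling"
            return
        # state == "sampling": collect one snapshot per tick
        self.samples.append(self.take_sample())
        if len(self.samples) == self.samples_per_round:
            self.report(self.samples)
            self.samples = []
            self.last_mb = memory_mb
            self.state = "monitoring"  # keep watching after the report


reports = []
monitor = ThrashingMonitor(take_sample=lambda: {"LivexxxA": 1},
                           report=reports.append)
for mb in [600, 800, 810, 820, 830, 835]:
    monitor.tick(mb)

print(len(reports), monitor.state)  # 1 monitoring
```

In this run the jump from 600 MB to 800 MB switches the monitor to sampling, the next three ticks each take a snapshot, and the monitor returns to watching memory once the report is sent.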

Data Display

In a popular live broadcast room, memory snapshots are sampled several times and the TOP 100 entries are collected. Taking "LivexxxA" as an example, the second of two samples shows 4125 more instances than the first. As a first approximation, the thrashing can be attributed to the business related to "LivexxxA", and the investigation can start there.

Advantages and Disadvantages of the Solution

Solution	Advantages	Disadvantages
"MemoryThrashing"	Supports multiple samples, so memory growth trends can be compared. Low performance overhead, so it can run on all users online. Detects memory problems in advance. Easy to use: problems can be found from object counts.	Does not support multiple languages; limited to OC. Cannot analyze memory leaks through memory node relationships; can only find accumulated objects. Cannot analyze multiple memory areas. The Hook method affects method caching.
"MemoryGraph"	Strong problem detection: can analyze OOM problems caused by memory leaks through memory node relationships. Can count the memory usage of memory areas. Applicable to multiple languages.	Complex to get started: memory node reference relationships need to be sorted out. Thread suspension affects business execution and is noticeable to users. The higher the memory usage, the longer it takes to traverse the memory area. Only a small amount of sampling can be done.

Practical Cases

Currently, "MemoryThrashing" is deployed to monitor the test environment and will be deployed online later. Offline observation has already exposed many problems in advance. Previously, problems were noticed only once they occurred or had an obvious impact, and QA had to report them to RD. "MemoryThrashing" has greatly improved troubleshooting efficiency and surfaces degradation problems early. The following are two cases.

Memory Accumulation

As shown below, a large number of objects were allocated over several sampling cycles and were not released, which caused a significant increase in memory. Sampling cycle 3 had 234,024 more objects than sampling cycle 2, and in the end 238,800 "LivexxxBigDataRead" objects were resident in memory, occupying 10.9 MB.

Temporary Objects

The following problem was found in the live broadcast scene. When the anchor starts the barrage carnival, Effect recognizes the face, and a corresponding contour model is then created and handed to the middle platform to draw the contour. This happens at a very high frequency. The increment of temporary objects peaks at about 60,000 (the difference between the last two samples) every 5 seconds (the actual interval is shorter). Since no "Memory Summary" information is generated, the objects can be considered not resident in memory. The cumulative number of allocations exceeds one million, which directly affects live broadcast performance.

Future Plans

Attribution Ability

In some cases, counting OC objects alone may not be enough. For example, when common basic objects grow abnormally, the counts give no way to trace the specific cause. With object reference relationships, the problem could be narrowed down further. These are, of course, supplements to the "MemoryGraph" capability. If "MemoryGraph" has captured data, the two can be combined to follow the object reference chain and then find the responsible business.

"MemoryThrashing" can be combined with object reference analysis. For efficiency, there is no need to search the reference relationships of all objects, which is time-consuming. Only the key objects at the top growth points need to be searched, and in actual testing this may be just a few objects;
Record information through thread stack sampling;

CPU Monitoring

In previous cases, many OOMs and ANRs were accompanied by high CPU usage. In one case, the OOM was caused by a large amount of data processing, and investigation showed that the thread responsible for that business had very high CPU usage. Monitoring per-thread CPU usage is therefore a necessary supplement. The suspected business can be identified from the thread name and stack.
