Author: Wang Haichao

Background

Live-broadcast OOM problems are difficult to diagnose: many business modules are involved, and getting from locating the cause to fixing it takes a long time. To surface problems earlier, improve diagnosis efficiency, and supplement the existing tools, we propose a live-broadcast memory jitter solution called MemoryThrashing. Why propose this solution?
What is Thrashing?

Wikipedia defines thrashing as, roughly, a state in which a computer's virtual memory is overused, causing constant paging that crowds out application-level work.
Defining memory thrashing from a business perspective: in plain terms, a performance metric fluctuates sharply. For example, if memory grows from 600 MB to 800 MB in a short period of time, we call that a jitter, and we want a self-developed tool to tell us where that 200 MB increase came from. In real OOM cases, OOM caused by a sudden memory increase is fairly common; the two typical manifestations are temporary objects and memory accumulation, described below.
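For concreteness, here is a minimal sketch of the trigger condition described above: poll the app's memory footprint periodically and flag a jitter when the growth between two polls exceeds a threshold. The function names, the 5-second polling interval, the threshold parameter, and the use of task_info(TASK_VM_INFO) are assumptions made for illustration, not details of the original tool.

```objc
#import <Foundation/Foundation.h>
#import <mach/mach.h>

// Hypothetical helper: current resident footprint of the process, in bytes.
static uint64_t MTCurrentFootprint(void) {
    task_vm_info_data_t info;
    mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
    kern_return_t kr = task_info(mach_task_self(), TASK_VM_INFO,
                                 (task_info_t)&info, &count);
    return kr == KERN_SUCCESS ? info.phys_footprint : 0;
}

// Poll the footprint every few seconds; if it grows by more than
// `thresholdMB` between two polls, treat it as a "memory jitter" and
// hand the delta to the caller (which would start detailed sampling).
void MTStartJitterWatch(double thresholdMB, void (^onJitter)(double deltaMB)) {
    __block uint64_t last = MTCurrentFootprint();
    NSTimer *timer = [NSTimer timerWithTimeInterval:5.0 repeats:YES block:^(NSTimer *t) {
        uint64_t now = MTCurrentFootprint();
        double deltaMB = ((double)now - (double)last) / (1024.0 * 1024.0);
        if (deltaMB > thresholdMB) onJitter(deltaMB);
        last = now;
    }];
    [[NSRunLoop mainRunLoop] addTimer:timer forMode:NSRunLoopCommonModes];
}
```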
This article takes temporary objects and memory accumulation as examples to explain how to locate this type of problem. The "AllocTime Summary" describes how many times temporary objects are allocated, and the "Memory Summary" describes memory accumulation.

Temporary Objects

Temporary objects: a large number of objects are allocated in a short period of time, which destabilizes the live broadcast and may raise both memory and CPU load. This usually shows up as a memory spike, or even a direct OOM, within a short window, after which memory quickly falls back to normal levels; such objects do not stay resident for long. By monitoring temporary objects, these problems can be discovered in advance.

The figure above shows the top temporary objects ranked by allocation count (AllocTime Summary). "AllocTime Summary 1" is the allocation count from the first sample, and the others follow the same pattern. For example, diffing "AllocTime Summary 2" against "AllocTime Summary 1" shows that "LivexxxA" was allocated 7803 times during the sampling interval. Since no "Memory Summary" information was collected, these objects can be considered non-resident.

Memory Accumulation

Memory accumulation: a large number of objects stay resident in memory and are not released in a short period of time, keeping the memory level high and easily triggering OOM.

The figure above shows the top instances ranked by memory residency. "Memory Summary 1" is the residency information from the first sample, and the others follow the same pattern. For example, diffing "Memory Summary 2" against "Memory Summary 1" shows that "LivexxxA" gained 56791 instances during the sampling interval, and by the last sample a total of 69904 instances were resident in memory. Across the samples, "LivexxxA" grows every time.

MemoryThrashing Solution

Solution Research

The idea of the solution is to find the growth by diffing memory information. Memory information (currently mainly the number of instances per class) is sampled at multiple points in time and diffed to find the TOP growth, which is what attributes the problem. A minimal sketch of this diff step follows.
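The sketch below illustrates the diff step under the assumption that each snapshot is a dictionary mapping a class name to its live instance count; the dictionary shape and the function name are illustrative, not the article's actual implementation.

```objc
#import <Foundation/Foundation.h>

// Given two sampled snapshots (class name -> live instance count), compute
// the per-class growth and return the TOP N class names, largest growth first.
NSArray<NSString *> *MTTopGrowth(NSDictionary<NSString *, NSNumber *> *previous,
                                 NSDictionary<NSString *, NSNumber *> *current,
                                 NSUInteger topN) {
    NSMutableDictionary<NSString *, NSNumber *> *delta = [NSMutableDictionary dictionary];
    [current enumerateKeysAndObjectsUsingBlock:^(NSString *cls, NSNumber *count, BOOL *stop) {
        NSInteger growth = count.integerValue - previous[cls].integerValue; // nil previous reads as 0
        if (growth > 0) delta[cls] = @(growth);
    }];
    NSArray<NSString *> *sorted = [delta keysSortedByValueUsingComparator:^NSComparisonResult(NSNumber *a, NSNumber *b) {
        return [b compare:a]; // descending by growth
    }];
    return [sorted subarrayWithRange:NSMakeRange(0, MIN(topN, sorted.count))];
}
```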
Memory Area

By traversing the memory nodes in the malloc zones and matching them against the registered classes, instance counts can be gathered; the advantage of this approach is that it can monitor the OC object instance counts of the entire app. For the live-broadcast scenario, monitoring the whole app is not currently needed: the requirement is to monitor the live-broadcast scene when certain conditions are met, for example a large memory fluctuation after watching a live broadcast for some time, so the scope is more focused. Another consideration is that when resident memory is already large, traversing the zones is time-consuming, and if threads are not suspended there are potential crash and data-accuracy problems.

RunTime

By hooking allocation and deallocation, the number of allocations and releases per class is counted, which yields the number of surviving instances. This makes it possible to monitor the growth of OC instances in a fixed scenario, such as a sudden memory increase in the live room; the scope is smaller and there is no need to count irrelevant objects. This approach takes less time than traversing the memory zones and has no wild-pointer problem, but the performance impact of monitoring objects must be watched. The RunTime approach is the one currently used; in offline live-room tests its impact on the main thread is negligible (a minimal sketch of the hook appears at the end of this section).

Solution Design

During development we found that objects are created and released in a complex multi-threaded environment; handled improperly, this has a potential impact on the business, reducing execution efficiency or causing stability problems.
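The following is a minimal sketch of the RunTime idea, not the article's actual implementation: +alloc and -dealloc on NSObject are swizzled so that every allocation and release adjusts a per-class live-instance counter, which the sampler can later snapshot and diff. All selector, function, and variable names here are assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#import <os/lock.h>

// Live instance count per class, keyed by the Class pointer itself. A plain
// CF dictionary with no retain/copy callbacks is used so that recording a
// count never re-enters the swizzled +alloc.
static CFMutableDictionaryRef gLiveCounts;
static os_unfair_lock gCountLock = OS_UNFAIR_LOCK_INIT;

static void MTAdjustCount(Class cls, intptr_t delta) {
    os_unfair_lock_lock(&gCountLock);
    intptr_t count = (intptr_t)CFDictionaryGetValue(gLiveCounts, (__bridge const void *)cls);
    CFDictionarySetValue(gLiveCounts, (__bridge const void *)cls, (const void *)(count + delta));
    os_unfair_lock_unlock(&gCountLock);
}

@implementation NSObject (MTInstanceCounting)

+ (instancetype)mt_alloc {
    MTAdjustCount(self, +1);   // `self` is the concrete class being allocated
    return [self mt_alloc];    // after swizzling, this calls the original +alloc
}

- (void)mt_dealloc {
    MTAdjustCount(object_getClass(self), -1);
    [self mt_dealloc];         // after swizzling, this calls the original -dealloc
}

+ (void)mt_startInstanceCounting {
    gLiveCounts = CFDictionaryCreateMutable(NULL, 0, NULL, NULL);
    method_exchangeImplementations(
        class_getClassMethod(self, @selector(alloc)),
        class_getClassMethod(self, @selector(mt_alloc)));
    method_exchangeImplementations(
        class_getInstanceMethod(self, NSSelectorFromString(@"dealloc")),
        class_getInstanceMethod(self, @selector(mt_dealloc)));
}

@end
```

Note that this sketch serializes every allocation behind one global lock purely for brevity, and ignores edge cases such as dynamically generated (e.g. KVO) subclasses; as described next, the real tool optimizes this hot path.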
After optimization, a multi-level cache is used to address the main-thread performance overhead, bringing the main-thread cost close to zero.

Monitoring Process

Monitoring is turned on after the user has been in the live room for a period of time. Changes in the memory value are watched to decide whether to enable sampling; once sampling is enabled, a phase of several consecutive samples begins, and when those samples are complete the data is reported. After reporting, memory monitoring continues as before.

Data Display

In a hot live room, memory snapshots are sampled several times and the TOP 100 entries are collected. Taking "LivexxxA" as an example, the second of two samplings shows an increase of 4125 instances. This can be attributed, at a first pass, to the "LivexxxA"-related business causing the memory thrashing, so the investigation can start from that business.

Advantages and Disadvantages of the Solution
Practical Cases

Currently "MemoryThrashing" has been deployed to monitor the test environment and will be rolled out online later. Offline observation has already exposed many problems in advance. Previously a problem was only perceived once it actually occurred or had an obvious impact, and QA had to feed it back to RD; "MemoryThrashing" greatly improves troubleshooting efficiency and catches regressions early. Two cases follow.

Memory Accumulation

As shown below, a large number of objects were allocated across several sampling cycles and never released, driving memory up significantly. Sampling cycle 3 allocated 234,024 more objects than sampling cycle 2, and in the end 238,800 "LivexxxBigDataRead" objects were resident in memory, occupying 10.9 MB.

Temporary Objects

The following problem was encountered in the live-broadcast scene. When the anchor starts the barrage carnival, each time Effect recognizes a face a corresponding contour model is created and handed to the middle platform to draw the contour, and this happens at a very high frequency. The peak temporary-object increment reaches 60,000 every 5 seconds (the difference between the last two samplings; the actual window is shorter). Since no "Memory Summary" information is produced, these objects can be considered non-resident. The cumulative object allocation exceeds one million times, which has a direct impact on live-broadcast performance.

Future Plans

Attribution Ability

In some cases, counting only OC object data may not be enough. For example, if a common basic object grows abnormally, there is no way to track the specific cause; with object reference relationships, the problem could be narrowed further. These are all supplements to the "Memory Graph" capability: if the "Memory Graph" has captured the data, it can be combined with it to lock down the object reference chain and then find the responsible business.
CPU Monitoring

From previous cases, many OOM and ANR problems are accompanied by high CPU usage. In one case, an OOM was caused by a large amount of data processing, and investigation showed that the CPU usage of the thread responsible for that business was very high. It is therefore worth supplementing the tool with per-thread CPU usage monitoring, so that the suspect business can be pinned down by thread name and stack.
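A rough sketch of such a per-thread check is shown below; the function name and threshold are assumptions, and the sketch only logs the busy threads rather than capturing stacks.

```objc
#import <Foundation/Foundation.h>
#import <mach/mach.h>
#import <pthread.h>

// Walk the task's threads, read each thread's cpu_usage via thread_info,
// and log threads whose usage exceeds `usageThreshold` (0.0 - 1.0) along
// with their names, so the suspect business can be narrowed down.
void MTLogBusyThreads(float usageThreshold) {
    thread_act_array_t threads;
    mach_msg_type_number_t count = 0;
    if (task_threads(mach_task_self(), &threads, &count) != KERN_SUCCESS) return;

    for (mach_msg_type_number_t i = 0; i < count; i++) {
        thread_basic_info_data_t info;
        mach_msg_type_number_t infoCount = THREAD_BASIC_INFO_COUNT;
        if (thread_info(threads[i], THREAD_BASIC_INFO,
                        (thread_info_t)&info, &infoCount) != KERN_SUCCESS) continue;
        if (info.flags & TH_FLAGS_IDLE) continue;

        float usage = info.cpu_usage / (float)TH_USAGE_SCALE;
        if (usage < usageThreshold) continue;

        char name[64] = {0};
        pthread_t pt = pthread_from_mach_thread_np(threads[i]);
        if (pt) pthread_getname_np(pt, name, sizeof(name));
        NSLog(@"[MemoryThrashing] busy thread %s: %.0f%% CPU", name, usage * 100);
    }
    // Release the thread ports and the array returned by task_threads.
    for (mach_msg_type_number_t i = 0; i < count; i++) {
        mach_port_deallocate(mach_task_self(), threads[i]);
    }
    vm_deallocate(mach_task_self(), (vm_address_t)threads, count * sizeof(thread_t));
}
```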