About the Author: Huang Haoyu currently works in Tencent's Social Network Operations Department, where he is responsible for the operation and maintenance of SNG's mobile social network products, including business optimization and development for QQ and Qzone. Previously, he worked at Alibaba, where he was responsible for the operation and maintenance of Tmall's event-related businesses, such as Tmall Double 11 and the Tmall Anniversary Celebration.

Introduction

The mobile Internet is developing fast, and operation and maintenance technology must adapt to business changes. This time, the editor invited a Tencent expert to introduce the speed optimization practices of mobile QQ and mobile Qzone. We firmly believe that operations and maintenance will become increasingly specialized across different vertical fields. How to use operation and maintenance technology and data to bring greater value to the business in different business forms will be the focus of our next exploration.

1. About user waiting time

For users, the most direct impression of an app is how long they have to wait, so we must first analyze where the app makes users wait and where the time goes. The waiting time falls into the following three categories:
Products such as QQ and Qzone have been optimized on the server side for many years. Most data is read from and written to NoSQL databases directly, and an interface call completes in about 30-120 ms, so further server-side optimization brings little practical benefit. The following mainly introduces the optimization practices in the latter two directions.

2. Network transmission

First, we need to measure the time spent on network transmission to know how much optimizing it is worth.

2.1 Network transmission time statistics

Network time consumption is measured on the server side through the three-way handshake of the TCP protocol. This approach is simple, fast and low-cost. The specific solution is shown in Figure 2.1.
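As a rough illustration of handshake-based measurement, the sketch below estimates the round-trip time of each connection from timestamps captured on the server. It assumes the RTT is taken as the gap between the server sending SYN-ACK and receiving the client's final ACK. The article does not spell out this detail, and the `Packet` record and `handshake_rtts` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ts: float     # capture time on the server, in seconds
    client: str   # "ip:port" of the remote peer
    flags: str    # "SYN", "SYN-ACK" or "ACK"


def handshake_rtts(packets):
    """Return {client: rtt_ms}, measured from SYN-ACK sent to first ACK received.

    Assumption: the server can timestamp its own SYN-ACK and the client's ACK,
    so no client-side instrumentation is needed.
    """
    synack_sent = {}
    rtts = {}
    for p in sorted(packets, key=lambda p: p.ts):
        if p.flags == "SYN-ACK":
            synack_sent[p.client] = p.ts
        elif p.flags == "ACK" and p.client in synack_sent:
            # Only the first ACK after the SYN-ACK completes the handshake.
            rtts[p.client] = (p.ts - synack_sent.pop(p.client)) * 1000.0
    return rtts


trace = [
    Packet(0.000, "10.0.0.1:5000", "SYN"),
    Packet(0.001, "10.0.0.1:5000", "SYN-ACK"),
    Packet(0.081, "10.0.0.1:5000", "ACK"),   # completes the handshake
    Packet(0.200, "10.0.0.1:5000", "ACK"),   # later data ACK, ignored
]
rtt_by_client = handshake_rtts(trace)
```

Because every TCP connection starts with a handshake, a measurement of this kind covers all users without changing the client, which matches the low-cost property described above.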
Figure 2.1 Measuring network latency from the server

According to actual data statistics, without cross-network access (normal signal):
Judging from the measured results, current mainstream 3G/4G network speeds are quite good. However, because mobile networks are complex, the business return-code monitoring of QQ and Qzone still reveals many problems:
Below is the optimization strategy that mobile Qzone applies in its access components.

2.2 Mobile Qzone WNS Access Strategy

Introduction: WNS is the communication framework between the mobile Qzone app and the server. It supports the TCP and HTTP protocols.

2.2.1 Direct IP long connection access using private protocol (Figure 2.2)

Advantages:
Disadvantages: Since no domain name is used, the first connection requires additional strategies to find a suitable access point, and redirection capability is required.

Figure 2.2 Private protocol direct IP long connection

2.2.2 First connection strategy

The furthest distance in the world is that you are on China Unicom and I am on China Telecom. In a complex mobile network environment, we need to optimize the network access strategy to avoid cross-network and cross-region access.

On a mobile network, the client first identifies the user's operator and then starts 4 connections at the same time, combining multiple access IPs, multiple ports and 2 protocols. Using 2 protocols and multiple ports at once avoids the restrictions imposed by some local operators. Whichever connection is established first is used (see Figure 2.3).

Figure 2.3 The first concurrent connection attempt

On WIFI, the client tries to connect using the domain name first when connecting for the first time.

When the above strategies fail to connect, the client runs a scoring strategy and uses the backup IP list to connect to the fastest access point. Tencent has a large number of CDN nodes in China, so even users in remote areas can connect by using CDN nodes as proxies.

Advantages: Multiple first connection strategies make it as likely as possible that users reach the server, which is especially important in complex mobile networks.

Disadvantages: There is additional overhead for the first connection; the connection may not be to the optimal access point; using CDN nodes as proxy access is costly.

2.2.3 Optimal Access & Redirection

After the connection is established, the server identifies the user's exit IP through the GSLB IP library.
If the user's access point is found not to be optimal, the server uses big data analysis to determine the access point the user should use in the given period of time and issues a redirection instruction so that the client connects to the optimal server access IP. Under WIFI, the SSID and access IP are also cached.

Advantages: Users access the nearest/optimal network, reducing network time consumption.

Disadvantages: A small number of users need to connect to the server twice on first use.

2.2.4 Using Dictionary for Data Compression

Reduces bandwidth overhead; secure.

2.2.5 Heartbeat

Avoids disconnection of the long connection.

2.2.6 Single Connection Concurrent Requests

Compared with the traditional HTTP mode of multiple connections with a single request each (before HTTP 2.0), using a single connection greatly reduces client and server overhead.

In conclusion

The optimization we can do on mobile networks comes down to reducing connections, reducing requests, avoiding cross-network and cross-region traffic, and optimizing protocols. With the rapid development of 4G and fiber optics, network time consumption will keep shrinking for more and more users, which means that our network strategy optimizations will yield diminishing returns. At this point, we turn our attention to the terminal.

3. Terminal time consumption

As above, we first need to measure the time consumption of the terminal to set optimization expectations and goals. Reporting and monitoring from client-side instrumentation points showed that, in one grayscale version of mobile Qzone, the rate of no response for more than 3 seconds after certain user operations was as high as 30%; in one grayscale version of mobile QQ, the rate of frame drops caused by UI problems was about 15%. Among complaint categories, lag, slowness and freezes have long ranked in the top three.
It can be concluded that terminal problems are serious and directly affect the user's operating experience.

3.1 Android/iOS system background

To optimize mobile clients, we need a basic understanding of the operating systems (Android and iOS). Both are based on UNIX/Linux, so many concepts are easy for operation and maintenance personnel to understand. One of the more important design concepts is that both Android and iOS support multithreaded development, with one thread being the main thread, also known as the UI thread. The UI thread is the only thread allowed to operate the user interface. If the user runs into an experience problem during operation, the cause is generally that the main thread is blocked or does not have enough resources to run. So we start by monitoring the main thread and the usage of system resources.

3.2 Monitoring Strategy

How do we judge whether the terminal has performance problems such as slowness or lag? Based on the Android and iOS background above, our goal is to monitor the main thread. There are two main monitoring strategies:

1) Monitor the time consumed by function calls

When a function call on the main thread takes more than N seconds, the main thread is waiting and blocked and all UI behavior is suspended, so the terminal is considered stuck.

Disadvantages: Cannot accurately reflect the user experience
Advantages: Low implementation cost, low overhead

2) Monitor screen FPS and frame drops

When the page drops frames during user operation, the user is considered to be experiencing slowness or a freeze (Figure 3-1).

Advantages: It truly reflects the user experience and can classify lag into short lag and long lag.
Disadvantages: There is additional FPS monitoring overhead, which testing showed to be about 2% of the entire app's overhead.
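The second strategy can be sketched as follows: given the timestamps at which frames were actually rendered, count how many frames were skipped between consecutive frames and classify each gap as a short or a long lag. This is a minimal sketch, not the QQ/Qzone implementation. The 60 FPS refresh rate and the `short_drop` and `long_drop` thresholds are assumptions chosen for illustration, and `classify_lags` is a hypothetical name.

```python
FRAME_INTERVAL_MS = 1000.0 / 60  # assumption: a 60 Hz display


def classify_lags(frame_times_ms, short_drop=3, long_drop=30):
    """Return a list of (start_ms, dropped_frames, kind) for each lag.

    frame_times_ms: timestamps (ms) at which frames were rendered, ascending.
    short_drop / long_drop: hypothetical thresholds, in dropped frames.
    """
    lags = []
    for prev, cur in zip(frame_times_ms, frame_times_ms[1:]):
        # Frames that should have been drawn between prev and cur but were not.
        dropped = int(round((cur - prev) / FRAME_INTERVAL_MS)) - 1
        if dropped >= long_drop:
            lags.append((prev, dropped, "long"))
        elif dropped >= short_drop:
            lags.append((prev, dropped, "short"))
    return lags


# Three smooth frames, a 100 ms stall, one smooth frame, then a 1 s freeze.
frames = [0.0, 16.7, 33.3, 133.3, 150.0, 1150.0]
lags = classify_lags(frames)
```

On a real device the timestamps would come from the platform's frame callback; only the classification step is shown here.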
Figure 3-1 Monitoring the screen FPS

3.3 Stack Collection

With monitoring strategies in place, the next question is how to work with them to capture the "crime scene" data and report it to the server. Besides system resources such as CPU and memory, the most important "crime scene" data is the code execution stack. Because mobile terminals have limited performance resources, stack collection must pay close attention to its impact on the system, so we need to decide when collection is triggered. There are two main collection schemes:

3.3.1 Enable additional threads to record the main thread stack

An additional child thread is started to record the stack data of the main thread. When a lag occurs, the stack data is obtained from this thread. The advantage is that only a very small SDK package needs to be introduced, and it works regardless of the compilation method and virtual machine version. The strategy for obtaining the stack is further divided into a passive strategy and an active strategy.

Passive strategy: This assumes that the slowness or lag occurs only once in a short period of time, so if it is missed, the real on-site stack cannot be obtained. The strategy is as follows: the child thread continuously captures the stack of the main thread. When a problem occurs in the main thread, the child thread retrieves the stack data from the time of the incident using the start and end timestamps of the problem (as shown in Figure 3-2).

Disadvantages: The child thread needs to record the main thread stack at all times, which is costly
Advantages: The acquired stack data is accurate

Figure 3-2 Monitoring the main thread function call time

Active strategy: This assumes that the slowness or lag occurs several times in a short period of time or continues for a period of time.
The strategy is: when a problem occurs in the main thread, the child thread is activated and captures X stacks within the next N seconds.

Disadvantages: The captured stacks are random, and they are stacks from after the incident.
Advantages: Very little extra overhead, basically no impact on the app

3.3.2 Stub/embed in the compilation phase

Tools are used at the compilation stage to add a time-consumption statistics function at each function call point.

Disadvantages: Increases the app package size (by about 10-20% in testing); different compilation methods and virtual machines require different tools to support stubbing and embedding; system call data is missing
Advantages: No additional thread overhead at runtime

Both solutions have their own advantages and disadvantages, but because of the strict restrictions on package size, QQ and Qzone currently mainly use solution 1.

3.4 Big Data Cluster Analysis

As mentioned above, the passive strategy of solution 1 has a greater impact on terminal performance, while the data obtained by the active strategy is random, that is, the client cannot accurately capture the problem stack. At present, we mainly analyze problems with the active strategy plus big data cluster analysis. The basic idea is that if a piece of logic really has a performance problem, most users will hit it. We therefore cluster the stack data to find the stacks that occur at scale and filter out irrelevant stacks that were captured occasionally due to randomness.

For the clustering statistics of stacks, we mainly build a CT (Climbing Tree); ClimbingTree is an internal name. The main idea is to generate a stack tree from the stacks and use massive data to perform weighted calculations (mainly function time consumption) on the tree.
Finally, the nodes on the same layer are sorted from left to right by weight, and the nodes below the set threshold are pruned. The defining characteristic of ClimbingTree is that the weights of the child nodes of the same parent node decrease from left to right.

3.4.1 Building a CT (Climbing Tree) Graph

First, a user's reported stack data is preprocessed, including decrypting files, translating stack functions, formatting stacks and filtering out irrelevant data, and a business function call relationship chain is generated. According to the call relationship, multiple call relationship chains of the same user are merged, the time consumption of identical nodes is added up, and the tree nodes are sorted from left to right by time consumption to generate a function call relationship tree (see Figure 3-3).

Figure 3-3 Function call relationship tree

By merging the call relationship trees of multiple users and pruning the low-weight branches below the threshold, a CT (Climbing Tree) is generated. This tree aggregates the data of all problem stacks, with problem severity sorted from left to right (see Figure 3-4). Assuming that each node takes 1 second, the call chain ABC in the CT is likely to be the function call chain where the problem lies (because the time consumed by node C accounts for 2/4 = 50% of its parent node's time).

Figure 3-4 CT image

The advantage of CT is that it aggregates massive amounts of data into a small forest of data nodes (compressing the data volume by about 90%-95%). Since a left child node always takes at least as long as the nodes to its right, the left child node is often the problem that affects the parent node. By analyzing the proportion of the parent node's time consumed by the left child node, we can find the root cause among the time-consuming functions (see Figure 3-4 and Figure 3-5).
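The merge, sort, prune and left-most-child analysis described above can be sketched in a few lines. This is a simplified illustration, not Tencent's internal ClimbingTree code. It assumes each reported sample is a call chain (outermost function first) plus a time cost, and that the cost is added to every node on the chain. The names `Node`, `build_tree`, `prune_and_sort` and `hot_path`, the thresholds and the sample data are all hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.weight = 0.0   # accumulated time consumption (seconds)
        self.children = {}  # child name -> Node


def build_tree(samples):
    """Merge (call_chain, cost) samples from many users into one weighted tree."""
    root = Node("<root>")
    for chain, cost in samples:
        root.weight += cost
        node = root
        for fn in chain:
            child = node.children.get(fn)
            if child is None:
                child = node.children[fn] = Node(fn)
            child.weight += cost
            node = child
    return root


def prune_and_sort(node, min_weight):
    """Drop children below min_weight and order the rest heaviest-first (left to right)."""
    kept = [c for c in node.children.values() if c.weight >= min_weight]
    kept.sort(key=lambda c: c.weight, reverse=True)
    node.children = {c.name: c for c in kept}  # dicts keep insertion order
    for c in kept:
        prune_and_sort(c, min_weight)


def hot_path(root, min_share=0.5):
    """Follow the left-most child while it accounts for at least min_share of its parent."""
    path, node = [], root
    while node.children:
        left = next(iter(node.children.values()))
        if node.weight <= 0 or left.weight / node.weight < min_share:
            break
        path.append(left.name)
        node = left
    return path


samples = [
    (["A", "B", "C"], 2.0),
    (["A", "B", "D"], 1.0),
    (["A", "E"], 1.0),
    (["F"], 0.1),          # rare stack, pruned as noise
]
tree = build_tree(samples)
prune_and_sort(tree, min_weight=0.5)
suspect_chain = hot_path(tree)
```

With these sample inputs, the rarely seen chain F is pruned and the walk ends on the chain A, B, C, because at every level the left-most child holds at least half of its parent's accumulated time.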
Figure 3-5 Finding the root cause of time-consuming function nodes

3.5 Summary of common terminal performance issues

The most common problem is performing long-running operations on the main thread, such as
Commonly used optimization methods:

Use child threads to perform asynchronous operations, such as database writes and pulling configuration over the network.
Preload whatever can be preloaded, for example by using the time between opening the app and the home page appearing to open the long network connection or preload video and audio data.
Defer and process asynchronously whatever can be deferred, such as SD card checking and asynchronous message sending.

3.6 Cases & Effects

Complaints about slowness and lag in some optimized versions of QQ iOS:

Qzone Android: The lag rate of some versions (lag rate = number of users experiencing lag / number of users)

4. Summary

In the era of the fast-growing mobile Internet, operation and maintenance technology must adapt to business changes. The speed optimization practices of mobile QQ and mobile Qzone introduced in this article are small examples of Tencent operation and maintenance using big data technology to create value for the business. We firmly believe that as operation and maintenance roles develop, specialization across different vertical fields will also emerge. How to use operation and maintenance technology and data to bring greater value to the business in different business forms, and to let the data speak for itself, will be the focus of our next exploration.