APM from entry to abandonment: Analysis of availability monitoring system and optimization methods

[51CTO.com original article] In the movie "The Matrix", there is a classic scene in which Neo must choose between the red pill and the blue pill. The red pill is a tracing program that helps locate Neo's physical body: wherever he is, the problem can be located and resolved immediately. Developers know that the hard part of fixing most functional problems is locating them, and some of the artificial intelligence, machine learning, and virtual reality technologies shown in the movie can still only be seen in science fiction.

▲ Quarterly active device growth trend

Today, driven by the explosion of mobile devices and user demand, both the number and the size of mobile applications have grown rapidly, and app performance data has become increasingly important for optimizing products. A large number of domestic APM vendors seem to have sprung up overnight. The monitoring system as a whole has also kept changing strategy, moving from the server side to the app side and then to the H5 side to fit different scenarios, which has changed the nature of monitoring and optimization.

The early development of APM

In 1996, Tivoli and HP started from the application level, holding that the network is what determines the speed of the application. In 1998, APM products for component-centric infrastructure monitoring appeared. It was not until 2011 that the spread of mobile devices and the explosion of the app market made people increasingly demanding about the performance experience on mobile devices.

By this time, New Relic and AppDynamics had already taken the lead in the APM field abroad. Some domestic APM vendors saw the mobile trend, and APM seemed to blossom overnight. Today, the more representative APM vendors in China include Tingyun, OneAPM, Yunzhihui, and Borui. BAT (Baidu, Alibaba, Tencent) have also entered the field. Alibaba Baichuan Mali APM (abbreviated as "Mali APM") released a public beta at the Yunqi Conference. Developers do not need to build performance probes, data platforms, and consoles from scratch; they can monitor application performance over the long term and resolve problems promptly in a visual, maintainable way.

▲ Changes in the relationship between APM business and IT development

APM Availability Measurement System

Nowadays, competition in the domestic APM business is increasingly fierce, and everyone is working hard on availability and user experience. For example, people using Taobao on their phones clearly feel that it is more stable and fluent than other domestic e-commerce apps. This is not only because Taobao has excellent development engineers but, more importantly, because of the comprehensive performance monitoring and measurement system behind it.

Through the performance monitoring system, all performance indicators occurring in the app are reported in real time, and the Mali APM server clusters and analyzes them to aggregate problems and identify performance bottlenecks. The complete log information also helps development engineers repair and optimize promptly.

Chen Wu, a technical expert at Alibaba, notes that performance used to be measured by comparing the opening rates of apps, which is very subjective. A major problem facing any measurement system is normalization. So how should a visual performance measurement system be established?

Alibaba Baichuan divides the performance indicators that affect users into availability metrics and experience metrics.

1. Availability metrics

Availability includes app availability and service availability. The most common app availability problem is a crash, and most users uninstall the app right after encountering one. Service availability problems include network connection errors and server errors, which can make key operations such as purchases and subscriptions unavailable and cause asset losses. If such problems go unresolved for a long time, they also lead to user loss.

Such problems need to be fixed as soon as possible. The sooner they are fixed, the better the stop-loss effect will be.

This requires the client probe to have powerful collection capabilities. The probe SDK collects crashes caused by thread exceptions, memory overflow, the system killing the process, and other causes, and captures as much environment information and as many user operation traces as possible to help developers reconstruct what the user did and locate the problem. The same applies to network requests: the probe SDK needs to collect network performance indicators automatically and capture logs of failed network requests to help development engineers solve problems.

However, the probes collect only single events on the user app side. If 1,000 users have availability issues, the server may receive 1,000 logs. It is obviously not feasible for development engineers to troubleshoot problems in a massive amount of logs. This requires the APM server to perform semantic analysis and efficient clustering of these logs in real time. For example, 1,000 user logs can be aggregated into 3 issues and fed back to developers through the console. This will greatly improve the efficiency of development engineers in troubleshooting and solving problems.
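
The aggregation step described above can be sketched as follows. This is only a minimal illustration, not Mali APM's actual algorithm: the log format (a dict with "user" and "stack"), the choice of the top three frames, and the function names are all assumptions. The idea is to normalize away addresses and offsets so that logs from the same defect collapse into one issue.

```python
import re
from collections import defaultdict

# Hypothetical log format: {"user": ..., "stack": [frame strings]}.
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_NUM = re.compile(r"\d+")


def signature(stack, top_n=3):
    """Normalize the top frames so logs that differ only in
    addresses or offsets produce the same key."""
    frames = []
    for frame in stack[:top_n]:
        frame = _HEX.sub("<addr>", frame)
        frame = _NUM.sub("<n>", frame)
        frames.append(frame.strip())
    return tuple(frames)


def cluster(logs, top_n=3):
    """Group raw logs into issues keyed by normalized stack signature."""
    issues = defaultdict(list)
    for log in logs:
        issues[signature(log["stack"], top_n)].append(log["user"])
    # Largest issues first, so the console can show the worst problem on top.
    return sorted(issues.items(), key=lambda kv: len(kv[1]), reverse=True)
```

A production system would need a richer similarity measure (for example, ignoring system frames), but even this simple key turns many per-user logs into a short list of issues.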

2. Experience metrics

App experience is the key to user retention and activity. People naturally like a "silky smooth" feel when using an app. However, the experience of most apps on the market is still poor: users often run into freezes, image loading failures, and long page waits. These experience problems need to be displayed and measured systematically.

The APM console handles freezes in much the same way as crashes. Freezes of the same type are clustered together, and the detailed logs of the users who experienced them are grouped as well and can be paged through for review. For problems such as image loading failures and page elements not displaying properly, first check whether the host serving the static resource is abnormal (too many requests per minute, oversized images, and so on). If the static resource service is normal, look at the error rate of the URL requesting the image and infer from that whether the problem lies with the image itself.

For quantitative performance optimization, how can enterprises customize what they measure? Chen Wu believes that all URLs on a critical path should be chained together and the health of the service judged from the critical path as a whole, rather than from every URL. With network performance monitoring, developers do not need to watch all URLs: different developers own different core businesses and care about different URLs. In e-commerce, for example, one critical path is: the user logs in, opens a product, views the details, then places an order and pays. Grouping all URLs on that critical path and safeguarding the performance of this chain strengthens the service quality and stability of the core business.
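
One way to picture a critical-path health indicator is the sketch below. It is an illustration under stated assumptions, not the Mali APM formula: the UrlStats fields, the URL names, and the rule that path success is the product of per-step success rates (which assumes steps fail independently) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UrlStats:
    requests: int          # requests observed in the window
    errors: int            # failed requests in the window
    avg_latency_ms: float  # mean response time


def critical_path_health(path, stats):
    """Combine per-URL stats into one indicator for the whole path.

    Assumption: steps fail independently, so the chance that a user
    completes the path is the product of the per-step success rates.
    """
    success = 1.0
    latency = 0.0
    for url in path:
        s = stats[url]
        success *= 1.0 - s.errors / s.requests
        latency += s.avg_latency_ms
    return {"success_rate": success, "latency_ms": latency}
```

For a login, detail, order path with success rates of 99%, 100%, and 95%, the path-level success rate is about 94%, which is lower than any single URL suggests. That is the argument for watching the chain rather than each URL in isolation.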

APM Availability Detection Method

▲ The monitoring system of Alibaba Baichuan Mali APM

To improve application availability, APM generally combines application monitoring with service monitoring so that developers get end-to-end, full-link performance management. Alibaba technical expert Xiong Qi introduced how the Mali APM monitoring system covers application monitoring, service monitoring, database monitoring, and message push performance monitoring, mainly in the following ways:

★ Application monitoring collects performance data on memory, CPU, crashes, network, and so on from iOS and Android applications;

★ Service monitoring supports performance monitoring of containers such as Tomcat, Jetty, and JBoss and of frameworks such as Spring and Struts;

★ Database monitoring supports SQL databases such as MySQL and NoSQL stores such as Redis and Memcached;

★ Mali APM also supports performance monitoring of the Taobao message service TMC, the distributed framework Dubbo, and Taobao API calls.

After collection, the data is written to a storage system and a log system that can carry massive amounts of data. The statistics system uses the persisted data to run calculations and generate reports, helping developers track the performance of applications and services over the long term. The alarm system sends instant alerts such as text messages and emails according to rules when problems occur, helping developers solve problems promptly and reduce losses.

Availability Measurement Methods - Performance

When developing an application, program errors, main thread freezes, and crashes caused by resource usage exceeding system limits are the most serious problems that need to be solved first.

Developers usually use simulators, Instruments, or automated tests to find some problems, but tests often fail to cover the devices, networks, and other conditions of real usage scenarios. Feedback channels such as social media or email do provide some real user feedback, but users often cannot clearly describe the information needed to reproduce a problem, and the cost of back-and-forth communication is extremely high. Therefore, on the client side, Mali APM collects application crash information through the following detection methods.

In signal capture mode, Mali APM uses sigaction to register a callback that runs when a signal interrupts the program, so that the callback can generate a crash log from the program's running state. In addition, for SIGABRT (abnormal termination), NSSetUncaughtExceptionHandler is also needed to obtain the stack of the uncaught exception and complete the crash information.
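
The two hooks can be illustrated with a rough Python analogue. This is not the iOS implementation: Python's signal.signal stands in for sigaction and sys.excepthook stands in for NSSetUncaughtExceptionHandler, and the CRASH_LOGS list and function names are hypothetical stand-ins for the probe's crash log queue.

```python
import signal
import sys
import traceback

CRASH_LOGS = []        # hypothetical stand-in for the probe's crash log queue
_previous_hook = None  # exception hook that was active before installation


def on_fatal_signal(signum, frame):
    """Signal callback: build a crash log from the interrupted program state."""
    CRASH_LOGS.append({
        "kind": "signal",
        "name": signal.Signals(signum).name,
        "stack": traceback.format_stack(frame),
    })


def on_uncaught_exception(exc_type, exc, tb):
    """Uncaught-exception hook: record the exception stack, then chain
    to whatever hook was installed before, if any."""
    CRASH_LOGS.append({
        "kind": "exception",
        "name": exc_type.__name__,
        "stack": traceback.format_tb(tb),
    })
    if _previous_hook is not None:
        _previous_hook(exc_type, exc, tb)


def install_crash_handlers(signals=(signal.SIGABRT,)):
    """Register both hooks. signal.signal must be called from the main thread."""
    global _previous_hook
    for sig in signals:
        signal.signal(sig, on_fatal_signal)
    _previous_hook = sys.excepthook
    sys.excepthook = on_uncaught_exception
```

The division of labor mirrors the text: the signal callback knows only that the process was interrupted, while the exception hook supplies the stack of the exception that led to the abort.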

Then, the crash log is reported to Mali APM, which aggregates crashes of the same type based on the stack information in the crash log and writes them to data storage. The alarm system can also raise an immediate alert based on rules such as crash count and crash rate.
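
A rule of the kind mentioned above (crash count and crash rate) might be evaluated as in this sketch. The AlertRule fields, the thresholds, and the per-window framing are assumptions made for illustration, not Mali APM's rule format.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    max_crashes: int       # absolute crash count allowed per window
    max_crash_rate: float  # crashes / sessions allowed per window


def evaluate(rule, crashes, sessions):
    """Return the list of reasons this window violates the rule (empty if none)."""
    reasons = []
    if crashes > rule.max_crashes:
        reasons.append(f"crash count {crashes} > {rule.max_crashes}")
    rate = crashes / sessions if sessions else 0.0
    if rate > rule.max_crash_rate:
        reasons.append(f"crash rate {rate:.2%} > {rule.max_crash_rate:.2%}")
    return reasons
```

Checking both the count and the rate matters: a count alone misfires during traffic peaks, while a rate alone misfires when only a handful of sessions exist.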

In addition, Mali APM provides a dSYM upload script. Adding the script to a build phase in Xcode uploads the dSYM file automatically after a successful build. The server parses the dSYM file, re-aggregates the crashes, and writes them to data storage; this aggregation can reduce the number of database rows by up to 90%. It also symbolicates the crash logs. Because symbolication does not depend on a Mac environment, it can make better use of cloud computing platforms to serve more developers.

The second technique is freeze detection, which is based on RunLoop. A RunLoop observer monitors state changes of the main thread's RunLoop. Think of the RunLoop as an athlete running laps around a track, with the Before Sources state as the starting line of each lap. Another thread is started as a timekeeper and checks every 5 seconds whether the RunLoop has completed a lap. If the RunLoop has not completed a lap within 5 seconds, the main thread is considered frozen and a freeze log is generated. If the same freeze recurs, you can choose not to report it again.

In addition, the thresholds are adjusted dynamically for different phases of the application, such as the startup phase, the background phase, and the idle phase, to reduce detection overhead.
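
The athlete-and-timekeeper model can be sketched independently of RunLoop. The class below is a simplified illustration with hypothetical names: lap() would be called by the main loop at the start of each iteration, and check() by the timekeeper thread. The clock is injectable so the logic can be exercised without real waiting, and the stall signature is passed in by the caller rather than extracted from a real stack.

```python
import time


class LoopWatchdog:
    """Timekeeper for a main loop: a new lap must start at least
    every `threshold` seconds, otherwise the loop counts as frozen."""

    def __init__(self, threshold=5.0, clock=time.monotonic):
        self.threshold = threshold  # could be changed per phase (startup, idle)
        self.clock = clock
        self.last_lap = clock()
        self.reported = set()       # freeze signatures already reported

    def lap(self):
        """Called by the main loop at the start of every iteration."""
        self.last_lap = self.clock()

    def check(self, signature):
        """Called periodically by the timekeeper thread.

        Returns a freeze log the first time a signature is seen frozen,
        and None if the loop is healthy or the freeze was already reported.
        """
        stalled_for = self.clock() - self.last_lap
        if stalled_for <= self.threshold:
            return None
        if signature in self.reported:
            return None  # recurring freeze: skip the duplicate report
        self.reported.add(signature)
        return {"signature": signature, "stalled_for": stalled_for}
```

Keeping the threshold as a plain attribute makes the dynamic adjustment described above a one-line change when the application moves between phases.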

For crashes that cannot be detected through signal capture or freeze detection, Mali APM introduces application abort detection. Although abort detection cannot restore the crash scene, it can reveal the existence of the problem. When the application enters the active state, Mali APM sets a flag on the persistent storage to indicate that the program is running normally. When the application exits the active state or a crash is detected, Mali APM clears the flag on the persistent storage, indicating that the program exited under known circumstances. In this way, the next time the application is started, if the flag on the persistent storage is true, it means that the application exited under unknown circumstances during the last run. In this case, Mali APM will report it as an abnormal application termination.

To filter out shutdowns caused by an exhausted battery, Mali APM has also added battery level detection. When the battery is low, the flag is cleared to avoid false alarms.
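
The flag protocol described above is small enough to sketch in full. This is an illustration only: the file-based flag, the method names, and the 5% low-battery threshold are assumptions, not details taken from Mali APM.

```python
import os


class AbortDetector:
    """Tracks a persistent 'running' flag to detect unexplained exits."""

    def __init__(self, flag_path):
        self.flag_path = flag_path

    def _set(self):
        with open(self.flag_path, "w") as f:
            f.write("running")

    def _clear(self):
        if os.path.exists(self.flag_path):
            os.remove(self.flag_path)

    def on_launch(self):
        """True if the previous run ended without clearing the flag,
        i.e. the application exited under unknown circumstances."""
        return os.path.exists(self.flag_path)

    def on_become_active(self):
        self._set()

    def on_resign_active(self):
        self._clear()

    def on_crash_detected(self):
        # The crash is reported through its own channel; do not double count.
        self._clear()

    def on_battery_level(self, level, low_threshold=0.05):
        # Hypothetical threshold: a shutdown at low battery is not an abort.
        if level <= low_threshold:
            self._clear()
```

The check in on_launch should run before on_become_active sets the flag again for the new session; otherwise every launch would look like an abnormal termination.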

Availability Measurement Methods - Network

Network problems such as request errors, high traffic overhead, and carrier hijacking are another thorny class of problem in application development. Simulators, Instruments, or automated tests can uncover simple network problems, but tests can hardly cover the complex network environments of real users, and it is also hard to export network performance data for long-term comparison. Recording network performance with manually inserted tracking code means dealing with multiple system network interfaces on the one hand and keeping application network code and tracking code in sync on the other, so the maintenance cost stays high.

To monitor application performance in real network environments, Mali APM introduced network performance monitoring that requires no manual tracking code, using three injection techniques in network detection to help developers monitor network performance over the long term and optimize the user experience.

The first is Method Swizzling. Every NSObject instance contains an isa pointer pointing to an objc_class structure; each objc_class structure contains a methodLists pointer pointing to an array of objc_method_list structures; each objc_method_list contains objc_method structures; and each objc_method contains a method_imp pointer pointing to the method implementation.

Therefore, as long as we can modify the value of method_imp, we can replace the original implementation. Using <objc/runtime.h>, obtain the objc_method structure pointer through class_getClassMethod or class_getInstanceMethod, and then obtain the address of the method's original implementation, originIMP, through method_getImplementation. Next, generate a new implementation with imp_implementationWithBlock; inside its block argument, call the original implementation and add network performance tracking before and after it. Finally, call method_setImplementation to replace the method implementation. From then on, every call uses the new implementation.
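
The same get, wrap, and set sequence can be shown with a Python analogue. This is not Objective-C swizzling: getattr and setattr on a class stand in for method_getImplementation and method_setImplementation, and HttpClient, TIMINGS, and swizzle are hypothetical names used only for illustration.

```python
import time

TIMINGS = []  # (method name, elapsed seconds), hypothetical metrics sink


def swizzle(cls, name):
    """Replace cls.<name> with a wrapper that times the original implementation."""
    original = getattr(cls, name)  # analogous to method_getImplementation

    def replacement(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)  # call the original implementation
        finally:
            # Runs on success and on error, so failed requests are timed too.
            TIMINGS.append((name, time.perf_counter() - start))

    setattr(cls, name, replacement)  # analogous to method_setImplementation
    return original


class HttpClient:
    """Hypothetical network class whose calls we want to observe."""

    def get(self, url):
        return f"200 OK {url}"
```

After swizzle(HttpClient, "get"), existing call sites keep calling get exactly as before, which is the point of the technique: tracking is added without touching application code.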

The second technique is Proxy. In Objective-C, NSProxy is the only root class besides NSObject. NSProxy is an abstract class that implements the NSObject protocol. To work, its subclasses must override -methodSignatureForSelector: to provide the method signature for a selector and -forwardInvocation: to forward the call.

A proxy is used to inject callbacks into the delegates of classes such as NSURLConnection and NSURLSession. When the delegate proxy receives a message that is not a target protocol method, it forwards the message to the original delegate through the message forwarding mechanism. If the message is a target protocol method, the proxy's own implementation is called directly, and that implementation then delegates the call to the original delegate. In addition, most protocols and protocol methods are optional, so the proxy must also implement -conformsToProtocol: and -respondsToSelector: to declare the additional protocols and methods it adds. In this way, network performance tracking logic can be added without affecting the original callbacks.
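
The forwarding rule can be sketched with a Python analogue, in which __getattr__ plays the role of -forwardInvocation: and an explicitly defined method plays the role of an intercepted target protocol method. The method name did_finish and the on_event callback are hypothetical.

```python
class DelegateProxy:
    """Stands in front of the original delegate and observes selected callbacks."""

    def __init__(self, target, on_event):
        self._target = target      # the original delegate
        self._on_event = on_event  # tracking callback supplied by the probe

    def did_finish(self, request_id, status):
        """Target protocol method: track first, then delegate to the original."""
        self._on_event(("did_finish", request_id, status))
        # The method is optional on the original delegate, so check before calling.
        if hasattr(self._target, "did_finish"):
            return self._target.did_finish(request_id, status)
        return None

    def __getattr__(self, name):
        # Only invoked for attributes the proxy itself does not define:
        # forward everything else to the original delegate unchanged.
        return getattr(self._target, name)
```

Because unknown names are resolved on the original delegate, hasattr(proxy, name) is true for methods the delegate implements and for did_finish, which is roughly what overriding -respondsToSelector: achieves in the Objective-C version.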

The third technique is fishhook. fishhook is used to replace the implementations of C functions in dynamically linked libraries, specifically the relevant functions in CFNetwork and CoreFoundation. Dynamic linking can be explained with a driving analogy. Imagine a novice driver driving from Paris to Rome. Because he does not know the route, he first consults an experienced driver, who tells him the correct route. This first trip involves a detour, but next time he will follow the experienced driver's advice and drive directly to Rome.

Correspondingly, while the program is running, the address of a dynamically linked C function dynamic(...) is recorded in __la_symbol_ptr in the __DATA segment. Initially, the program knows only the symbol name of the function, not the address of its implementation. On the first call, the program obtains the binding information through __stub_helper in the __TEXT segment and updates the implementation address in __la_symbol_ptr through dyld_stub_binder. On later calls, the implementation is found directly through __la_symbol_ptr. So to replace the implementation of the function, we only need to modify __la_symbol_ptr. For the specific implementation, refer to Facebook's open source framework fishhook.
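
The lazy binding and rebinding behavior can be modeled with a toy table. This is purely a conceptual model written for this article, not how dyld or fishhook is implemented: the dictionary of slots stands in for __la_symbol_ptr, and the class and method names are hypothetical.

```python
class LazySymbolTable:
    """Toy model of a lazy symbol pointer table: one slot per imported function."""

    def __init__(self, library):
        self.library = library  # symbol name -> real implementation
        self.slots = {}         # symbol name -> bound implementation
        self.binder_calls = 0   # how many times the slow binding path ran

    def call(self, name, *args):
        if name not in self.slots:
            # First call: take the detour through the binder and cache the address.
            self.binder_calls += 1
            self.slots[name] = self.library[name]
        # Later calls: jump straight through the slot.
        return self.slots[name](*args)

    def rebind(self, name, replacement):
        """Hook a function by overwriting its slot; return the original
        so the replacement can still call it."""
        original = self.slots.get(name, self.library[name])
        self.slots[name] = replacement
        return original
```

The model shows why overwriting a single pointer is enough: every caller reaches the function through the slot, so changing the slot redirects all of them at once.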

Optimization measures to enhance availability

Together, the two kinds of detection described above cover most performance and network monitoring needs and help developers meet the demanding expectations of today's mobile Internet users. So, after establishing a measurement system and identifying specific problems, how should we solve them to improve availability?

1. Network security

Carrier hijacking and DNS hijacking are thorny issues in application development, and there are many solutions. Wang Rui, technical director of 51 Credit Card, says that as a financial product, 51 Credit Card puts security first. Its solution is mainly full-stack HTTPS, which brings some cost and performance overhead. It is also possible to adopt HTTP 2.0, as Facebook and Google have done; whether to do so depends on how the company and its developers assess the implementation cost. Wang Rui also described an earlier transitional solution, HTTP DNS: the client obtains an IP table and connects directly by IP, which avoids the hijacking problem.
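
The direct-by-IP idea can be sketched as a URL rewrite. This is a minimal illustration for plain HTTP only: the IP table contents, the function name, and the choice of the first IP are assumptions, and how the table is fetched and refreshed is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def to_direct_ip_request(url, ip_table):
    """Rewrite a URL to target an IP from the table, bypassing local DNS.

    Returns (url_to_request, extra_headers). The original host name is
    sent in the Host header so the server can still route the request.
    """
    parts = urlsplit(url)
    host = parts.hostname
    ips = ip_table.get(host)
    if not ips:
        return url, {}  # no entry: fall back to normal DNS resolution
    netloc = ips[0] if parts.port is None else f"{ips[0]}:{parts.port}"
    direct = urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
    return direct, {"Host": host}
```

With HTTPS the same rewrite needs extra care, because certificate validation and SNI are tied to the host name rather than the IP, which is one reason the approach suits plain HTTP as a transitional measure.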

The network is an end-to-end technology. Chen Wu, a senior technical expert at Alibaba, believes that from an e-commerce perspective, server stability must be guaranteed first. The server can use strategies such as anti-abuse protection, rate limiting, unitized deployment, remote disaster recovery, and service degradation to keep connections stable. On the client side, the main concerns are the connection link and the data volume. Resources on the link can be backed up across multiple CDNs, and hijacking can be prevented through HTTP DNS, HTTPS, or HTTP 2.0. Once the link is stable, transmission efficiency can be improved through techniques such as nearest-node access, connection multiplexing, higher compression ratios, and binary protocols that reduce packet size. Most important of all is an end-to-end network monitoring system, which makes network service governance more effective.

2. System degradation

The degradation solution is the last line of defense for system performance. From the perspective of performance optimization, there is no 100% perfect design. There will always be some unexpected situations that lead to performance degradation. Therefore, when designing the system, degradation design must be done well.

Wang Chaocheng, chief architect of Ele.me Mobile, recalled that during the Ele.me 517 promotion the server side was under great pressure, and some services were degraded to keep the promotion and flash sales running. On the user side, however, the app kept actively sending requests and data, which added to the pressure on the server cluster. Wang Chaocheng said they would therefore consider also degrading some services in the SDK or the app to reduce the amount of data the server side has to analyze.

Degradation is divided into manual degradation and intelligent degradation. In terms of strategy, it is divided into traffic degradation, effect degradation, and functional degradation. Traffic degradation actively refuses to handle part of the traffic, so the service becomes unavailable for some users. Effect degradation and functional degradation both lower service quality: the former keeps the service available to all users by switching to lower-quality, lower-latency services during traffic peaks, and the latter improves availability by removing functions.
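
The three strategies can be contrasted in a single request handler. This sketch is illustrative only: the switch names (reject_percent, cheap_ranking, disable_recommendations) and the response fields are hypothetical, and the switches could be set by an operator (manual degradation) or by an automatic controller (intelligent degradation).

```python
import zlib


def handle(request, switches):
    """Serve one request under the given degradation switches."""
    # Traffic degradation: refuse a stable fraction of users outright.
    # Hashing the user id keeps the decision consistent across requests.
    bucket = zlib.crc32(request["user"].encode("utf-8")) % 100
    if bucket < switches.get("reject_percent", 0):
        return {"status": 503}

    response = {"status": 200}

    # Effect degradation: everyone is served, but with a cheaper,
    # lower-quality result (here, a cached ranking).
    if switches.get("cheap_ranking"):
        response["ranking"] = "cached"
    else:
        response["ranking"] = "personalized"

    # Functional degradation: non-essential features are switched off.
    if not switches.get("disable_recommendations"):
        response["recommendations"] = True

    return response
```

Only traffic degradation makes the service unavailable to anyone; the other two keep every user served at reduced quality, which matches the distinction drawn above.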

3. Network performance

From the perspective of data structure, it is necessary to select the appropriate data structure according to different business scenarios. When the data traffic is small, there may not be any difference on the client. When the data traffic is too large and the data structure is complex, it is likely to directly affect the performance of the APP.

For applications like "Ele.me" in the catering industry, frequent data transmission makes the data volume very large. Ordinary users may not notice this, but merchants who receive large numbers of orders feel the impact of the data volume much more. Wang Chaocheng believes that newer serialization formats such as Protobuf and FlatBuffers can be considered to reduce the data volume. At the protocol level, HTTP 2.0 compresses HTTP headers: an encoder reduces the size of the headers that have to be transmitted, and both parties keep a table of header fields, so identical fields are not resent with every request and response. HTTP 2.0 is also a binary protocol: it re-encapsulates the header and body of HTTP 1.x in frames, which makes parsing convenient and robust. With content compression and concurrent transmission, less data is sent over slow and unstable wireless links, improving user experience and resource efficiency.
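
The shared header fields table can be illustrated with a toy model. This is deliberately not the real HTTP 2.0 header compression format (HPACK, which also has a static table and Huffman coding); it only shows why a table kept in step by both parties lets repeated fields shrink to an index. The class and its methods are hypothetical.

```python
class HeaderTable:
    """Toy indexed header table kept in lockstep by sender and receiver."""

    def __init__(self):
        self.entries = []  # index -> (name, value)
        self.index = {}    # (name, value) -> index

    def _add(self, pair):
        self.index[pair] = len(self.entries)
        self.entries.append(pair)

    def encode(self, headers):
        """Sender side: replace already known fields with their index."""
        out = []
        for pair in headers:
            if pair in self.index:
                out.append(self.index[pair])  # known: send only the index
            else:
                out.append(pair)              # new: send literally, remember it
                self._add(pair)
        return out

    def decode(self, encoded):
        """Receiver side: rebuild the headers and mirror the table updates."""
        headers = []
        for item in encoded:
            if isinstance(item, int):
                headers.append(self.entries[item])
            else:
                headers.append(item)
                self._add(item)
        return headers
```

On the first request every field travels in full; on later requests fields such as the user agent collapse to a small integer, which matters most on slow mobile links where headers can outweigh small payloads.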

▲ Relationship between http1.x and http2.0 protocols

Chen Wu, a senior technical expert at Alibaba, also said that if the link itself has no problems, the network transport layer must be as fast as possible, otherwise timeouts occur easily. First, at the protocol layer, use HTTP 2.0 to shrink packet headers through header compression, support server push, and multiplex requests over a full-duplex channel so that connections are reused faster. Second, use binary compression at the data layer. When network connectivity is poor overall, the payload is split into small packets to achieve good transmission results.

4. Dynamic hot fix

The so-called hot fix uses hot-patch dynamic repair technology to push patches to users and fix fatal bugs without their noticing. Wang Rui, head of the 51 Credit Card client, believes the biggest problem on mobile clients is version release. For iOS, the whole fix process is long: the build has to be submitted for review, and many users may be lost in the meantime. He believes hot fixes allow quick, timely online repair, usually completed while the app is in use.

In terms of hot fix technology, Android often uses a dex sub-packaging approach, while iOS can use JSPatch, which lets you write native iOS code in JavaScript. You only need to add a very small engine to the project, and JavaScript can then call any native Objective-C interface.

Summary

The optimization methods above mainly address problems arising in three situations: 1. increasingly complex business and continuous feature iteration lead to sudden fatal bugs that must be fixed quickly; 2. a growing user base and expanding data lead to excessive traffic; 3. network security and memory overhead issues.

This article analyzes patterns of mobile performance optimization across different scenarios; once the scenario is identified, a given class of problem can be addressed. Of course, it is not enough to know which problems performance optimization solves and by what means. More importantly, we need to understand the scenario in which a problem occurs, its cause, and the cost of fixing it.

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
