Analysis of the Technical Principles of Mobile Monitoring Systems

In an era so focused on user experience, APM technology is developing rapidly and flourishing in China. I recently surveyed the APM products of various companies and built my own implementation on that basis. Here, from the iOS perspective, I share my understanding of mobile APM technology along with corresponding examples.

What is APM?

APM stands for Application Performance Management. It monitors the reliability and stability of applications so that problems can be fixed quickly and the user experience improved.

Major Chinese companies all have their own monitoring systems, either developed in-house or provided by a third party. In an era where data is king, many companies with the resources prefer to build their own and keep control of their core data. Representative APM products include Tingyun, Alibaba Baichuan, Tencent Bugly, NewRelic, OneAPM, and NetEase Cloud Capture.

When it comes to monitoring, which indicators do we care about? They are listed below:

  • Network requests: success rate, status codes, traffic, response time, and per-phase timings for HTTP and HTTPS (DNS resolution, TCP handshake, SSL handshake for HTTPS only, time to first packet), etc.
  • UI freezes and the freeze stack trace
  • Crash rate and the crash stack trace
  • Abort rate: the number of times the system kills the application, for example because of excessive memory usage
  • Interaction monitoring: page loading time, page interaction traces
  • Dimension information: region, carrier, network access method, operating system, application version, etc.
  • Others: memory, frame rate, CPU usage, startup time, battery level, etc.

How it works

Freeze detection

When an application freezes, frames are usually dropped as well, so the frame rate is the simplest indicator of a freeze. In offline test environments, we can use the frame rate to warn developers that a freeze may have occurred. The frame rate fluctuates too much to be reliable, however, so freezes are generally detected another way: through the Runloop. If you read the Runloop source code, you will find that events are mainly processed between the kCFRunLoopBeforeSources and kCFRunLoopBeforeWaiting states, and after kCFRunLoopAfterWaiting. We can therefore monitor how long the main thread's Runloop stays in kCFRunLoopBeforeSources or kCFRunLoopAfterWaiting. If it stays there too long, a freeze has occurred.
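The bookkeeping behind this idea can be sketched in plain C. This is only an illustration under my own assumptions: the `rl_activity` values are hypothetical stand-ins for the kCFRunLoop* activities, and `rl_tracker` and its two functions are names invented for this example, not part of any Apple API. In a real app, the first function would be called from a run loop observer on the main thread and the second from a watchdog thread, with proper synchronization between them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kCFRunLoop* activities named above. */
typedef enum {
    RL_BEFORE_SOURCES,
    RL_BEFORE_WAITING,
    RL_AFTER_WAITING
} rl_activity;

typedef struct {
    rl_activity activity; /* most recent activity reported by the observer */
    int64_t since_ms;     /* timestamp at which that activity was reported */
} rl_tracker;

/* Called from the run loop observer callback on every state change. */
void rl_tracker_on_activity(rl_tracker *t, rl_activity activity, int64_t now_ms) {
    assert(t != NULL);
    t->activity = activity;
    t->since_ms = now_ms;
}

/* Called periodically by a watchdog: how long has the loop been busy?
 * Returns 0 while the loop is about to sleep, since nothing is being
 * processed in that state. */
int64_t rl_tracker_busy_ms(const rl_tracker *t, int64_t now_ms) {
    assert(t != NULL);
    if (t->activity == RL_BEFORE_WAITING) {
        return 0;
    }
    return now_ms - t->since_ms;
}
```

The watchdog compares the returned busy time against a threshold of its choosing. Timestamps are passed in explicitly so that the logic stays independent of any particular clock.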

[Figure: freeze judgment flow, source: Alibaba Baichuan]

The figure above is taken from Alibaba Baichuan. As it shows, we judge freezes by counting them. If the count is 1 but the time limit has been exceeded, it is a single long freeze. If the count reaches the threshold, it is a run of consecutive short freezes.
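One way to read this judgment is the small C sketch below. It is my own interpretation, not Alibaba Baichuan's implementation: the type and function names are hypothetical, and the three thresholds are assumed parameters that you would tune yourself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    FREEZE_NONE,
    FREEZE_SINGLE_LONG,    /* one stall that exceeded the long threshold */
    FREEZE_REPEATED_SHORT  /* several short stalls in a row */
} freeze_kind;

typedef struct {
    int64_t short_threshold_ms; /* assumed: a stall at least this long counts */
    int64_t long_threshold_ms;  /* assumed: a stall this long is reported at once */
    int count_threshold;        /* assumed: consecutive short stalls to report */
    int short_count;            /* running count of consecutive short stalls */
} freeze_detector;

/* Feed one measured busy time per run loop pass and classify the result. */
freeze_kind freeze_detector_observe(freeze_detector *d, int64_t busy_ms) {
    assert(d != NULL);
    if (busy_ms >= d->long_threshold_ms) {
        d->short_count = 0;
        return FREEZE_SINGLE_LONG;
    }
    if (busy_ms < d->short_threshold_ms) {
        d->short_count = 0; /* a smooth pass breaks the run */
        return FREEZE_NONE;
    }
    if (++d->short_count >= d->count_threshold) {
        d->short_count = 0;
        return FREEZE_REPEATED_SHORT;
    }
    return FREEZE_NONE;
}
```

Resetting the counter on a smooth pass is a design choice of this sketch. It makes "continuous" mean strictly back-to-back stalls, and a sliding time window would be an equally reasonable alternative.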

When a freeze occurs, we collect a stack trace at that moment to locate the problem. You can use PLCrashReporter for this, or develop your own stack trace collection library (see http://www.jianshu.com/p/7e4c7b94ca36 for how to do it).

As for examples, there are already many open source projects online. You can refer to https://github.com/suifengqjn/PerformanceMonitor

Crash Detection

Crashes are usually caused by Mach exceptions or Objective-C exceptions (NSException). We can capture the corresponding Crash events for these two situations.

Mach exception capture

To capture Mach exceptions, you need to register an exception port. This exception port applies to all threads of the current task. To target a single thread, you can register a dedicated exception port through thread_set_exception_ports. When an exception occurs, it is first delivered to the thread's exception port, and then to the task's exception port. Once we have captured the exception, we can do our own work, such as collecting the current stack.

For how to register an exception port, refer to the schematic diagram here and to PLCrashReporter: https://github.com/plausiblelabs/plcrashreporter

Unix signal handling

The operating system converts Mach exceptions into the corresponding Unix signals, so if you are not familiar with Mach, you can register a signalHandler to handle the signals instead. For an example, refer to https://github.com/xcysuccess/iOSCrashUncaught

    signal(SIGHUP, signalHandler);
    signal(SIGINT, signalHandler);
    signal(SIGQUIT, signalHandler);

    signal(SIGABRT, signalHandler);
    signal(SIGILL, signalHandler);
    signal(SIGSEGV, signalHandler);
    signal(SIGFPE, signalHandler);
    signal(SIGBUS, signalHandler);
    signal(SIGPIPE, signalHandler);
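For reference, here is a slightly more defensive sketch of the same registration using sigaction. It is an assumption-laden illustration rather than what the linked project does: `install_crash_handlers`, `crash_signal_handler`, and `g_last_signal` are hypothetical names, and the sketch only covers a subset of the signals listed above. The handler restricts itself to async-signal-safe work, and SA_RESETHAND restores the default action once the handler has run, so a real fault still terminates the process after it has been recorded.

```c
#define _POSIX_C_SOURCE 200809L /* expose sigaction on strict compilers */
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Last fatal signal seen; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t g_last_signal = 0;

/* Only async-signal-safe calls are allowed here (write is, printf is not). */
static void crash_signal_handler(int signo) {
    static const char msg[] = "fatal signal caught\n";
    g_last_signal = signo;
    ssize_t written = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)written;
}

/* Registers the handler for common fatal signals. Returns 0 on success. */
int install_crash_handlers(void) {
    static const int signals[] = {
        SIGABRT, SIGILL, SIGSEGV, SIGFPE, SIGBUS, SIGPIPE
    };
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_signal_handler;
    sigemptyset(&sa.sa_mask);
    /* Restore the default action on entry, so that returning from the
     * handler after a real fault lets the process die as usual. */
    sa.sa_flags = SA_RESETHAND;
    for (size_t i = 0; i < sizeof signals / sizeof signals[0]; ++i) {
        if (sigaction(signals[i], &sa, NULL) != 0) {
            return -1;
        }
    }
    return 0;
}
```

A production handler would record far more than the signal number (typically the stack of the crashed thread), but the registration pattern stays the same.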

NSException Catching

NSException is also easy to handle. You can register an NSUncaughtExceptionHandler to capture the exception information, write the NSException details into the crash log, and upload it to the backend for analysis.

    // Register the uncaught exception handler.
    // handler has the signature: void handler(NSException *exception)
    NSSetUncaughtExceptionHandler(&handler);

Abort rate detection

There is currently no way to directly count how many times the app is killed because of excessive memory usage. The percentage is generally estimated by elimination. The principle is as follows:

  • On launch, set a flag
  • On normal exit, clear the flag
  • On crash, clear the flag
  • Shutdown due to low battery cannot be monitored directly; battery-level detection can be added to help with the judgment
  • On the next launch, if the flag still exists, count it as one abort and upload the data to the backend for statistics

[Figure: source: Alibaba Baichuan]
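The flag logic above can be sketched with a marker file. This is a minimal illustration under my own assumptions: `session_begin` and `session_end_cleanly` are hypothetical names, the flag path is chosen by the caller, and the low-battery case is not handled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Call once at launch. Returns true if the flag left by the previous
 * session is still present, meaning that session neither exited normally
 * nor went through the crash handler. Then sets the flag again for the
 * session that is starting now. */
bool session_begin(const char *flag_path) {
    assert(flag_path != NULL);
    bool previous_session_aborted = false;

    FILE *existing = fopen(flag_path, "r");
    if (existing != NULL) {
        previous_session_aborted = true;
        fclose(existing);
    }

    FILE *flag = fopen(flag_path, "w");
    if (flag != NULL) {
        fputs("running\n", flag);
        fclose(flag);
    }
    return previous_session_aborted;
}

/* Call on normal exit, and from the crash reporting path, to clear the flag. */
void session_end_cleanly(const char *flag_path) {
    assert(flag_path != NULL);
    remove(flag_path);
}
```

If `session_begin` returns true, the app reports one abort to the backend. When clearing the flag from inside a signal handler, an async-signal-safe call such as unlink is the safer choice.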

Interaction Monitoring

Page loading time is relatively easy to measure: use the Runtime to hook the corresponding lifecycle methods, such as viewDidLoad and viewWillAppear.

User interaction traces, such as which button was tapped and which page was opened, lean more toward user behavior collection. We have also developed our own codeless SDK (no manual instrumentation required) dedicated to collecting and analyzing user behavior data. Its core is likewise based on hooking and AOP. For details, please refer to my colleague's work.

Network Monitoring

For success rate, status code, traffic, and network response time, there are two main approaches:

  • Hook NSURLConnection, CFNetwork, and NSURLSession. The hooking technique can be method swizzling, Proxy, fishhook, and so on.
  • Use NSURLProtocol to intercept network requests and obtain information such as traffic and response time. NSURLProtocol has its own limitations, though: it can only intercept NSURLSession, NSURLConnection, and UIWebView, and it cannot intercept CFNetwork.

For the first method, you can refer to this picture to see which methods can be hooked.

Measuring the per-phase timings of HTTP and HTTPS requests (DNS resolution, TCP handshake, SSL handshake for HTTPS only, time to first packet, and so on) is somewhat harder.

However, because NSURLConnection, CFNetwork, and NSURLSession are all built on BSD sockets, we can try working at the socket level, much as we use the ViewController lifecycle methods to measure page loading time. We hook the socket-related functions: hooking connect gives us the start time of the TCP handshake, and hooking SSLHandshake gives us the start time of the SSL handshake. Tingyun currently offers an HTTP per-phase timing feature, which you can try out.

    int connect(int, const struct sockaddr *, socklen_t) __DARWIN_ALIAS_C(connect);

    OSStatus SSLHandshake(SSLContextRef ctx);
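To show what such a hook does once it is installed, here is a minimal C sketch of a timing wrapper around connect. It deliberately leaves out the installation step (on iOS a tool such as fishhook would rebind the symbol), and `timed_connect`, `original_connect`, and the two `last_connect_*` variables are hypothetical names used only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime on strict compilers */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <time.h>

typedef int (*connect_fn)(int, const struct sockaddr *, socklen_t);

/* A hooking library would hand us the original function pointer;
 * in this sketch we simply point at the real connect(). */
static connect_fn original_connect = connect;

/* Timing of the most recent call (globals kept simple for illustration). */
static int64_t last_connect_start_ns;
static int64_t last_connect_duration_ns;

static int64_t monotonic_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Replacement for connect(): record the start time, forward the call,
 * record the duration, and leave the result and errno untouched. */
int timed_connect(int fd, const struct sockaddr *addr, socklen_t len) {
    assert(original_connect != NULL);
    int64_t start = monotonic_ns();
    int rc = original_connect(fd, addr, len);
    int saved_errno = errno;
    last_connect_start_ns = start;
    last_connect_duration_ns = monotonic_ns() - start;
    errno = saved_errno;
    return rc;
}
```

Note that on a non-blocking socket connect returns immediately, so the start time is meaningful but the measured duration is not the handshake time. The end of the handshake would have to be observed elsewhere.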

However, Apple introduced App Transport Security (ATS) in iOS 9 and requires developers to use HTTPS. When I tried hooking the socket functions for HTTPS requests on iOS 9 and 10, some of the hooks failed. My guess is that Apple has hardened these code paths, which makes some system functions impossible to hook. As a result, I could not obtain the per-phase timings of HTTPS requests through the socket layer on iOS 9 and 10.

Apple did, however, introduce an API in iOS 10 that collects network timing information on iOS 10 and above:

    - (void)URLSession:(NSURLSession *)session task:(NSURLSessionTask *)task didFinishCollectingMetrics:(NSURLSessionTaskMetrics *)metrics

The printed output looks like this:

    (Fetch Start)             2017-02-24 09:03:06 +0000
    (Domain Lookup Start)     2017-02-24 09:03:06 +0000
    (Domain Lookup End)       2017-02-24 09:03:06 +0000
    (Connect Start)           2017-02-24 09:03:14 +0000
    (Secure Connection Start) 2017-02-24 09:03:14 +0000
    (Secure Connection End)   2017-02-24 09:03:16 +0000
    (Connect End)             2017-02-24 09:03:16 +0000
    (Request Start)           2017-02-24 09:03:16 +0000
    (Request End)             2017-02-24 09:03:16 +0000
    (Response Start)          2017-02-24 09:03:16 +0000
    (Response End)            2017-02-24 09:03:16 +0000
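Turning such timestamps into per-phase durations is simple subtraction. The C sketch below shows one possible breakdown. The struct and function names are hypothetical, timestamps are assumed to be in milliseconds, and the definitions of each segment (for example, counting the time to first packet from request start to response start) are my own assumptions rather than an official formula.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Timestamps in milliseconds, mirroring the fields printed above. */
typedef struct {
    int64_t fetch_start;
    int64_t domain_lookup_start, domain_lookup_end;
    int64_t connect_start, connect_end;
    int64_t secure_connection_start, secure_connection_end;
    int64_t request_start, request_end;
    int64_t response_start, response_end;
    bool is_https; /* the secure_connection_* fields are only valid if true */
} transaction_times_ms;

typedef struct {
    int64_t dns_ms;        /* DNS resolution */
    int64_t tcp_ms;        /* TCP handshake */
    int64_t tls_ms;        /* SSL/TLS handshake, 0 for plain HTTP */
    int64_t first_byte_ms; /* request start to first response byte (assumed) */
    int64_t total_ms;      /* fetch start to response end */
} segment_durations;

segment_durations compute_segments(const transaction_times_ms *t) {
    assert(t != NULL);
    segment_durations d;
    d.dns_ms = t->domain_lookup_end - t->domain_lookup_start;
    /* For HTTPS, connect_end includes the TLS handshake, so the TCP part
     * ends where the secure connection starts. */
    int64_t tcp_end = t->is_https ? t->secure_connection_start : t->connect_end;
    d.tcp_ms = tcp_end - t->connect_start;
    d.tls_ms = t->is_https
        ? t->secure_connection_end - t->secure_connection_start
        : 0;
    d.first_byte_ms = t->response_start - t->request_start;
    d.total_ms = t->response_end - t->fetch_start;
    return d;
}
```

Applied to the output above, which only has one-second resolution, this gives 2 seconds for the SSL handshake and 10 seconds in total. The 8 seconds between the end of the domain lookup and the start of the connection do not fall into any of these segments.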

Of course, if you have a good solution for obtaining the timing of each network phase, please leave a comment and let me know. Basic indicators such as dimension information and memory are easy to obtain, so I will not go into detail about them here.

Bonus: further reading

While researching and learning APM technology, I came across many excellent blog posts, which I recommend here. Help yourself to whichever you need.

  • Mogujie mobile terminal full-link tracking and guarantee system

http://t.cn/R5whClL

  • Meituan Takeaway Mobile Terminal Performance Monitoring System Implementation

http://t.cn/RIUcX0o

  • WeChat Reading iOS Quality Assurance and Performance Monitoring

http://t.cn/RibKdFW

  • NetEase NeteaseAPM iOS SDK technical implementation sharing

http://t.cn/R5ZyWVt

  • Alibaba Baichuan MaLi APP monitoring has arrived. A heavyweight player has entered the APM market

http://t.cn/RfjDrvt

  • APM Best Practices Series Article Collection

http://t.cn/RxZQOto

  • Taobao Mobile: Rapid Operation and Maintenance Delivery Practice for an App with 100 Million Users

http://t.cn/RibFFYO
