How to evaluate the stability and quality of an App?

Crashes, stuttering, and abnormal exits are three common problems that affect app stability. Available data suggests that when the crash rate exceeds 0.8% on iOS or 0.4% on Android, the number of active users drops significantly. Instability not only interrupts key business flows, lowers user retention, and damages brand reputation, it also leads directly to uninstalls and user loss, and brings developers financial losses that should not be underestimated.

So, does an app with a low crash rate have high quality? Can we directly judge the stability of an app by its crash rate?

First, we need a unified standard for measuring the quality of an app: which metrics can serve as the yardstick for stability? Taking the stability rate defined by Umeng+ U-APM as an example, the stability and quality of an app are generally evaluated from the following three aspects:

  • Crashes, such as Java crashes and Native crashes, measured by the crash rate.
  • ANRs (Application Not Responding), measured by the ANR rate.
  • Abnormal exits, such as being killed by the low memory killer, being swiped out of the task list, system abnormalities, power loss, or a user-triggered shutdown/restart, measured by the abnormal exit rate.

A crash means that the program hit an exception that caused it to exit. This includes:

  • Java crash: an uncaught exception occurs in the Java code and causes the program to exit abnormally, for example a null pointer exception or an array out-of-bounds exception.
  • Native crash: an error in the native code generates a signal that causes the program to exit abnormally, for example accessing an illegal address or a misaligned address.

Capturing Java crashes is relatively simple, while capturing Native crashes requires some understanding of how the underlying system works. Android is based on Linux, and most crashes in the system are caused by coding errors or hardware errors. When the system hits an unrecoverable error, it raises an exception through a hardware or software interrupt, and the kernel delivers these exceptions to the process uniformly as signals. When an application receives a signal, it is handled according to the kernel's default action for that signal: Term, Ign, Core, Stop, or Cont. We can also register our own handler with sigaction to specify the handling action, for example to capture crash information.

There are some difficulties in the capture process, especially under extreme conditions. When the stack overflows, for example, the stack space is exhausted, so our signal handler cannot be called on it and the crash information cannot be captured. In that case we need sigaltstack, which lets the signal handler run on a separately allocated block of memory (for example, from the heap) that serves as an "alternate signal stack".

Beyond stable and safe capture, we also need to enrich the context of the crash scene with Logcat output, call stacks, device information, environment information, and so on, to provide a comprehensive reference for locating and solving the problem later.

For crashes, we use the crash rate as the metric. It includes:

  • UV crash rate: deduplicated users who experienced a crash / deduplicated active users.
  • PV crash rate: number of crashes / number of app launches.
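As a small worked example, the two ratios above can be computed as follows. The function names are hypothetical, and returning 0.0 for an empty denominator is an assumption made for this sketch.

```c
/* Returns numerator / denominator, or 0.0 when the denominator is zero. */
static double ratio(unsigned long numerator, unsigned long denominator) {
    if (denominator == 0) return 0.0;
    return (double)numerator / (double)denominator;
}

/* UV crash rate: distinct users who crashed / distinct active users. */
static double uv_crash_rate(unsigned long crashed_users,
                            unsigned long active_users) {
    return ratio(crashed_users, active_users);
}

/* PV crash rate: number of crashes / number of app launches. */
static double pv_crash_rate(unsigned long crash_count,
                            unsigned long launch_count) {
    return ratio(crash_count, launch_count);
}
```

With made-up counts of 8 crashed users out of 1,000 active users, uv_crash_rate(8, 1000) is 0.008, or 0.8%. With 20 crashes over 5,000 launches, pv_crash_rate(20, 5000) is 0.004, or 0.4%.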

The startup crash rate, which covers crashes that occur while the application is starting, is an easily overlooked but very important metric. Startup is a critical stage in the app life cycle: advertisements, splash screens, campaigns, and other content are shown during it, and many initialization steps also run at this point. If an error occurs at startup, hotfix, degradation, and disaster-recovery strategies often cannot make up for it.

ANR, or Application Not Responding, occurs when an application fails to respond in time for a certain period. The system then shows a dialog that lets the user choose to keep waiting or force close the app. From a user-experience perspective, an ANR can sometimes be worse than a crash, so developers should pay as much attention to ANRs as to crashes.

The accuracy of ANR capture has improved through continuous iteration. In the early days, we used FileObserver to watch for changes to the /data/anr/traces.txt file and then captured and reported the ANR. Unfortunately, as Android versions moved forward, the system and device manufacturers tightened permissions on system files. The coverage of this approach became increasingly limited, and the accuracy of ANR capture kept declining.

We then improved ANR capture by monitoring how long the message queue takes to run: we post an empty message to the main thread Looper and check whether it has been executed after 5 seconds. However, this approach does not truly capture ANRs (it produces both false negatives and false positives), and it cannot obtain the complete ANR content.

Later, we studied how Android itself implements ANR detection and built a real-time, accurate ANR capture solution that is compatible with all system versions. After the system_server process detects that an ANR has occurred in an app, it sends SIGQUIT (signal 3) to the process in which the ANR occurred. By default, the system's libart.so receives the signal and calls the Java virtual machine's dump method to generate traces.
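For comparison, the message-queue check that the signal-based approach replaced can be modeled in a few lines. This is a simplified model in plain C, not Android code: the watchdog_t type and its functions are hypothetical, and timestamps are passed in explicitly so that the logic can be followed without real threads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the "empty message" probe; times in milliseconds. */
typedef struct {
    bool pending;        /* true while the probe has not been executed yet */
    uint64_t posted_at;  /* time at which the probe was posted */
    uint64_t timeout_ms; /* allowed delay, 5000 in the scheme above */
} watchdog_t;

static void watchdog_init(watchdog_t *wd, uint64_t timeout_ms) {
    wd->pending = false;
    wd->posted_at = 0;
    wd->timeout_ms = timeout_ms;
}

/* Watchdog side: post an empty probe message to the main thread's queue. */
static void watchdog_post_probe(watchdog_t *wd, uint64_t now_ms) {
    wd->pending = true;
    wd->posted_at = now_ms;
}

/* Main-thread side: called when the probe reaches the front of the queue.
   In real code this flag is shared between threads and must be atomic. */
static void watchdog_probe_executed(watchdog_t *wd) {
    wd->pending = false;
}

/* Watchdog side: true if the probe is still waiting after the timeout. */
static bool watchdog_main_thread_blocked(const watchdog_t *wd,
                                         uint64_t now_ms) {
    return wd->pending && (now_ms - wd->posted_at) >= wd->timeout_ms;
}
```

The model also shows why this check misreports. It only knows that the main thread was busy for a fixed interval, which is not the same as the system actually declaring an ANR, and it carries none of the trace content the system produces.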

By intercepting SIGQUIT, we receive the signal first when an ANR occurs and generate the traces and ANR logs ourselves. After handling the signal, we pass it on to the system so that it can generate its traces file. When generating traces files, we keep the content consistent with the native system's output and significantly speed up generation, which effectively avoids being killed by system_server with SIGKILL (signal 9) because trace generation took too long. We also enrich the captured content with the cause that triggered the ANR, the CPU usage of the top processes on the phone, the CPU usage of the top threads in the ANR process, the distribution of CPU core time, the wait time of disk I/O operations, and other important information, which provides stronger support for analyzing, locating, and solving ANR problems.
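The "receive first, then pass on" idea can be illustrated with a generic POSIX sketch that saves the previously installed SIGQUIT action and forwards to it. All names here are hypothetical, and stand_in_previous_handler merely plays the role of whatever handled SIGQUIT before. On Android the runtime's own SIGQUIT handling may not be an ordinary handler, so the forwarding step there would differ and is not shown.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

static struct sigaction previous_action; /* action installed before ours */
static volatile sig_atomic_t anr_signal_seen = 0;
static volatile sig_atomic_t stand_in_handler_ran = 0;

/* Hypothetical stand-in for the handler that owned SIGQUIT before us. */
static void stand_in_previous_handler(int sig) {
    (void)sig;
    stand_in_handler_ran = 1;
}

static int install_stand_in_previous_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = stand_in_previous_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGQUIT, &sa, NULL);
}

/* Our handler sees SIGQUIT first, then forwards it to the saved action. */
static void anr_signal_handler(int sig, siginfo_t *info, void *context) {
    /* Real code would only wake a dedicated thread here and collect the
       traces and ANR logs on that thread, outside the signal handler. */
    anr_signal_seen = 1;

    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != NULL) {
            previous_action.sa_sigaction(sig, info, context);
        }
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(sig);
    }
    /* If the previous action was SIG_DFL or SIG_IGN, nothing is forwarded. */
}

/* Installs our handler and remembers the previous one. Returns 0 on success. */
static int install_anr_signal_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = anr_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    return sigaction(SIGQUIT, &sa, &previous_action);
}
```

Calling install_stand_in_previous_handler(), then install_anr_signal_handler(), and finally raise(SIGQUIT) runs both handlers in order: ours first, then the saved one.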

Similarly, for ANRs we distinguish between the UV ANR rate and the PV ANR rate. Both are calculated in the same way as the corresponding crash rates above.

Beyond crashes and ANRs, the abnormal exit scenario is often ignored, yet abnormal exits frequently reveal problems that cannot be captured through normal means, such as kills by the low memory killer and system restarts. Examples include sudden exits caused by compatibility issues, device restarts, and third-party libraries that call exit functions on their own, all of which increase the number of times the app closes unexpectedly, along with other hard-to-find problems. The abnormal exit rate therefore gives us a more comprehensive way to understand and measure the stability of an application.

In summary, you should now have the answer to the questions raised at the beginning of the article. We should also not sidestep problems by manually wrapping code in try/catch to cover up code-quality issues. Doing so may still interrupt normal use and leave users facing a failure that visibly blocks them. We should start from what users actually experience when using the app, and capture and handle problems promptly when they occur.

Improving the stability of an app is a long-term, iterative process. U-APM is a useful tool for improving efficiency and reducing costs along the way. It provides collection, parsing, aggregation, and analysis capabilities. In the next issue, we will explain how to use U-APM to handle crashes, ANRs, and other issues. Stay tuned.
