Mobile domain full-link observability architecture and key technologies

Challenges of the Existing App Architecture

Since 2013, when Alibaba Group went all-in on wireless, the company has spent more than a decade building mobile technology, passing through several key stages:

  • In the first phase, we solved the pain points of large-scale concurrent business development and defined the Atlas architecture (a containerized framework providing component decoupling and dynamism).
  • In the second phase, we built ACCS (Taobao's full-duplex, low-latency, high-security wireless channel service), a long-lived encrypted connection that completed end-to-end interoperable mobile service capabilities and caught up with the industry;
  • In the third phase, dynamic R&D frameworks such as Weex and mini programs were built around business characteristics, and mobile technology entered the dynamic, cross-platform era.

In the middle and late stages, the various BUs were connected and capabilities were built jointly through the Ali Mobile Group mechanism. Since then the mobile infrastructure has basically taken shape, with reusable capabilities accumulated in each field, and the App has settled into a three-layer architecture: upper-layer business, an intermediate R&D framework or container layer, and basic capabilities. As the builder of the wireless infrastructure, our team used to focus on building the group's basic mobile capabilities; in recent years it has focused on performance optimization in Taobao business scenarios. While analyzing the App architecture and its call links horizontally during the experience-optimization project, we found the following problems common to the group's Apps:

(Figure 1 Taobao App Architecture Challenge)

  • Low O&M troubleshooting efficiency: First, in the monitoring stage, most problems are either not monitored at all or the reported information cannot support deeper analysis, so troubleshooting falls back on logs. Second, logs are often missing: when an exception occurs they are not uploaded proactively and must be retrieved manually, which fails if the user is offline; even after retrieval, the logs are frequently unreadable. For links that involve the server, EagleEye logs on the backend are kept for only 5 minutes. By the time this round is over, half a day has usually passed.
  • End-to-end tracing is incomplete: A complete business link crosses many layers on both the device and the cloud. Placing an order, for example, passes through several client modules, triggers N backend application calls once the network request reaches the server, and is exposed to mobile-network instability along the way. Which of these calls can break the transaction? Which steps slow the whole flow down? When a request gets no response, is it a server problem or a network problem? If the performance of each call along the full link is not clearly defined, problems at each layer cannot be fully exposed. In addition, client-side calls are naturally asynchronous, which makes measuring each stage and stitching the full link together a major challenge. Today there is no unified calling specification across the client layers and no topology structure, so the call link cannot be reconstructed and end-to-end tracing is impossible.
  • Optimization lacks a unified caliber: In the past, each R&D framework defined its performance caliber in its own closed loop, and both client-native and cross-platform technologies collected metrics from a purely technical perspective. This naturally produced huge differences across businesses and, in plain terms, drifted away from what users actually experience, so online data could not reflect the real situation or its trend. Taobao's experience deteriorated for a long time as a result, and every year a campaign-style effort was needed to claw it back, with nothing sustained day to day.
  • Mobile PaaS enabling costs: After a large number of SDK components were exported to the group's BUs and the basic capabilities were embedded in different App host environments, the same problems appeared there too. For engineers in each BU the infrastructure is even more of a black box: if a problem touches it, troubleshooting gets harder still, and with no self-diagnosis tooling available, the only option is to come and ask, pulling in group after group and person after person at a high support cost.

The above are some reflections, from the perspective of App architecture, on the client's current shortcomings in O&M troubleshooting, measurement and monitoring, and full-link optimization, and they set our subsequent direction of effort.

Observability system

The Evolution of Monitoring to Observability

Observability is a set of concepts and systems with no prescribed technical implementation; the key is to adopt the concept and apply it to business iteration and problem insight. Traditional operations may only give us top-level alerts and an overview of anomalies. When deeper error localization is needed, people are typically pulled into a chat group, problem characteristics are searched for by hand, and the developer of some module ends up analyzing the dependencies between modules. Handling one problem usually involves more than three roles (business, testing, development, architecture, platform, and so on).

Compared with traditional monitoring, observability links data together organically, helping us observe how the system runs and locate and solve problems quickly. "Monitoring tells us which parts of the system are working; observability tells us why they are not working." Figure 2 below illustrates the relationship: monitoring focuses on dashboard-level display, while observability subsumes the scope of traditional monitoring.

(Figure 2 The relationship between monitoring and observability)

As the figure shows, the core is to observe the outputs of each module, its key calls, its dependencies, and so on, and to judge the system's overall working state from those outputs. The industry usually summarizes these key signals as Traces, Loggings, and Metrics.

Observability Key Data

(Figure 3 Observability key data)

Combining the definitions of Traces, Loggings, and Metrics with the current situation of Taobao, we made some interpretations:

  • Loggings: based on the existing TLOG (wireless end-to-end logging system) channel, these record events generated while the App runs and logs emitted during program execution. They describe the system's operating state in detail: page jumps, request logs, global CPU and memory usage, and so on. Most logs are not connected in series; now that structured call-chain logs have been introduced, logs in call-chain scenarios can effectively be converted into traces, supporting single-device troubleshooting;
  • Metrics: aggregated values used for dashboards. They lack the detail needed for problem localization and generally carry various dimensions and indicators. Metric volume is usually large, so sampling control is needed in specific scenarios.
  • Traces: the most standard form of call log. Besides defining the parent-child relationship of calls (usually via TraceID and SpanID), a trace generally records the service, method, attributes, status, time consumption, and other details of an operation. Traces can replace part of what logs do, and in the long run aggregating traces can also yield per-module and per-method metrics, but trace storage is large and costly.
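The complementarity between traces and metrics described above can be sketched in code: once call-chain logs are structured as spans, per-module metrics fall out of a simple aggregation. A minimal Java illustration (the `Span` shape and module names here are hypothetical, not Falco's actual schema):

```java
import java.util.*;

public class SpanToMetrics {
    // A structured trace record: module name plus elapsed time.
    record Span(String module, long durationMs) {}

    // Aggregating spans by module yields a latency metric per module.
    static Map<String, Double> avgDurationByModule(List<Span> spans) {
        Map<String, long[]> acc = new HashMap<>(); // module -> [count, totalMs]
        for (Span s : spans) {
            long[] a = acc.computeIfAbsent(s.module(), k -> new long[2]);
            a[0]++;
            a[1] += s.durationMs();
        }
        Map<String, Double> out = new HashMap<>();
        acc.forEach((m, a) -> out.put(m, (double) a[1] / a[0]));
        return out;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
            new Span("mtop", 120), new Span("mtop", 80), new Span("image", 40));
        // Prints per-module average latencies, e.g. {image=40.0, mtop=100.0} (order may vary).
        System.out.println(avgDurationByModule(spans));
    }
}
```

The reverse direction is the cost caveat in the text: deriving metrics from traces is easy, but storing every span to make this possible is what drives log volume up.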

Full-link observability architecture

The observability concepts above have been put into practice on the backend. The mobile field, however, has its own characteristics and current state, which raise the following problems:

  • Calling specifications: unlike the cloud, the client is fully asynchronous, its asynchronous APIs are extremely varied, and there is no unified calling specification;
  • Multiple technology domains: there are many R&D frameworks whose capabilities are black boxes to the outside; connecting them carries a lot of invisible cost.
  • Differences between device and cloud: the massive fleet of distributed devices makes the observability challenge fundamentally different from the server side. On the server, logging and metrics can be fully reported and built on a single system, but on the device the tracking points and log volume of an individual machine vary widely, which is why the device-side tracking system and logging system are separate. The device side has to balance single-machine troubleshooting across a massive fleet with indicator trends defined over big data.
  • End-to-cloud association: device and cloud have long been disconnected. How do we perceive backend state from the device's perspective and associate the two, e.g., extending serverRT (backend request processing time) measurement from IDC (Internet Data Center) out to CDN coverage, and making the device-side full-link identifier visible to the backend?

Around these issues, therefore, we need to define the full link for the mobile field and establish domain-level analysis capabilities and sound evaluation standards, so that we gain deeper insight into mobile-side problems and can keep serving the group's Apps on troubleshooting, performance measurement, and cross-domain issues.

(Figure 4: Definition of the full-link observability architecture)

  • Data layer: defines indicator specifications and collection schemes, and reports data based on OpenTracing (the distributed tracing specification);
  • Domain layer: evolves from problem discovery to problem localization, builds a continuous performance-optimization system, and accumulates technology upgrades;
  • Platform layer: compares the group's Apps against competitors, combines online and offline indicators, introduces the vendor's perspective, and drives App performance improvement;
  • Business layer: connects end to cloud from a full-link perspective, serving not only client engineers but also R&D personnel across other technology stacks.

Looking back at the goal of the full-link observability project, we set it as "building a full-link observability system, improving performance and driving business experience improvements, and improving problem location efficiency." Subsequent chapters will focus on explaining the practices of each layer.

Mobile OpenTracing observability architecture

Full link composition

(Figure 5 End-to-end situation, detailed scenario layering diagram)

The existing end-to-cloud link is long, and the device side hosts a variety of R&D frameworks and capabilities. Although the backend call link is clear, from the whole-link perspective it is not connected to the device. Take a user browsing the product detail page: opening the first screen triggers a particular calling sequence across the three modules Ao Chuang, MTOP (the wireless gateway), and DX, each module with its own processing flow and each stage with its own time consumption and status (success, failure, and so on). Continue to scroll and the calling sequence of the modules changes, so different scenarios are arbitrary combinations of a few elements. The full link therefore needs to be defined along several dimensions according to the actual user scenario:

  • Scenario definition: one user operation is one scenario, e.g., a click or a scroll each forms a scenario; a scenario can also be a combination of several single scenarios;
  • Capability layering: a scenario involves business, framework, container, and request calls, and each field can be layered;
  • Stage definition: each layer has its own stages, e.g., a framework has 4 local stages, while a request can include the backend server-processing stage;
  • User journey: a journey consists of several scenarios.

The full link thus decomposes complex large calls into a limited set of structured small calls, from which various cases can be derived:

  • "Single scene + single stage" combined full link;
  • The full link of "single scene + several layers + several stages" combination;
  • A full link of "several scenarios + several layers + several stages" combination;
  • ... ...

Falco: a model based on OpenTracing

To align with the industry standard of Logs + Metrics + Tracing, the full link adopts the OpenTracing distributed tracing specification and remodels the client architecture described above on top of it (the result is hereinafter referred to as Falco).

The OpenTracing specification is the modeling basis of Falco and is not reproduced here; see the OpenTracing design specification at https://opentracing.io/docs/overview/ for the complete reference. Falco defines the call-chain tracing model for the device field, with the main table structure as follows:

(Figure 6 Falco data table model)

  • Span public header: the yellow part corresponds to the basic Span attributes in the OpenTracing specification;
  • scene: corresponds to OpenTracing's baggage, transparently propagated downward from the root span to carry the business scenario. The naming rule is "businessIdentifier_behavior", e.g., the detail first screen is ProductDetail_FirstScreen and a detail refresh is ProductDetail_Refresh;
  • layer: corresponds to OpenTracing's Tags and defines the concept of layers, currently business, container, and capability. Modules handling business logic belong to the business layer, named business; modules providing view containers, such as DX and Weex, belong to the container & framework layer, named frameworkContainer; modules providing a single atomic capability, such as mtop and image loading, belong to the capability layer, named ability. Layers allow performance comparisons between different modules of the same layer and capability;
  • stages: corresponds to OpenTracing's Tags and lists the stages a module call contains. Each layer is divided into key stages based on its domain model, so that different modules at the same layer share a consistent comparison standard. When comparing DX and TNode, for example, their strengths and weaknesses can be measured by preprocessing time, parsing time, and rendering time. The preprocessing stage, for instance, is named preProcessStart; stages can also be customized.
  • module: corresponds to OpenTracing's Tags and denotes a logical module such as DX, mtop, the image library, or the network library;
  • Logs: corresponds to OpenTracing's Logs; these are recorded only in TLOG and are not output to UT tracking points.
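The table model can be pictured as a span object whose `scene` baggage is inherited unchanged from the root, while `layer`, `module`, and `stages` are per-span tags. A simplified Java sketch (field names follow the description above, but the real Falco schema may differ):

```java
import java.util.*;

public class FalcoSpan {
    final String traceId, spanId, module, layer; // layer: business | frameworkContainer | ability
    final String scene;                          // baggage, propagated down from the root span
    final Map<String, Long> stages = new LinkedHashMap<>(); // stage name -> timing

    FalcoSpan(String traceId, String spanId, String scene, String module, String layer) {
        this.traceId = traceId; this.spanId = spanId;
        this.scene = scene; this.module = module; this.layer = layer;
    }

    // A child span keeps the trace ID and inherits the scene baggage unchanged,
    // while its module and layer describe the callee.
    FalcoSpan child(String childSpanId, String module, String layer) {
        return new FalcoSpan(traceId, childSpanId, scene, module, layer);
    }

    public static void main(String[] args) {
        FalcoSpan root = new FalcoSpan("t-1", "0", "ProductDetail_FirstScreen", "detail", "business");
        FalcoSpan mtop = root.child("0.1", "mtop", "ability");
        mtop.stages.put("preProcessStart", 12L);
        System.out.println(mtop.scene); // ProductDetail_FirstScreen
    }
}
```

Because every span carries the same scene, queries like "all ability-layer calls during ProductDetail_FirstScreen" become single-table filters rather than joins.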

Falco - Key Implementation Points

(Figure 7 Falco key implementation)

  1. Device-side traceID: generated under the principles of uniqueness, fast generation, scalability, readability, and short length;
  2. Call & restore abstraction: the traceID and multi-level span sequence numbers are transparently propagated so that upstream and downstream relationships are unambiguous;
  3. End-to-cloud concatenation: the core is stitching device and cloud together. The device-side ID is transparently passed to the server, which stores a mapping to the EagleEye ID; the access layer returns the EagleEye ID so the device-side full-link model carries it too. With this bidirectional mapping we can tell whether an unanswered request failed in the network stage, never reached the access layer, or got no reply from the business service, turning the familiar, coarse-grained "network problem" into something definable and explainable.
  4. Layered measurement: the core purpose is to give different modules in the same layer a consistent comparison caliber and support horizontal performance comparison after framework upgrades. The idea is to abstract the client domain model: for frameworks, for example, although implementations differ, certain key calls and phases are shared and can be abstracted into standard stages; other layers are similar.
  5. Structured tracking: first, columnar storage makes large data sets easy to aggregate and compress, reducing data volume; second, business + scenario + stage are deposited into one table to ease correlated queries;
  6. Domain problem precipitation based on Falco: including key definitions of complex problems, clue-style logs for chasing problems, and tracking points for special demands. All domain-problem information is structured into Falco, and domain developers can keep building analysis capabilities on top of it. Only when effective data supply and domain interpretation are unified can deeper problems be defined and solved.

(Figure 8 Falco domain problem model)
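Point 2's multi-level span sequence numbers (a later troubleshooting example shows a span with ID 0.1) make the call tree restorable offline: if a child's ID is its parent's ID plus a dotted suffix, the parent is simply the prefix. A hypothetical sketch of that restore step, assuming this dotted-ID convention:

```java
public class SpanTree {
    // Parent of "0.1.2" is "0.1"; the root span "0" has no parent.
    static String parentOf(String spanId) {
        int i = spanId.lastIndexOf('.');
        return i < 0 ? null : spanId.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(parentOf("0.1.2")); // 0.1
        System.out.println(parentOf("0"));     // null
    }
}
```

With parent IDs derivable from the ID itself, the call topology can be rebuilt from a flat log file even when spans arrive out of order.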

Operation and maintenance practice based on Falco

The scope of operation and maintenance is extremely broad; here we focus on the key processes of problem discovery, handling, localization, and repair, ranging from fleet-wide indicator observation and alerting down to single-machine troubleshooting and log analysis. Everyone knows these should be built, and each process involves substantial capability construction, but in practice they are hard to land and hard to get buy-in for. The Taobao client has long suffered from inaccurate indicators and slow log retrieval. Many of Taobao's APM performance indicators, for example, used to be inaccurate; business colleagues did not trust them and they could not guide real optimization. This chapter shares Taobao's optimization practices around indicator accuracy and log-retrieval efficiency.

(Figure 9 Problem reversal of user flow and operation and maintenance system)

Macro indicator system

Using the company-wide performance campaign as the opportunity, APM began upgrading its indicators around the user's perceived experience, chiefly startup, external links, and the visible/interactive indicators of each business scenario. Bringing an indicator's end point closer to what the user actually perceives involved the following work:

  • 8060 algorithm upgrade: extract and compute the visually useful elements (such as images and text) and exclude elements users cannot perceive (blank controls, fallback images); for example, formulate view visual specifications to support labeling by the image library, fishbone diagrams, and other custom controls;
  • H5 field: support visible/interactive detection of UC page elements plus the front-end JSTracker (event tracking framework) backtracking algorithm, aligned with the H5 page visual algorithm;
  • Deep and complex scenarios: formulate visual specifications for custom frameworks, and align and calibrate the various R&D frameworks such as Flutter and TNode (a dynamic R&D framework), each implementing the 8060 algorithm;
  • External links: bring H5 pages into scope and redefine negative actions such as abandoning an external link.
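The calibration idea of keeping only what users can perceive amounts to a filter over rendered elements: the visible-complete time is the latest paint among useful elements. A toy sketch (the element fields and the "useful" rule are drastic simplifications of the real 8060 algorithm):

```java
import java.util.*;

public class VisibleComplete {
    record Element(String kind, boolean blank, long paintMs) {} // kind: image | text | control

    // Visually useful elements: non-blank images and text; blank controls
    // and fallback placeholders are excluded from the measurement.
    static boolean useful(Element e) {
        return !e.blank() && (e.kind().equals("image") || e.kind().equals("text"));
    }

    // Visible-complete time = last paint among elements the user can perceive.
    static long visibleCompleteMs(List<Element> page) {
        return page.stream().filter(VisibleComplete::useful)
                   .mapToLong(Element::paintMs).max().orElse(0);
    }

    public static void main(String[] args) {
        List<Element> page = List.of(
            new Element("text",  false, 180),
            new Element("image", false, 320),
            new Element("control", true, 900)); // blank control: ignored
        System.out.println(visibleCompleteMs(page)); // 320
    }
}
```

Note how including the blank control would have pushed the metric to 900 ms; excluding it is exactly why the calibrated numbers move while matching perception better.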

Taking startup as an example: after APM calibration, which now includes the image display stage, the numbers went up, but they better match what the business perceives.

(Figure 10 Startup data trend after calibration)

Taking external links as an example: after H5 was brought into scope, the new caliber also went up, but it better matches the perceived experience.

(Figure 11 Comparison of the calibrated external chain data before and after calibration)

This campaign delivered the calibration of several visual indicators and of the R&D frameworks.

Single machine troubleshooting system

For troubleshooting, the core is still TLOG. Here we focus on the problems hit in the key steps of log reporting, log analysis, and localization diagnosis during user-problem troubleshooting (no logs, unreadable logs, hard-to-pinpoint causes) and describe what the O&M troubleshooting system has done to raise localization efficiency.

(Figure 12 Single-machine troubleshooting and problem location core functions)

  • Improve the log upload success rate so logs are there when troubleshooting. First, build in proactive log upload, triggered in core scenarios and at problem-feedback moments, to raise log reach, e.g., on user-feedback reports or when a newly launched feature misbehaves. Second, upgrade TLOG itself with sharding strategies, retries, log governance, and other optimizations to fix the long-standing timeliness problems of log upload. Finally, collect all kinds of exception information as snapshots and report them out-of-band over the MTOP link to help reconstruct the scene.
  • Improve log localization efficiency. First classify logs, e.g., distinguishing page logs from full-link logs, to support fast screening and filtering; then surface the full-link call topology for each scenario, so it is immediately clear on which node a problem occurred and to whom it should be dispatched; finally, structure errors, slowness, UI jank, and similar problems. The principle is that the right to interpret domain problems belongs to the domain: jank logs alone come in several types (APM frozen frames, ANR, main-thread stalls, etc.), and business types include request failures, request RT above a threshold, white screens, and so on. Connecting each domain's capabilities speeds up diagnosis and localization;
  • Build full-link tracing capability. EagleEye (the backend implementation of distributed tracing at Alibaba) serves many businesses and carries huge log volume, so sampling is unavoidable; calls that miss the sample are cached for only 5 minutes, so we must notify EagleEye within that window to retain them longer. In stage 1, a backend parsing service extracts the EagleEye ID from the call chain and asks the EagleEye service to persist the corresponding trace logs; once notified, they are kept for 3 days. In stage 2, the gateway detects anomalies, extracts the EagleEye ID, and notifies EagleEye to persist pre-emptively. In stage 3, similar to scenario tracking, the EagleEye trace logs of core scenarios are fetched and stored on the Ferris wheel platform. Stage 1 is live and supports linked jumps to the EagleEye platform, but since more than 5 minutes usually elapse between a problem occurring and someone investigating, its hit rate is modest; stages 2 and 3, now under planning and development, are needed to raise it;
  • Build platform capability on top of device-side full-link log analysis. For visualization, full-link log content is displayed in structured form, making node anomalies quick to spot; on top of the structured logs, problems such as time-consumption anomalies, interface errors, and payload-size anomalies along the full link can be diagnosed quickly.
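The "right to interpret domain problems belongs to the domain" idea amounts to a rule table over structured logs: each domain registers predicates that turn raw entries into named diagnoses. A toy sketch (the rule names, entry fields, and thresholds are invented for illustration):

```java
import java.util.*;
import java.util.function.Predicate;

public class LogDiagnosis {
    record Entry(String type, long rtMs) {}            // a structured log line
    record Rule(String name, Predicate<Entry> test) {} // a domain-owned diagnosis rule

    // Run every domain's rules over the structured entries; each hit is a finding.
    static List<String> diagnose(List<Entry> entries, List<Rule> rules) {
        List<String> findings = new ArrayList<>();
        for (Entry e : entries)
            for (Rule r : rules)
                if (r.test().test(e)) findings.add(r.name());
        return findings;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule("ANR", e -> e.type().equals("anr")),
            new Rule("slow-request", e -> e.type().equals("mtop") && e.rtMs() > 1000));
        System.out.println(diagnose(
            List.of(new Entry("mtop", 2300), new Entry("anr", 0)), rules));
        // [slow-request, ANR]
    }
}
```

The point of the structure is that the jank domain, the request domain, and so on each maintain their own rule set, while the platform only runs the shared loop.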

These are this year's O&M attempts: using technology upgrades to replace process-driven support with technology-driven support in the troubleshooting field.

Next we show the practice in Taobao and the results of onboarding other group Apps.

Full-link operation and maintenance practice

1   Troubleshooting Taobao lag issues

Colleagues reported that when using the Taobao App overseas they hit lag and some pages failed to open. Following the troubleshooting process above, the TLOG logs were retrieved.

  • The "full-link visualization" function (Figure 13) shows that the network status of the H5 page with spanID 0.1 is "failed", which is why the page would not open;
  • The "full-link diagnosis" time-consumption anomaly function (Figure 14) shows large numbers of network calls taking 2s, 3s+, and even 8s+. The time is spent in the request transmission stage, which correlates with overseas users' slow access to Alibaba's CDN nodes.

(Figure 13 Full-link visualization function)

(Figure 14 Full-link jamming diagnosis function)

2   Ele.me main link access

Cold start full link

(Figure 14 Ele.me full link view - cold start full link)

Store full link

(Figure 15 Ele.me full-link view - store full-link)

Optimization practice based on Falco

New indicator system

Next, let's look at how, from an end-to-cloud full-link perspective, we build an online performance baseline on the Falco observability model and use data to drive continuous improvement of the Taobao App experience. The first step is the data indicator system, which mainly covers the following points:

  • Indicator definition and specification: stay close to user experience and define indicators around the user's operation flow from click to content presentation to scrolling, covering technical scenarios such as page open, content on screen, click response, and scrolling. Content presentation is measured by page visibility/interactivity and image on-screen indicators; scrolling by finger-tracked scroll frame rate, frozen frames, and similar indicators;
  • Indicator measurement scheme: in principle, indicators belong to their own fields. A jank indicator, for example, may use the vendor caliber (Apple MetricKit), the self-built caliber (APM main-thread stall/ANR, etc.), or a business domain's custom indicator under the full-link scenario, such as MTOP request failure or the detail header image on screen;
  • Indicator composition: online aggregate indicators plus offline aggregate indicators, grounded in online/offline data and their specifications, driving App experience optimization from the user's perspective and the competitive situation.

(Figure 16 App performance indicator system)
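Aggregate indicators such as the P99/P999 latencies reported later in this article are order statistics over collected samples. A generic nearest-rank percentile sketch (this is illustrative, not the platform's actual aggregation code):

```java
import java.util.*;

public class Percentile {
    // Nearest-rank percentile: the value at position ceil(p * n) in the sorted samples.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] rt = new long[100];
        for (int i = 0; i < 100; i++) rt[i] = i + 1; // latencies 1..100 ms
        System.out.println(percentile(rt, 0.99)); // 99
        System.out.println(percentile(rt, 0.5));  // 50
    }
}
```

High percentiles like P999 are the long-tail lens used in the optimization results below: a median can look healthy while the tail is what users on weak networks actually feel.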

  1. Taking APM as an example, the sliding-related indicators are defined as follows:

(Figure 17 APM related indicator definition scheme)

  2. Taking the full link of a scenario as an example: for a specific business and a specific user interaction, from the initiation of the response to its end, spanning front end, server, and client, the complete call link defines the detail first-screen indicator under the scenario's full link:

(Figure 18 Scenario full link - details first screen definition)

And many more...

Optimization under the new indicator system

FY22 platform technology centers on the full-link perspective and takes experience as the entry point: dig into business, define and decompose problem domains around the indicators, and run major optimizations against real user experience. We introduce the work bottom-up: general network-layer strategy optimization, improving connectivity, the transport layer, and timeout strategies around the request cycle; technical strategy upgrades for user experience, such as gateway and image optimization; and business-scenario transformations, including venue-framework preprocessing and preloading, a lightweight Security Guard practice, and even tiered business experience, e.g., disabling on-device intelligence for the homepage feed on low-end devices. The related practices follow.

(Figure 19 Taobao App full-link optimization technology solution)

1   Request Streamlining and Speeding Up - Minimalistic Calling Practice

Taking the MTOP request as the scenario, the link mainly covers the interaction from MTOP to the network library. Analyzing the full-link thread model shows several points between MTOP initiation and network-layer reception that slow requests down:

  • Multiple data copies: in the existing mechanism the network layer intercepts via hooks and forwards to the network library over NSURLConnection plus the URL Loading System, which involves multiple data copies, and the interception and forwarding are very time-consuming;
  • Frequent thread switching: the thread model is too complex, and a single request completes only after frequent thread switches;
  • Asynchronous turned synchronous: the original path handles tasks with an NSOperationQueue whose underlying queue binds request and response together, so after sending, the operation waits for the response before being released. Each "HTTP Operation" occupies the full I/O of an entire HTTP round trip, defeating request parallelism, and the operation queue fills up and blocks easily.

These problems are most visible when request volume is large and system resources are contended (cold start, with dozens of requests arriving at once).
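The "asynchronous turned synchronous" problem is the key one: when each queue task holds its slot until the response returns, in-flight requests serialize. Decoupling send from receive, e.g. completing a future from the response callback, keeps I/O parallel without a thread per request. A schematic Java sketch (the thread model and simulated network are illustrative, not the network library's real implementation):

```java
import java.util.concurrent.*;

public class AsyncRequest {
    // Decoupled model: send() returns immediately; the future is completed
    // later by the response callback, so no worker thread blocks waiting.
    static CompletableFuture<String> send(String url, ScheduledExecutorService io) {
        CompletableFuture<String> resp = new CompletableFuture<>();
        // Simulated network: the "response" arrives 50 ms later on the I/O thread.
        io.schedule(() -> resp.complete("200 OK for " + url), 50, TimeUnit.MILLISECONDS);
        return resp;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService io = Executors.newScheduledThreadPool(1);
        // Dozens of requests in flight at once, yet only one I/O thread is used;
        // in the blocking model each request would pin a queue slot for its full RT.
        CompletableFuture<?>[] all = new CompletableFuture<?>[20];
        for (int i = 0; i < 20; i++) all[i] = send("/api/" + i, io);
        CompletableFuture.allOf(all).join();
        io.shutdown();
        System.out.println("all 20 responses received");
    }
}
```

The same shape is what "data packet sending and receiving are split and processed" achieves in the transformation below: a long air-interface RT holds only a future, not an executing operation.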

(Figure 20 Before and after thread model optimization - minimalist call)

Transformation: have MTOP call the network library interface directly, yielding a significant performance improvement.

  • Simplified thread model: skip the URL Loading System hook mechanism and streamline the thread switches for sending and receiving data;
  • Avoid weak-network congestion: packet sending and receiving are split and handled separately, so a long air-interface RT no longer limits I/O concurrency;
  • Replace obsolete APIs: retire the old NSURLConnection in favor of direct network library API calls.

Data effect: the optimization is most pronounced where system resources are scarcest, such as on low-end devices.

(Figure 21 Simple call AB optimization range)

2   Weak-network strategy optimization: Android multi-channel network practice

Under a weak Wi-Fi signal or otherwise poor network, repeated retries may not significantly improve the success rate. The Android system provides a capability that lets a device route requests over the cellular interface even while connected to Wi-Fi. The network application layer can use this to reduce errors such as request timeouts and improve the request success rate.

Since Android API level 21, the system has provided a new way to obtain a network object (ConnectivityManager.requestNetwork): even if the device currently routes its data connection over Wi-Fi, an application can use this method to obtain a connected cellular network.

Therefore, when the user's device has both Wi-Fi and cellular connectivity available, different requests can be scheduled onto the Wi-Fi and cellular channels under specific strategies to achieve network acceleration.

Key changes:

  • Solution precondition: check whether the current Wi-Fi environment also supports bringing up a cellular network;
  • Trigger timing: if a request has been outstanding for a certain period without returning data, a retry over the cellular channel is triggered. The original request is not interrupted; whichever channel responds first is used, and the response on the slower channel is cancelled;
  • Time control: the trigger delay is configured per scenario via Orange (a configuration delivery platform) and needs to be adjusted dynamically according to network strength;
  • Product form & compliance: users are shown the prompt "You are using both Wi-Fi and the mobile network to improve your browsing experience. This can be turned off in Settings - General." The prompt is triggered the first time the feature activates after each launch.
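The trigger-timing rule above is essentially a "hedged request": after a delay, fire a backup copy of the request on the second channel and take whichever response arrives first. The sketch below is an illustrative plain-Java version of that race (the real implementation lives in the network library and binds each request to an Android Network object obtained via ConnectivityManager.requestNetwork; the channel suppliers, delays, and response strings here are assumptions):

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

public class DualChannelRace {
    /**
     * Sends the request on the primary channel; if no response has arrived
     * within backupDelayMs, also sends it on the backup channel. The first
     * response wins and the other in-flight request is cancelled.
     */
    static String raceWithBackup(Supplier<String> primaryChannel,
                                 Supplier<String> backupChannel,
                                 long backupDelayMs) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            CompletableFuture<String> primary =
                CompletableFuture.supplyAsync(primaryChannel, pool);
            // The backup fires only after the delay; the primary keeps running.
            CompletableFuture<String> backup = CompletableFuture.supplyAsync(() -> {
                try { Thread.sleep(backupDelayMs); }
                catch (InterruptedException e) { throw new CompletionException(e); }
                return backupChannel.get();
            }, pool);
            // Whichever channel returns first provides the response.
            String winner = (String) CompletableFuture.anyOf(primary, backup).join();
            primary.cancel(true);   // cancelling the winner is a no-op
            backup.cancel(true);
            return winner;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        // Wi-Fi is stalling (500 ms); cellular answers 50 ms after being tried
        // at the 100 ms mark, so the cellular response wins the race.
        String r = raceWithBackup(
            () -> { sleep(500); return "wifi-response"; },
            () -> { sleep(50);  return "cell-response"; },
            100);
        System.out.println(r);
    }
}
```

Note the design choice mirrored from the text: the primary request is never interrupted by the trigger, only cancelled after the other channel has already won.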

(Figure 22 Android multi-channel network capability optimization + user compliance authorization)

Data effect: under fierce competition for network resources, in the Wi-Fi + cellular dual-channel scenario the improvement in long-tail latency and timeout rate is most obvious. In AB data, for the home page API, P99/P999 percentile performance improved by 23%/63% respectively and the error rate dropped by 1.19‰; for home page images, P99/P999 percentile performance improved by 12%/58% respectively and the error rate dropped by 0.41‰.

3   Technical strategy grading: image grading practice

Device performance varies greatly, and business complexity keeps rising. On low-end devices, many businesses cannot deliver the intended experience and instead cause problems such as freezing. In the past, performance was tuned through means such as delaying, parallelizing, and preloading, but these only sidestep the problem: the core link still pays the key call cost. We therefore need to grade the business experience. By processing the business flow in tiers, high-end devices get the most complete and elaborate flow, while low-end devices can still use the core functions smoothly. The ideal is to preserve both user experience and core business metrics; put differently, performance can improve while some non-essential functions are degraded (without hurting core business metrics). The initial plan is in two steps:

  • In the first phase, business grading requires a rich strategy library and judgment conditions. We accumulate these general capabilities in core components to help businesses implement grading quickly.
  • In the second phase, as many businesses adopt the grading capability and a large volume of grading strategies and AB data accumulates, individual grading strategies can be recommended and optimized, allowing similar businesses to reuse them quickly and efficiently.

Traditional CDN adaptation rules dynamically assemble the "optimal" image size based on factors such as the network, view size, and system, reducing network bandwidth and bitmap memory usage and thereby improving image loading. From a device-grading perspective, we make the compression parameters configurable according to the specifications given by UED, extend the original CDN adaptation rules, and apply different image-grading strategies to different device models. This capability further reduces image size and speeds up getting images onto the screen.

(Figure 23 Image equipment classification rules)
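As an illustration of what such a grading rule might look like, the pure function below extends a classic view-size-based CDN rule with a device tier (the URL suffix format, tier thresholds, and quality/scale values are invented for the example; they are not Taobao's actual CDN rule set, whose real values would come from the UED spec via remote configuration):

```java
import java.util.Map;

public class ImageGradingRule {
    enum DeviceTier { HIGH, MID, LOW }

    // Hypothetical per-tier compression quality and downscale factor,
    // e.g. delivered by a configuration platform.
    static final Map<DeviceTier, Integer> QUALITY =
        Map.of(DeviceTier.HIGH, 90, DeviceTier.MID, 75, DeviceTier.LOW, 60);
    static final Map<DeviceTier, Double> SCALE =
        Map.of(DeviceTier.HIGH, 1.0, DeviceTier.MID, 0.75, DeviceTier.LOW, 0.5);

    /** Extends a classic CDN adaptation rule (view size) with a device tier. */
    static String imageUrl(String base, int viewWidthPx, DeviceTier tier) {
        int width = (int) Math.round(viewWidthPx * SCALE.get(tier));
        int quality = QUALITY.get(tier);
        // The "_<width>w_q<quality>.jpg" suffix style is illustrative only.
        return base + "_" + width + "w_q" + quality + ".jpg";
    }

    public static void main(String[] args) {
        // Same 800 px view: a high-end device gets a full-size, high-quality
        // image; a low-end device gets a smaller, more compressed one.
        System.out.println(imageUrl("//cdn.example.com/item/123", 800, DeviceTier.HIGH));
        System.out.println(imageUrl("//cdn.example.com/item/123", 800, DeviceTier.LOW));
    }
}
```

Keeping the rule a pure function of (base URL, view size, tier) makes it easy to A/B test tier tables per business, which is what the second-phase strategy recommendation depends on.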

4   Lightweight link architecture: signature-free security practice

The link that pulls users into the app from external links involves multiple security signatures from startup, through the gateway request, to landing-page loading (the main request is still MTOP). Signing is a CPU-intensive task, and low-end devices show a significant long tail. If pulling up the app takes too long, traffic is lost. In FY22 S1, much performance optimization was done on this link in the Julang business; better performance reduces the bounce rate. At present, latency is dominated by the gateway request, and the security signature accounts for a high share of its cost. We therefore want to allow the signature to be skipped, with each business opting in as appropriate, to increase the traffic value of the inbound link. The change involves coordinated work across MTOP, Aserver (the unified access layer), and the security side:

(Figure 24 Changes in the signature-free security architecture)

  • Gateway protocol upgrade: the protocol now supports signature-free requests and provides a signature-free interface. If a business API is marked signature-free, the corresponding header is carried through to the network library;
  • AMDC scheduling service: for stability, requests are scheduled in the short term to an isolated secure production environment via AMDC (the wireless network policy scheduling service). The AMDC scheduling module therefore decides, based on the marker, whether to return the signature-free VIP to the client. Once the function is stable, traffic will be flexibly scheduled to the main production environment;
  • Signature verification module migration: the security extension capability is currently embedded in the AServer access layer. To reduce O&M cost, it will be migrated wholesale from Aserver to the security side; Aserver will then no longer contain a signature extension module, and the security side will decide whether to enable signature verification and other functions based on API/header features;
  • MTOP signature-free error retry: if a signature-free request fails with an illegal-signature error at the MTOP layer, a downgrade to the old signed link is triggered to protect the user experience.
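The downgrade in the last bullet can be sketched as a simple client-side fallback (illustrative only: the error code, result shape, and function names below are assumptions, not MTOP's real API): try the request signature-free first, and on an illegal-signature failure, retry once over the old signed link.

```java
import java.util.function.Function;

public class SignFreeFallback {
    // Hypothetical gateway result: either a response body or an error code.
    static final class Result {
        final String body;
        final String errorCode;
        Result(String body, String errorCode) { this.body = body; this.errorCode = errorCode; }
        // "FAIL_SYS_ILLEGAL_SIGN" is an assumed error code for the example.
        boolean isIllegalSign() { return "FAIL_SYS_ILLEGAL_SIGN".equals(errorCode); }
    }

    /**
     * Issues the request signature-free; if the gateway rejects it as an
     * illegal signature, downgrades once to the old signed link.
     */
    static Result request(String api,
                          Function<String, Result> signFreeChannel,
                          Function<String, Result> signedChannel) {
        Result first = signFreeChannel.apply(api);
        if (first.isIllegalSign()) {
            // Downgrade path: re-issue over the old signed link so the
            // user-visible request still succeeds.
            return signedChannel.apply(api);
        }
        return first;
    }

    public static void main(String[] args) {
        // Sign-free is rejected, so the request transparently falls back
        // to the signed channel and still returns a body.
        Result r = request("mtop.demo.api",
            a -> new Result(null, "FAIL_SYS_ILLEGAL_SIGN"),
            a -> new Result("{\"data\":1}", null));
        System.out.println(r.body);
    }
}
```

The single-retry shape matters: retrying the signed link at most once bounds the worst case at roughly one extra round trip while keeping the fast path signature-free.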

Summary & Outlook

Summary: facing the existing challenges on the mobile side, this article explained how to build observability by implementing call-chain tracing, standard logging, and scenario-based tracing. On top of the full-link perspective and these new observability capabilities, we built a full-link O&M system and a continuous performance optimization system, filling in the long-missing call-chain tracing capability on the mobile side, enabling rapid problem localization in complex call scenarios, and replacing the old, inefficient process of group-by-group investigation. This begins the shift from process empowerment to technology empowerment. Around this capability, we built full-link Metrics and a full-link performance indicator system, carried out in-depth governance in business scenarios, upgraded platform capabilities, and used data to drive experience improvement and track it over the long term.

Shortcomings: although scenarios in the Taobao App are gradually being connected, we are still far from locating a problem within 15 minutes. Many bottlenecks remain: the success rate of log reporting, the effectiveness of server-side log retrieval, problem-localization efficiency, productized data-quality inspection at the ingestion source, domain experts' understanding of problems and the continuous accumulation of structured knowledge, and ultimately the user experience of the whole product, all of which need continuous optimization.

Outlook: continuing the mobile-native technology philosophy of the Alibaba Mobile Technology Group, we need to do solid work on both technology and experience, go deep into the mobile domain, and face the challenges of multiple development frameworks in the east-west direction and the end-to-end full link in the north-south direction. In the first phase of experience optimization in 2018, we introduced similar ideas and experimented in the request domain; only now have we found a suitable structured theoretical foundation. Through in-depth practice based on the characteristics of mobile devices, we will continue to define and solve problems in this field, hoping to create an observability technology system for the mobile domain and leave behind an architectural foundation.
