Tingyun's Liao Xiongjie on full-stack APM: building an all-round monitoring system from end to cloud

[51CTO.com original article] The WOTA Global Architecture and Operations Technology Summit, hosted by 51CTO, was held at the Renaissance Hotel in Beijing on April 14-15, 2017. The summit ran 15 forums on cutting-edge technology. Over the two days, more than 60 technical experts from leading domestic and international Internet companies such as Google, LinkedIn, Airbnb, Baidu, Alibaba, and Tencent shared more than 50 talks on practical architecture experience and success cases.

On the morning of April 14, at the main venue of WOTA2017, Liao Xiongjie, Vice President of R&D at Tingyun, gave a talk titled "Full-stack APM: Building an all-round monitoring system from end to cloud". The following is a transcript of the talk.

Liao Xiongjie, Vice President of R&D at Tingyun

I am very happy to meet you here. Today I will cover five aspects of APM tools and how to use them. In operations scenarios, and in scenarios where users interact with the product, you may run into performance problems, and they may appear while the product is already in operation. These problems matter because they directly affect the end user's experience.

A large application involves many links. Access starts from the end user's app or PC browser, and problems can occur on the end user's side. Between the end user and our servers sit the network, the CDN, and the cloud. On the server side, servers talk to other servers and various components interact.

Many companies are now adopting microservices and other service-oriented architectures. With such an architecture the number of layers keeps growing, and the volume, difficulty, and complexity of monitoring for operations grow with it.

So when a problem occurs, how do we monitor the whole chain? We need to monitor the network side, which involves the CDN or the local carrier, and the problem may also lie in the user's own network environment. Problems may also occur in the various components inside the server. Every problem has to be located: what exactly went wrong, and what needs to be optimized or fixed.

Just now, Zhang Xia from AWS introduced the AWS cloud. In recent years the cloud has been a great blessing for the operations community. Applications are now widely deployed in the cloud. It is fair to say that the cloud has freed operations staff from a large part of their work and from the cold machine room. These days we rarely have to haul servers around the streets every few days.

Today we introduce the concept of APM: for operations, how to understand each link of an application and how to troubleshoot when a problem occurs. At first glance everyone assumes this is the R&D team's job. In fact it is a major headache for operations staff, because when a problem occurs the operations team has to step in first. You need to determine whether the problem lies in the underlying environment or in the application itself. Only once responsibility is clear can you hand the problem to R&D or keep it within operations.

I will first introduce the major functional dimensions of APM, which are also its components, and then several commonly used ways of implementing APM.

The first is DEM (digital experience monitoring). Whether it runs outside or inside your network, it monitors the availability and performance of the application itself. This is the most intuitive kind of monitoring: when users have a problem, it shows up here first, so we monitor this state continuously and then check whether there are deeper problems. DEM is a fairly large functional dimension, and its tool set falls into two types. One is RUM, real user monitoring, which measures performance for real users. It is usually based on the web and on mobile, so the focus is the web browser or the mobile app. It covers the process from the moment a real user sends a request to the application. The data is collected through instrumentation: in the browser by embedding a script, and on mobile by a more complex instrumentation method. That instrumentation has to be automatic, because it cannot be left to R&D to insert by hand. Both kinds of instrumentation should be fully automated.

Corresponding to real user monitoring there is a second method, STM, simulated (synthetic) transaction monitoring. How do the two differ? Take the web browser as an example and look at the weaknesses of real user monitoring. For performance, monitoring from the real user's point of view is certainly the most reasonable, since users are what we care about most when problems occur. But this method has a fatal flaw, because monitoring has to cover availability as well as performance.

On the web, most current approaches embed JS code, which means the browser must at least request and load the page normally before your JS can run. If there is a network failure, or simply a JS error on the page, the monitoring script cannot load or run at all. That is a big problem: exactly when something is wrong, this method struggles to monitor it. This is where STM has a clear advantage. STM simulates users. You deploy robot nodes in a targeted way, in data centers or on end-user machines. They are not real users, but they access the application the same way an end user does. You can drive the browser to capture network events and performance metrics. So DEM has two important monitoring methods: RUM and STM.
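A simulated probe of this kind can be sketched in a few lines. The following is a minimal illustration in Python, not Tingyun's implementation: it issues one plain-HTTP GET and times name resolution, connection setup, and the wait for the first response byte. The function name `probe` and its result keys are invented for this example, and a production STM node would drive a real browser rather than a raw socket.

```python
import socket
import time


def probe(host, port=80, path="/", timeout=5.0):
    """Time one plain-HTTP GET the way a simple probe node might.

    Hypothetical helper for illustration; durations are in milliseconds.
    """
    t_start = time.perf_counter()
    # Name resolution (returns immediately for literal IP addresses).
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t_dns = time.perf_counter()

    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)  # TCP handshake
        t_connect = time.perf_counter()

        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        first = sock.recv(1)  # blocks until the first response byte arrives
        t_first_byte = time.perf_counter()

        # Drain the rest of the response so the total time is meaningful.
        body = bytearray(first)
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            body.extend(chunk)
    t_end = time.perf_counter()

    status_line = bytes(body).split(b"\r\n", 1)[0].decode("latin-1")
    return {
        "status_line": status_line,
        "dns_ms": (t_dns - t_start) * 1000,
        "connect_ms": (t_connect - t_dns) * 1000,
        "first_byte_ms": (t_first_byte - t_connect) * 1000,
        "total_ms": (t_end - t_start) * 1000,
    }
```

A failed resolution or connection raises an exception, which a probe node would record as an availability failure. That is exactly the case an embedded page script cannot report.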

The second major function of APM is ADTD (application discovery, tracing, and diagnostics). What does that mean? As I said, both real user monitoring and simulated monitoring observe the application from the end user's perspective. What happens inside the application, such as access to a database or to MQ, cannot be monitored by DEM, because from the user's end you can see as far as the network at most, not inside the server. So a second functional category is needed, which is ADTD. It has to describe the inside of the server side, especially in a microservice architecture: what the call relationship between service A and service B is, whether the call has a problem, and if so whether the problem is in service A or service B. All of these traces have to be recorded. Without that data there is nothing to monitor.

Another important feature of this second area is that, besides describing the relationships between services and their performance, it lets you drill down once a problem is found. When an end user's request comes into the application, you can see it arrive, and see its interaction with other backend services and with the database, cache, and MQ. Every component should be reachable by drill-down. Otherwise all you see is that a service has a problem without knowing where, so deep drill-down is necessary. When there is a problem, we can analyze down to the line of code and find out which line is at fault. Inside the application, code instrumentation gives us information about the code, so when a problem occurs we can pinpoint it at line level. This is very useful for operations, because you can locate the problem quickly without understanding the development details of every application. All of this is provided as tooling.

The previous part was mainly about where the data comes from and how we capture it. With this data, we can use machine learning and statistical inference to find the source, or root cause, of abnormal performance. Alarms are frequent. If service A, service B, and service C all raise alarms, where is the problem? At that point we need to trace the root cause. We can analyze the data with statistical learning and machine learning methods and draw conclusions. This is a matter of later-stage data processing.
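The talk names statistical and machine learning methods for root cause analysis but does not give an algorithm. As a much simpler stand-in, the sketch below uses only the call topology: among the alarming services, it keeps those that do not depend, directly or indirectly, on another alarming service. The function name and data shapes are assumptions made for this example.

```python
def root_cause_candidates(calls, alarming):
    """Return alarming services that do not depend on another alarming service.

    calls:    dict mapping a service to the services it calls (its dependencies)
    alarming: set of services currently raising alarms
    """
    candidates = []
    for service in sorted(alarming):
        # Walk everything this service depends on, directly or indirectly.
        seen, stack = set(), list(calls.get(service, ()))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            stack.extend(calls.get(dep, ()))
        seen.discard(service)  # ignore cycles that lead back to the service
        if not (seen & alarming):
            candidates.append(service)
    return candidates


calls = {"A": ["B"], "B": ["C"], "C": []}
print(root_cause_candidates(calls, {"A", "B", "C"}))  # ['C']
```

With A calling B and B calling C, alarms on all three point at C, because A and B may only be suffering from a slow dependency.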

I have just described the main functional dimensions of APM and how they are implemented. Now let's look at what we can do at each point. The app may be a native app, or it may be built with H5 running inside the app. This is the territory of RUM, which is divided into two parts, the app and the browser. These two techniques monitor fully from the device end and observe network performance from the front end. Front-end performance is covered too: if a script has trouble executing you can at least locate roughly where, and if browser rendering has a problem it can be monitored at the front end.

The middle part is the network layer, which is where STM works. It can uncover network problems to the greatest extent. Why is the network covered from this end? Because simulation means the probes can be deployed anywhere, for example in a data center or on end-user machines. You can allocate monitoring resources between data centers and end users by the regions and carriers you care about, in the proportions you choose. A probe does not have to behave like a regular user, so it can use the time to do more. STM has another advantage. In essence it usually works through a specially developed agent, which can get more out of the browser. JS running in a real user's browser can do much less: security or technical restrictions may keep you from getting much of the data you want. With STM you can capture data in as much detail as possible and analyze the problem more thoroughly. Through STM you can tell whether the problem lies in the CDN, the local network, the local carrier's network, or the backbone. Because you can deploy nodes in different locations, you can separate the problem along many dimensions.

The servers inside, including applications running on servers in the cloud, are where ADTD works. Applications can be monitored. In principle, any access can be monitored from the point where the application initiates it. Database access, for instance, is monitored at JDBC: monitoring code is embedded there, and the database being accessed shows up. All of this instrumentation uses one unified technique. That is unlike specialized monitoring, such as database monitoring, where a separate tool is deployed for each protocol and each kind of server. In APM we generally implement it in a unified way, because when an application has a problem, what we ultimately care about is whether something went wrong when the application accessed a particular component, and if so, locating it.
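The idea of watching database access from the application side, at the driver call, can be illustrated without Java. The sketch below is a Python analogue and not the JDBC instrumentation the talk refers to: it subclasses the standard library's `sqlite3.Connection` so that every `execute()` call records its SQL text and duration. `TimedConnection` and `QUERIES` are names invented for this example.

```python
import sqlite3
import time

QUERIES = []  # stand-in for the data an agent would report


class TimedConnection(sqlite3.Connection):
    """Connection that records the SQL text and duration of every execute()."""

    def execute(self, sql, parameters=()):
        start = time.perf_counter()
        try:
            return super().execute(sql, parameters)
        finally:
            # Runs whether the statement succeeded or raised.
            QUERIES.append({
                "sql": sql,
                "ms": (time.perf_counter() - start) * 1000,
            })


conn = sqlite3.connect(":memory:", factory=TimedConnection)
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "alice"))
rows = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchall()
print(rows, len(QUERIES))
```

Only calls made through `Connection.execute` are captured here. A real agent would hook every entry point of the driver, and would do so without the application having to choose a special connection class.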

This is a basic topology diagram. The first view is an overview. The second is the real-user view. This is an iOS application, and you can see each of its network requests, plotted as a curve over time with the access time marked. We analyze a large amount of real user data and present it graphically. This follows a basic principle of operations: all monitoring data should be measured and then visualized, so that it becomes a tool.

This is the real user experience. Here you can see the network slices, that is, what the network time consists of. You probably care most about how long DNS resolution takes and how long it takes to establish a connection. Then there is the time to first packet, which is the time the server takes to respond. From these you can basically tell whether it is a network problem or a server problem, and if it is a server problem you can bring in other techniques. For example, if the network slices from STM show that DNS resolution and connection setup are normal, you can conclude that the request was blocked inside the server for some time before the first packet was sent to the client. From sending the complete request to receiving its response, the different network stages can be sliced in great detail. That is the network part.
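The reasoning above, normal DNS and connection times plus a long wait for the first packet pointing at the server, can be written as a small rule. This is a sketch under stated assumptions: the function name, the dictionary keys, and both thresholds are arbitrary choices for illustration. Note that the first-packet wait also contains one network round trip, so this is only a first-pass split.

```python
def diagnose(slices, network_limit_ms=200.0, server_limit_ms=500.0):
    """Classify a slow request from its network slices (durations in ms).

    slices needs the keys dns, connect, and first_byte, where first_byte is
    the wait between sending the request and receiving the first response
    byte. The two limits are arbitrary example thresholds.
    """
    network_ms = slices["dns"] + slices["connect"]
    if network_ms > network_limit_ms:
        return "network"
    if slices["first_byte"] > server_limit_ms:
        return "server"
    return "ok"


print(diagnose({"dns": 12, "connect": 30, "first_byte": 1800}))  # server
print(diagnose({"dns": 900, "connect": 40, "first_byte": 120}))  # network
```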

I just talked about ADTD. What can we do in this area? The diagram shows backend access to different services, and the interactions between services and between services and databases, MQ, and so on, all discovered automatically as a topology. Operations teams like this diagram a lot. Architectures keep getting more complex, and as an application grows its backend architecture becomes hard to keep track of: how applications interact with each other, and how applications interact with the components they depend on. For each service and each component, the call count, throughput, and error rate can all be displayed intuitively.
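One way to picture automatic topology discovery is as an aggregation over raw call records: every observed caller and callee pair becomes an edge, and the per-edge metrics fall out of the same pass. The sketch below assumes a record format of `(caller, callee, duration_ms, ok)` and a function name chosen for this example; neither comes from the talk.

```python
from collections import defaultdict


def build_topology(calls):
    """Aggregate raw call records into topology edges with basic metrics.

    calls: iterable of (caller, callee, duration_ms, ok) tuples.
    Returns {(caller, callee): {"count", "error_rate", "avg_ms"}}.
    """
    totals = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for caller, callee, duration_ms, ok in calls:
        edge = totals[(caller, callee)]
        edge["count"] += 1
        edge["errors"] += 0 if ok else 1
        edge["total_ms"] += duration_ms
    return {
        key: {
            "count": e["count"],
            "error_rate": e["errors"] / e["count"],
            "avg_ms": e["total_ms"] / e["count"],
        }
        for key, e in totals.items()
    }


records = [
    ("web", "orders", 40.0, True),
    ("web", "orders", 60.0, False),
    ("orders", "mysql", 5.0, True),
]
for edge, stats in build_topology(records).items():
    print(edge, stats)
```

Dividing each edge's `count` by the length of the collection window gives its throughput.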

The second is what operations and R&D care about more. When a problem occurs, they want to know which line of code is responsible. Once the problem is located, we can show the code call stack, and other information, such as the SQL statement if it is a SQL call, is captured automatically to help developers analyze further.

For example, here we show the different components called and the time each takes. The chart on the left shows the number of calls to each component and each component's share of the total call time.

Let's briefly summarize the steps of full-stack APM. For real user performance we use DEM, mainly RUM. For network slicing we mainly use STM within DEM, the simulated monitoring method, which gives the most detailed network slices. Another option is NPM, which I did not cover. Operations teams may have experience with it: with NPM you take traffic from the switches in your data center, analyze every packet with special software, and derive performance and the relationships between packets from the traffic. But it is more limited. By the time you capture traffic at the server side, a lot of information may already be lost. You have only the packet data, and what you can analyze from it is relatively limited.

The backend application's logical topology, including every component in it and its performance, is monitored through ADTD. That includes code-level monitoring of the application process, the number of requests of each kind, and their average time.

Now for full-stack APM. I believe everyone in operations is a bit obsessive. We have talked about many monitoring methods; can we string them together into one-stop monitoring? Take the path from real users to servers to backend components. When a real user hits a problem, can we troubleshoot step by step from the real user all the way to the back end, and determine whether in the end it is a network problem or a server problem, and if it is on the server side, which component is at fault? If a call to some backend service is what slows the front-end response, can that be exposed in one place? Together with the line-level code analysis mentioned earlier, these methods can be combined.

I just talked about RUM for the web. How do we trace from the browser to the server, including the relationship between their performance? We monitor the browser first; I will briefly describe how later. Suppose we observe that a request's response time is relatively long.

Look at the following chart, which breaks the time down into server response time, the network layer, and front-end rendering. We display server time as a separate metric, so in this chart you can see whether the problem is on the server or in the front end and network.

We can drill directly into the associated backend application that has the problem. This is the corresponding request as it arrived at the server. Clicking on it shows a component, a call to another backend service, whose response time is relatively high. We can drill all the way through in one go.

Since most of ADTD has already been introduced, there is no need to go over the backend in detail. Once we are at the server backend, we can drill down to find which component is responsible. This is a detailed analysis in the browser, showing the response time of each element on the page. We see that one element takes a long time, so we start from the element level and drill back for that element. If the request is slow, for example, its backend may be a different application. Can we drill from here into that backend application?

After drilling into the backend application, we can analyze it with ADTD. For example, we see that it requests another URL on the backend, and that request has a problem: its response time is relatively long. Looking further, we can see which method and which line of code is at fault.

Let me briefly describe how this is implemented. It is actually quite simple. Both the browser and the server need to be instrumented, and in both cases the code is embedded automatically. After that, what we need is to carry correlation information with the request from the browser to the server, and with the response from the server back to the browser. In the browser we can modify Ajax requests directly, but we cannot change the HTTP headers of the main page request. Is there a way around this? The server side is instrumented as well, and there we can intercept the JSP or PHP compilation process and write some correlating information directly into the output, for example by generating a value, putting it in the page, and having the browser bring it back. There is always some technical means to do this, so we have a way to associate the two sides.
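The "generate something, put it in the page, and bring it back" step can be sketched as follows. This is a minimal illustration under assumptions, not the mechanism of any particular product: the server writes a generated trace ID into the page, and the collector later joins the browser's measurement to the server record through that ID. The function names, the `apm-trace-id` meta tag, and the in-memory `traces` dictionary are all hypothetical.

```python
import uuid


def render_page(html, traces, server_ms):
    """Server side: tag the page with a trace ID and remember the server time."""
    trace_id = uuid.uuid4().hex
    traces[trace_id] = {"server_ms": server_ms}
    tag = f'<meta name="apm-trace-id" content="{trace_id}">'
    # Inject just before </head> so the browser script can read it back.
    return html.replace("</head>", tag + "</head>", 1), trace_id


def receive_beacon(traces, trace_id, page_load_ms):
    """Collector side: join the browser's measurement with the server record."""
    record = traces.get(trace_id)
    if record is None:
        return None  # unknown or expired trace ID
    record["page_load_ms"] = page_load_ms
    record["non_server_ms"] = page_load_ms - record["server_ms"]
    return record


traces = {}
page, tid = render_page(
    "<html><head></head><body>hi</body></html>", traces, server_ms=120.0
)
print(receive_beacon(traces, tid, page_load_ms=900.0))
```

Because the ID travels inside the page body, this works for the main page request, where the browser script has no control over HTTP headers.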

Java code can be modified automatically. We simply record the time before and after a function and report it. When an exception occurs, it can also be caught and sent to the server. The same code interception covers database access: the application ultimately accesses the database by calling functions in an API, so those are the functions we need to intercept.
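The "time before and after a function" idea looks like this in miniature. The talk describes automatic modification of Java code; the sketch below is a Python analogue using a decorator, which has to be applied explicitly and is therefore only an illustration of the measurement, not of automatic instrumentation. `traced`, `REPORTED`, and `load_user` are names invented for this example.

```python
import functools
import time

REPORTED = []  # stand-in for sending data to a monitoring server


def traced(func):
    """Record how long func takes and whether it raised."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise  # the application still sees the original exception
        finally:
            REPORTED.append({
                "name": func.__qualname__,
                "ms": (time.perf_counter() - start) * 1000,
                "error": error,
            })
    return wrapper


@traced
def load_user(user_id):
    if user_id < 0:
        raise ValueError("bad id")
    return {"id": user_id}
```

Calling `load_user(1)` appends one record with `error` set to `None`; calling `load_user(-1)` appends a record carrying the exception text and then re-raises, so the application's behavior is unchanged.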

The browser is even simpler. I believe everyone has seen similar JS snippets; ad analytics and user analytics use them, and many websites have them. For APM we want performance data. Long ago we used JS directly, but much of the data could not be obtained, because the browser did not expose it through a JS API. Around 2011 and 2012 the W3C opened this up with two standards, and most mainstream browsers implemented them. The implementation is relatively simple. There is a Navigation Timing interface that records when each phase starts and ends, so parsing time, rendering time, and connection setup time can all be obtained. Once the code is injected, you get the front-end network data you want, along with front-end parsing and performance data. After some simple analysis of it, you have a monitoring view.
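The "simple analysis" of Navigation Timing data amounts to subtracting timestamps. The sketch below assumes the injected script has uploaded the fields of the browser's `performance.timing` object as a dictionary; the attribute names are those of the W3C Navigation Timing interface, while the function name, the phase names, and the sample values are made up for illustration.

```python
def navigation_breakdown(t):
    """Split one page load into phases, in milliseconds.

    t is a dict of Navigation Timing timestamps (milliseconds since the
    epoch) as a browser script might upload them.
    """
    return {
        "dns": t["domainLookupEnd"] - t["domainLookupStart"],
        "connect": t["connectEnd"] - t["connectStart"],
        "first_byte": t["responseStart"] - t["requestStart"],
        "download": t["responseEnd"] - t["responseStart"],
        "dom_processing": t["domComplete"] - t["domLoading"],
        "total": t["loadEventEnd"] - t["navigationStart"],
    }


sample = {
    "navigationStart": 1000, "domainLookupStart": 1005, "domainLookupEnd": 1025,
    "connectStart": 1025, "connectEnd": 1065, "requestStart": 1066,
    "responseStart": 1366, "responseEnd": 1400, "domLoading": 1370,
    "domComplete": 1900, "loadEventEnd": 1950,
}
print(navigation_breakdown(sample))
```

In this sample the 300 ms wait for the first byte and the 530 ms of DOM processing dominate the 950 ms total, which already separates server time from front-end time.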

That is roughly how one-stop APM tracing from browser to server works. Similar methods exist for apps: code is embedded and monitoring data is collected. The technical means are there.

This also covers the backend: between services, and between services and databases, but mainly between services. We call it cross-application tracing, which includes tracing between services; API microservices probably use it the most. With all this data we can trace the call chain: every request from service A to service B and service C can be described. And when several services alarm at the same time, we can use this data to analyze the root cause and find out what triggered the problem.

In this talk I introduced the usage scenarios of APM, the main tools in the APM suite, and several of the main ways they are implemented. That is the end of my talk. Thank you.
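Call chain tracing can be illustrated by rebuilding a tree from span records, where each record names its parent. This is a sketch, not a description of any specific tracing format: the field names `span_id`, `parent_id`, `service`, and `ms`, and the function name `call_chain`, are assumptions made for the example.

```python
def call_chain(spans):
    """Return an indented text view of one trace.

    spans: list of dicts with span_id, parent_id (None for the root),
    service, and ms.
    """
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span['service']} ({span['ms']} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


spans = [
    {"span_id": 1, "parent_id": None, "service": "A", "ms": 120},
    {"span_id": 2, "parent_id": 1, "service": "B", "ms": 80},
    {"span_id": 3, "parent_id": 1, "service": "C", "ms": 10},
    {"span_id": 4, "parent_id": 2, "service": "mysql", "ms": 60},
]
print("\n".join(call_chain(spans)))
```

Here most of A's 120 ms is spent in B, and most of B's time in its database call, which is the kind of drill-down the call chain is meant to support.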

51CTO reporters will continue to bring you exciting reports from the WOTA2017 Global Operation and Architecture Technology Summit, so stay tuned!

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
