Background

Every day, a large number of users around the world share creative videos and stories from their lives on short-video apps. Because the viewer's environment is uncontrollable (weak-network scenarios such as high-speed rail or elevators), playing a video at its original quality may cause stuttering or even playback failure, resulting in a poor viewing experience. To let users with different network conditions watch these videos smoothly, every uploaded video goes through a transcoding process that generates renditions at different quality tiers, so that even viewers with poor network conditions can be served a suitable rendition and enjoy smooth playback.

For video transcoding, the common industry solution is to upload the original video to object storage and trigger the media processing pipeline through events, often using a workflow system to schedule and orchestrate the media processing tasks. The processed video is archived back to object storage and then distributed to viewers through a Content Delivery Network (CDN).

Figure 1: Industry-standard media processing solution on the AWS public cloud (source: https://aws.amazon.com/media-services)

In these public cloud solutions, developers need to integrate various cloud subsystems to cover the full life cycle of video transcoding and to manage virtual machine computing resources. The cognitive and learning cost of all these cloud services is a heavy burden for developers.

At ByteDance, the video architecture team has built an internal multimedia processing PaaS platform after years of technical accumulation. Users upload videos to object storage through the upload system, which triggers the task scheduling system of the media processing platform to dispatch tasks such as transcoding, animated-image generation, and cover generation to a massive computing resource pool; the results are then distributed to viewers through the CDN.

The multimedia processing PaaS platform consists of two subsystems, the workflow system and the computing platform, and provides multi-tenant access to support the video processing needs of the entire ByteDance ecosystem. The computing platform provides a large computing resource pool that encapsulates heterogeneous resources (CPU, GPU), so developers specializing in media processing do not need to deal with computing resource integration and can focus on developing serverless functions with atomic capabilities such as transcoding, cover generation, and animated-image generation. The workflow system provides task scheduling: it defines the order of the media processing tasks to run after a video is uploaded and sends them to the computing platform, which uses the resource pool to complete the work. Together, these two subsystems greatly reduce the burden on developers and speed up feature iteration.
Figure 2: The video architecture team's media processing system

Technical Framework 1.0

For such a large online system, maintaining stability is critical, and any online anomaly must be handled accurately and quickly to reduce the impact on users. Service level indicators (SLIs) are defined for the multimedia processing PaaS platform deployed around the world, service level objectives (SLOs) are defined on top of them, and appropriate alarm rules are configured for the SLOs.

Figure 3: Emergency response process

As shown in Figure 3, when a service exception occurs, for example when the request success rate drops below 99.9% within 5 minutes, an alarm is triggered and a webhook message is sent to the emergency response center platform developed by the team. The platform creates an alarm-handling group for the current on-call personnel and aggregates all subsequent related alarm information into that group, and the SRE then steps in to handle the problem. After the group is created, the current process relies mainly on the SRE to manually collect the abnormal metrics related to the incident. Without automated tools that summarize this information in advance, sorting out the abnormal metrics can take a long time before any mitigation can even begin.

Current pain points

Large number of microservices and dependencies

Most services in the team are built on a microservice architecture, and as an internal PaaS platform the system must serve traffic globally, so both the services and their infrastructure are deployed across multiple regions and data centers. There are currently 30 microservices for media processing task scheduling in a single region, and the related infrastructure, such as databases, caches, distributed locks, and message queues, must be monitored as well.

Figure 4: Monitoring dashboards for a large number of microservices

Even with a global-view monitoring dashboard like the one above, quickly locating the anomaly in such a large service topology during an incident remains challenging.

Different metrics have different baselines

Incidents usually fall into two categories: infrastructure anomalies and traffic bursts.

The first example is database infrastructure. Query latency under normal operation sits at a fixed level, say 10ms. A latency increase can be either an overall increase across the database (likely due to high load) or an increase on only some instances (likely due to network-segment jitter).

Figure 5: Abnormal database latency metric

The second example is burst traffic. As an internal PaaS platform serving many teams across ByteDance, the system is multi-tenant by nature. Once the number of tenants reaches a certain scale, it is no longer practical to know when each tenant runs a campaign or generates burst traffic. In the example below, the metrics follow a regular daily cycle, but the purple series in the red box is clearly higher than at the same time yesterday, i.e., a day-over-day increase.
Figure 6: Traffic metric rising day over day

Troubleshooting involves different internal systems

The third example involves errors in dependent systems. In the figure below, the number of errors in the red box is clearly much higher than in the preceding half hour, i.e., a period-over-period increase. In this case, the SRE has to go to the internal PaaS platform to look up the detailed error codes and the corresponding error logs.

Figure 7: Dependency system errors rising period over period

Target

In the three situations above, with the existing monitoring and troubleshooting methods, handling an incident means comparing multiple dashboards, repeatedly switching query time ranges on a dashboard to judge whether a metric is normal, and, worse, opening other internal systems to search logs. All of this greatly prolongs the time needed to locate the problem and decide on an emergency response. If this troubleshooting work can be automated to a reasonable extent, SREs on duty can work through the on-call SOP (standard operating procedure) much faster, and on-call duty becomes far less painful.

Quantitative indicator

Mean time to repair (MTTR) covers fault identification, fault location and mitigation (Know), and fault recovery (Fix). The main purpose of introducing automated troubleshooting tools is to reduce the time spent on fault location and mitigation. Since the system already records alarm occurrence and recovery times for SLO targets, MTTR was chosen as the quantitative indicator for measuring the results of introducing this automation.

Figure 8: The span measured by mean time to repair on an incident timeline

Architecture

Technical Architecture 2.0

Figure 9: Improved emergency response process

The Emergency Center platform developed by the video architecture stability team has a built-in integration called the SOP Engine. It provides an SDK that lets SRE members quickly develop common serverless functions such as metrics query, metrics analysis, and HTTP request initiation, and it supports yaml-defined state machine workflows to orchestrate customized alarm diagnosis or emergency response plans.

Automated workflow design

The entire automated alarm-handling process can be summarized into the following steps: query the relevant metrics, analyze them against thresholds or historical baselines, aggregate the results and assemble a chatbot card, and push the card into the alarm-handling group (see Figure 10 and the workflow sketch below).
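As a rough illustration only (the actual SOP Engine schema is internal, so every field, state, and trigger name below is an assumption), a state-machine workflow of this kind might be declared along these lines:

```yaml
# Hypothetical sketch of an SOP Engine-style state machine; keys and values are
# illustrative, not the real internal schema.
name: alarm-diagnosis-example
trigger: slo-alarm-webhook          # fired by the alarm webhook described above
states:
  - name: MetricQuery               # pull the relevant time series
    type: function
    next: MetricAnalysis
  - name: MetricAnalysis            # threshold / day-over-day analysis
    type: function
    next: GroupResult
  - name: GroupResult               # JavaScript step: aggregate results and build the chatbot card
    type: function
    next: NotifyIncidentGroup
  - name: NotifyIncidentGroup       # post the card into the alarm-handling group
    type: function
    end: true
```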
Figure 10: Internal flow of the automated workflow

Metrics Query Function

The metrics query function is designed with an API like the example below and connects to ByteDance's internal metrics platform, which is built on OpenTSDB. It exposes a set of query options that greatly improve the reusability of the function.
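The request example from the original post is not reproduced here, so the following is only a sketch of what an OpenTSDB-style query payload could look like; every field name and value is an assumption rather than the real internal API.

```json
{
  "metric": "mysql.query.latency",
  "aggregator": "avg",
  "start_time": "now-30m",
  "end_time": "now",
  "tags": {
    "cluster": "media-task-scheduler",
    "host": "*"
  },
  "downsample": "1m-avg"
}
```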
Metrics Analysis Function

The metrics analysis function is designed with an API like the example below. It lets you choose which summary value of the queried series (maximum, minimum, average, or sum) to analyze, and supports threshold checks, day-over-day comparison, and even drill-down analysis on a tag of the metric. The comparison operator and the threshold can be adjusted freely, which makes later changes to thresholds or analysis logic very convenient.

JavaScript Execute Function

In the metrics aggregation and chatbot card assembly steps, different metrics need different aggregation rules and different card layouts. Developing a separate function for each case would hurt reusability and development efficiency, so the github.com/rogchap/v8go package was used to build a function that dynamically executes JavaScript against the input JSON data. JavaScript is a natural fit for processing JSON: as shown below, array data can be grouped, sorted, reversed, and mapped with native JavaScript, which is very convenient.
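The original snippet is not reproduced here; the following is a small stand-in example (the input shape and field names are assumptions) showing the kind of grouping and sorting such a step might run on the analysis output:

```javascript
// Stand-in example; `input` and its fields are assumptions, not the real payload shape.
// The v8go-hosted function would evaluate a script like this against the input JSON.
function handle(input) {
  // Group abnormal data points by host, keeping the worst value per host.
  const byHost = {};
  for (const point of input.abnormalPoints) {
    const prev = byHost[point.host];
    if (!prev || point.value > prev.value) {
      byHost[point.host] = point;
    }
  }
  // Sort hosts by severity (descending) and map them into chatbot card lines.
  return Object.values(byHost)
    .sort((a, b) => b.value - a.value)
    .map(p => `${p.host}: latency ${p.value}s exceeds threshold`);
}

// Example input:
// handle({ abnormalPoints: [{ host: "10.0.0.1", value: 20.6 }, { host: "10.0.0.2", value: 1.2 }] })
```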
Real-world example: MySQL latency diagnosis

The following is an actual anomaly diagnosis that combines the three functions above, using MySQL latency as the example. Under normal conditions most MySQL latency stays within 1s, but the latency of one host suddenly rises to 20.6s. This is exactly the kind of anomaly that needs to be surfaced proactively during emergency response, since it may be the cause of an incident.

Figure 11: Latency anomaly on a single MySQL instance

The first step of the workflow, MetricQuery, pulls the MySQL latency series for the relevant time window (its yaml definition is sketched below).
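A rough sketch of what this step could look like in the workflow yaml; the keys and the metric name are assumptions, not the real configuration:

```yaml
# Hypothetical MetricQuery step; keys and values are illustrative only.
MetricQuery:
  type: metrics-query
  metric: mysql.query.latency       # assumed metric name
  aggregator: max
  start: now-30m
  end: now
  group_by:
    - host                          # keep per-host series so single-instance spikes are visible
  output: latency_series
```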
The second step, MetricAnalysis, checks each host's latency against a threshold, with a drill-down on the host tag so that a single abnormal instance stands out (sketched below).
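Again as a hedged sketch with assumed field names, the analysis step might be configured along these lines:

```yaml
# Hypothetical MetricAnalysis step; keys, operator names, and the threshold are illustrative.
MetricAnalysis:
  type: metrics-analysis
  input: latency_series
  summary: max                      # analyze the maximum of each series
  drill_down_tag: host              # break the result down per host
  compare:
    operator: ">"
    threshold: 1                    # seconds; normal latency stays within 1s
  output: abnormal_hosts
```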
The third step, GroupResult, uses the JavaScript execute function to group and filter the analysis output. After the three steps run, the result is a structured summary in which the abnormal metrics have already been grouped and filtered and the diagnostic conclusions are attached, ready to be used as the input of the chatbot message.

Diagnoses of other metrics, such as container CPU and memory or the application's own metrics, follow the same workflow pattern; only the queried metrics and the diagnosis thresholds need to be replaced.

Benefits

With these automated analysis tools, the video architecture team has benefited greatly in its daily emergency response. In one incident, the network of an IDC failed: as shown in the figure below, the errors and latency of one particular IP were exceptionally high. The diagnosis triggered automatically in the alarm-handling group surfaced the anomaly directly, so the abnormal instances could be handled immediately.

Figure 12: Automated tooling posts a summary of abnormal metrics in the incident group

After this automated process was fully rolled out, the MTTR reduction is shown in the figure below. From the beginning of October 2022 to the end of January 2023, MTTR was calculated every two weeks: it dropped from the initial 70 minutes to the current 17 minutes, an overall decrease of about 75.7%.

Summary

Faced with such a large number of microservices and such a variety of infrastructure dependencies, making fast decisions and taking emergency action during an incident requires more than reasonably complete monitoring: emergency response records should be reviewed regularly to identify high-frequency incidents and distill them into an automated troubleshooting process, which is what shortens MTTR.