Automation practice of emergency response in short video media processing system

Background

Every day, a large number of users around the world share creative videos and stories from their lives on short video apps.

Since the viewing environment is beyond our control (weak-network scenarios such as high-speed rail or elevators), playing a video at its original quality may cause stuttering or even playback failure, resulting in a poor viewing experience. To let users with different network conditions watch these videos smoothly, every published video goes through a transcoding process that generates renditions at different quality tiers. Even when a user is on a poor network, a suitable rendition can be served, giving them a smooth viewing experience.

For video transcoding, the common industry solution is to upload the original video to object storage and then trigger the media processing pipeline through events, often using a workflow system to schedule and orchestrate the media processing tasks. The processed video is archived back to object storage and then distributed to viewers through a Content Delivery Network (CDN).

Figure 1: Industry-standard solutions for media processing systems on AWS public cloud

[Source] https://aws.amazon.com/media-services

In these common public cloud solutions, developers need to integrate various cloud subsystems to cover the full video transcoding life cycle and to manage virtual machine computing resources themselves. The cognitive and learning cost of the many cloud services involved places a heavy burden on developers.

At ByteDance, after years of technical accumulation in the video domain, the video architecture team has built an internal multimedia processing PaaS platform. Users upload videos to object storage through the upload system, which triggers the platform's task scheduling system to dispatch tasks such as transcoding, animated thumbnail generation, and cover generation to a massive computing resource pool; the results are then distributed to viewers through the content delivery network. The multimedia processing PaaS platform consists of two main subsystems, the workflow system and the computing platform, and provides multi-tenant access to support the video processing needs of the entire ByteDance ecosystem.

The computing platform provides a large resource pool that encapsulates heterogeneous resources (CPU, GPU), so the team's media processing developers do not need to deal with computing resource integration and can focus on developing serverless functions with atomic capabilities such as transcoding, cover generation, and animated thumbnail generation. The workflow system provides task scheduling: it defines the order of the media processing tasks to run after a video is uploaded and dispatches them to the computing platform, which uses the resource pool to complete the work.

The basic capabilities provided by these two subsystems greatly reduce the burden on developers and speed up feature iteration.

Figure 2: Video architecture team media processing system solution

Technical Architecture 1.0

For such a large online system, it is particularly important to maintain stability and to handle any online anomaly quickly and accurately to reduce the impact on users. Service level indicators (SLIs) are defined for the multimedia processing PaaS platform deployed around the world, service level objectives (SLOs) are derived from them, and appropriate alerting rules are configured for the SLOs.

Figure 3: Emergency response process

As shown in Figure 3, when a service anomaly occurs, for example when the request success rate drops below 99.9% within 5 minutes, an alert is triggered and a webhook message is sent to the emergency response center platform developed by the team. The platform creates an alert handling group for the current on-call personnel and aggregates all subsequent related alerts into the group, and SRE then steps in to handle the problem. After the group is created, the process relies mainly on SREs manually collecting the abnormal metrics related to the incident. Without automated tools that summarize this information up front, the overall incident handling process can spend a long time just sorting out the abnormal metrics before any mitigation can be performed.
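
As a rough illustration of the trigger side (a sketch only: the payload fields, service name, and webhook URL below are assumptions, not the platform's actual schema), the alert boils down to a windowed success-rate check that posts to the Emergency Center webhook when the objective is violated:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// alertPayload is a hypothetical webhook body; the real platform defines its own schema.
type alertPayload struct {
    Service   string    `json:"service"`
    Region    string    `json:"region"`
    SLO       string    `json:"slo"`
    Value     float64   `json:"value"`
    Objective float64   `json:"objective"`
    FiredAt   time.Time `json:"fired_at"`
}

// checkAndFire posts an alert when the 5-minute request success rate drops below 99.9%.
func checkAndFire(successRate float64, webhookURL string) error {
    const objective = 0.999
    if successRate >= objective {
        return nil // the SLO is met, no alert
    }
    body, _ := json.Marshal(alertPayload{
        Service:   "media-task-scheduler", // hypothetical service name
        Region:    "xx",
        SLO:       "request_success_rate_5m",
        Value:     successRate,
        Objective: objective,
        FiredAt:   time.Now(),
    })
    resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}

func main() {
    // Example invocation; the URL is a placeholder.
    fmt.Println(checkAndFire(0.9985, "https://emergency-center.example/webhook/alert"))
}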

Current pain points

Large number of microservices and dependencies

Most of the team's services are built on a microservice architecture, and as an internal PaaS platform the system must provide cross-regional service globally, so both the services themselves and their infrastructure are deployed in multiple regions and multiple data centers. There are currently 30 microservices for media processing task scheduling in a single region alone. On top of that, the related infrastructure, such as databases, caches, distributed locks, and message queues, also needs to be monitored.

Figure 4: A large number of microservices monitoring dashboards

Therefore, even with a global monitoring dashboard like the one shown above, quickly locating the abnormal point in such a large service topology during an incident remains a challenging task.

Different metrics have different baselines

Incidents can usually be divided into two categories: infrastructure anomalies and traffic bursts.

The first example is database infrastructure. Under normal operation, query latency sits at a stable level, for example 10 ms. A latency increase can be either an overall increase across the database (likely due to high load) or an increase on only some instances (likely due to network jitter on a particular segment).

Figure 5: Database latency anomaly indicator

The second example is burst traffic. As an internal PaaS platform serving many teams within ByteDance through multi-tenancy, once the number of tenants reaches a certain scale it is no longer practical to keep track of when each tenant runs campaigns or generates traffic bursts. In the example below, the metrics follow a regular daily cycle, but the purple series in the red box is significantly higher than at the same time yesterday, which we call a day-over-day increase.

Figure 6: Traffic metrics increased day-over-day

Troubleshooting involves different internal systems

The third example involves errors in dependent systems. In the figure below, the number of errors in the red box is clearly much higher than in the previous half hour, which we call a period-over-period increase. In this case, you have to go to the internal PaaS platform to look up the detailed error codes and the corresponding error logs.

Figure 7: Dependent-system error metrics increased period-over-period
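
Both comparisons reduce to the same relative-change calculation, only against different baseline windows; the sketch below illustrates the idea with made-up numbers (the helper name changeRate is not part of any internal SDK):

package main

import "fmt"

// changeRate returns the relative increase of the current window over a
// baseline window (e.g. the same time yesterday, or the previous half hour).
func changeRate(current, baseline float64) float64 {
    if baseline == 0 {
        return 0 // no baseline data to compare against
    }
    return (current - baseline) / baseline
}

func main() {
    // Day-over-day: the current window vs. the same window yesterday.
    fmt.Printf("day-over-day: %.0f%%\n", changeRate(1620, 1000)*100) // 62%
    // Period-over-period: the current half hour vs. the previous half hour.
    fmt.Printf("period-over-period: %.0f%%\n", changeRate(900, 300)*100) // 200%
}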

Goals

In the three situations above, with the existing monitoring and troubleshooting methods, handling an incident means comparing multiple dashboards, or repeatedly switching query time ranges on a dashboard, to judge whether metrics are normal. Worse still, other internal systems have to be opened to search logs, which greatly prolongs the time needed to locate the problem and decide on an emergency response.

Therefore, automating this troubleshooting work to some extent would greatly speed up on-call SREs, who today troubleshoot by following the on-call SOP (standard operating procedure) manual, and would also make on-call duty less painful.

Quantitative metrics

Mean time to repair (MTTR) covers the overall time for fault discovery, fault localization and mitigation (Know), and fault recovery (Fix). The main purpose of introducing automated troubleshooting tools is to reduce the time spent on localization and mitigation. Since the system already records alert firing and recovery times for the SLO targets, MTTR was chosen as the quantitative metric for evaluating the introduction of this automation.

Figure 8: Range of mean repair time in incident time series
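
For reference, MTTR here is simply the average duration from alert firing to recovery over a statistics window; a minimal sketch with an assumed incident record structure:

package main

import (
    "fmt"
    "time"
)

// Incident is an assumed record shape: when the alert fired and when it recovered.
type Incident struct {
    FiredAt     time.Time
    RecoveredAt time.Time
}

// meanTimeToRepair averages the fire-to-recovery duration over a batch of
// incidents (e.g. a two-week window).
func meanTimeToRepair(incidents []Incident) time.Duration {
    if len(incidents) == 0 {
        return 0
    }
    var total time.Duration
    for _, in := range incidents {
        total += in.RecoveredAt.Sub(in.FiredAt)
    }
    return total / time.Duration(len(incidents))
}

func main() {
    now := time.Now()
    incidents := []Incident{
        {FiredAt: now.Add(-3 * time.Hour), RecoveredAt: now.Add(-3*time.Hour + 40*time.Minute)},
        {FiredAt: now.Add(-1 * time.Hour), RecoveredAt: now.Add(-1*time.Hour + 20*time.Minute)},
    }
    fmt.Println(meanTimeToRepair(incidents)) // 30m0s
}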

Architecture

Technical Architecture 2.0

Figure 9: Improved emergency response process

The emergency response center platform (Emergency Center) developed by the video architecture stability team has a built-in solution called SOP Engine. It provides an SDK that lets SRE members quickly develop common serverless functions such as metrics query, metrics analysis, and HTTP requests, and it supports YAML-defined state machine workflows for orchestrating customized alert diagnosis or emergency response plans.

Automated workflow design

The automated alert handling process can be summarized in the following steps:

  1. The workflow is triggered by the alert webhook. The trigger carries the alert context (time, region), together with the metrics query targets (service name, database name, message queue topic) and anomaly thresholds preconfigured in the workflow.
  2. A Parallel task triggers sub-workflows that run alert diagnosis on the operating system (CPU, memory), the infrastructure, and the microservices.
  3. Each alert diagnosis sub-workflow goes through three stages: metrics query, analysis, and result aggregation.
  4. Finally, the chatbot card for the emergency response group is assembled and sent.

Figure 10: Internal process of automated workflow

Metrics Query Function

The Metrics query function is designed with the API shown in the example below and connects to ByteDance's internal metrics platform, which is built on OpenTSDB. It provides the following capabilities, which greatly improve its reusability:

  • Query templates: go template syntax can be used in the indicator, tags, and filters, with values passed in through the template_values field.
  • Multiple time ranges in one call: different offsets such as 30 minutes ago, 1 day ago, or 1 week ago can be defined under the time_ranges field and fetched in a single function call.
  • Drill-down: the drill_downs field defines additional tags on top of the original ones, for example querying the CPU usage of the whole service first and then the CPU usage of each host in the service.
{
  "zone": "xx",
  "indicator": "service.thrift.{{ .service_name }}.call.success.throughput",
  "template_values": {
    "service_name": "my_service_name",
    "to": "redis_cache"
  },
  "aggregator": "avg",
  "tags": {
    "idc": "literal_or(*)",
    "cluster": "literal_or(*)",
    "to": "literal_or({{ .to }})"
  },
  "filters": {
    "cluster": "my_cluster_name"
  },
  "rate_option": {
    "counter": false,
    "diff": false
  },
  "start_at": "now-5m",
  "end_at": "now",
  "time_ranges": {
    "5mago": {
      "start_at": "now-10m",
      "end_at": "now-5m"
    }
  },
  "drill_downs": {
    "instances": {
      "top": 1,
      "top_aggregator": "max",
      "tags": {
        "host": "literal_or(*)"
      }
    }
  }
}
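
The template expansion behind template_values is plain go template rendering; the sketch below only illustrates that mechanism (renderMetricTemplate is a hypothetical helper, not the platform SDK):

package main

import (
    "bytes"
    "fmt"
    "text/template"
)

// renderMetricTemplate expands go template placeholders such as
// {{ .service_name }} in an indicator, tag, or filter string using the
// values from the template_values field.
func renderMetricTemplate(pattern string, values map[string]string) (string, error) {
    tmpl, err := template.New("metric").Parse(pattern)
    if err != nil {
        return "", err
    }
    var buf bytes.Buffer
    if err := tmpl.Execute(&buf, values); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    values := map[string]string{"service_name": "my_service_name", "to": "redis_cache"}
    indicator, _ := renderMetricTemplate("service.thrift.{{ .service_name }}.call.success.throughput", values)
    toTag, _ := renderMetricTemplate("literal_or({{ .to }})", values)
    fmt.Println(indicator) // service.thrift.my_service_name.call.success.throughput
    fmt.Println(toTag)     // literal_or(redis_cache)
}
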
Metrics Analysis Function

The Metrics analysis function is designed with the API shown below. It lets you choose which summary of the results (maximum, minimum, average, sum) to analyze, and supports threshold analysis, comparison against earlier time ranges, and even drill-down analysis on a specific tag of the metrics. The comparison operator and threshold can also be adjusted freely, which makes it easy to change thresholds or analysis logic later.

{
  "display": {                           // Required
    "namePrefix": "today",               // Optional, prefix for the display name, default: "current"
    "name": "Latency",                   // Required, name of the analyzed metric in the result
    "format": "latencyMs"                // Optional, display format of the result; if omitted, the value is output as is with two decimal places. Supported formats: default, percent, latency, latencyMs
  },
  "summary": "avg",                      // Required, which summary to analyze: sum, avg, max, min, count
  "threshold": {                         // Optional, threshold analysis
    "value": 4,                          // Required, threshold on the raw value
    "operator": "gt"                     // Required, comparison operator: gt, gte, lt, lte, eq, ne
  },
  "time_ranges_percentage_difference": { // Optional, compare against data at different time offsets
    "5mago": {                           // Key name, freely chosen
      "display": {                       // Required
        "name": "vs. 5 minutes ago"      // Required, name shown in the analysis result
      },
      "summary": "avg",                  // Required, which summary to analyze: sum, avg, max, min, count
      "precondition": {                  // Optional, the rate-of-change analysis runs only if the current metric meets this condition
        "value": 4,                      // Required, threshold
        "operator": "gt"                 // Required, comparison operator: gt, gte, lt, lte, eq, ne
      },
      "threshold": {                     // Optional, threshold on the rate of change
        "value": 0.1,                    // Required, threshold
        "operator": "gt"                 // Required, comparison operator: gt, gte, lt, lte, eq, ne
      }
    }
  },
  "drill_downs": {                       // Optional, analyze drill-down data
    "instances": {                       // Key name, freely chosen
      "display": {                       // Required
        "name": "Single instance"        // Required, name shown in the analysis result
      },
      "summary": "max",                  // Required, which summary to analyze: sum, avg, max, min, count
      "threshold": {                     // Optional, threshold analysis
        "value": 10,                     // Optional, threshold on a single instance's raw value
        "stdDiff": 1,                    // Optional, threshold in standard deviations between a single instance's value and the mean of the other drill-down values
        "operator": "gt"                 // Required, comparison operator: gt, gte, lt, lte, eq, ne
      }
    }
  },
  "filter": true,                        // Optional, only output analysis results that hit a threshold
  "metrics": [...]                       // Omitted, the data returned by the Metrics query function
}
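
To make the drill-down thresholds concrete, the sketch below shows one plausible reading of the stdDiff check, i.e. how many standard deviations a single instance sits above the mean of its peers; the function and parameter names are assumptions for illustration, not the SDK's actual implementation:

package main

import (
    "fmt"
    "math"
)

// stdDiffExceeded reports whether one instance's value lies more than
// stdDiffThreshold standard deviations above the mean of the other
// drill-down values (the semantics assumed here for the stdDiff field).
func stdDiffExceeded(value float64, others []float64, stdDiffThreshold float64) bool {
    if len(others) == 0 {
        return false
    }
    var sum float64
    for _, v := range others {
        sum += v
    }
    mean := sum / float64(len(others))

    var variance float64
    for _, v := range others {
        variance += (v - mean) * (v - mean)
    }
    std := math.Sqrt(variance / float64(len(others)))
    if std == 0 {
        return value > mean
    }
    return (value-mean)/std > stdDiffThreshold
}

func main() {
    // One MySQL instance at ~20600 ms while its peers hover around 500 ms.
    peers := []float64{480, 510, 495, 520, 505}
    fmt.Println(stdDiffExceeded(20600.546, peers, 1)) // true
}
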
JavaScript Execute Function

In the metrics aggregation and chatbot card assembly steps, the aggregation conditions differ from metric to metric and the card display logic differs as well; developing a dedicated function for each case would hurt reusability and development efficiency. Therefore, the github.com/rogchap/v8go package was used to build a function that dynamically executes JavaScript against the input JSON data to cover this family of use cases. JavaScript is well suited to processing JSON data: as shown below, array data in JSON can be grouped, sorted, reversed, and mapped with native JavaScript, which is very convenient.

{
  "script": "data.flat().map(x => x * 2).filter(x => x > 5).reverse()",
  "data": [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9]
  ]
}
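
As a rough sketch of how such a jsrun function can be built on v8go (assuming a recent version of the library, imported as rogchap.com/v8go, and handling only single-expression scripts; this is not the Emergency Center's actual implementation), the incoming script and data can be wrapped into one program and evaluated:

package main

import (
    "fmt"

    v8 "rogchap.com/v8go" // github.com/rogchap/v8go
)

// runScript evaluates a single JavaScript expression against JSON data by
// binding the data as the `data` argument and returning the JSON-encoded result.
func runScript(script, dataJSON string) (string, error) {
    ctx := v8.NewContext() // a fresh context keeps scripts isolated from each other
    defer ctx.Close()

    wrapped := fmt.Sprintf("JSON.stringify((function(data){ return %s; })(%s))", script, dataJSON)
    val, err := ctx.RunScript(wrapped, "jsrun.js")
    if err != nil {
        return "", err
    }
    return val.String(), nil
}

func main() {
    out, err := runScript(
        "data.flat().map(x => x * 2).filter(x => x > 5).reverse()",
        "[[1,2,3,4,5],[6,7,8,9]]",
    )
    fmt.Println(out, err) // [18,16,14,12,10,8,6] <nil>
}
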
Real-world example: MySQL latency diagnosis

The following is a real diagnosis example showing how the three functions above are combined, using MySQL latency as the case. Normally most MySQL latency stays within 1 s, but the latency of one host suddenly rises to 20.6 s. This is exactly the kind of anomaly that can cause an incident and needs to be surfaced proactively during emergency response.

Figure 11: MySQL latency anomaly on a single instance

  • Query latency
    As shown in the workflow definition below, you only need to take the metric used for the Grafana dashboard chart, together with its query conditions, time ranges, and drill-down tags, and fill them in according to the metrics query function API described above.
MetricQuery:
  type: Task
  next: MetricAnalysis
  atomicOperationRef: metric_query
  variables:
    zone: xxx
    indicator: mysql.latency.pct99
    tags:
      idc: literal_or(*)
      db: my_database
    aggregator: avg
    start_at: now-30m
    end_at: now
    time_ranges:
      1d:
        start_at: now-1d30m
        end_at: now-1d
    drill_downs:
      instances:
        top: 30
        top_aggregator: max
        tags:
          host: literal_or(*)
          port: literal_or(*)
  • Analyze latency
    The workflow definition below runs after the metrics query function. It mainly needs the display text for the analysis results, the unit of the metric, and the thresholds for the various anomaly analyses.
MetricAnalysis:
  type: Task
  next: GroupResult
  atomicOperationRef: metric_analysis
  variables:
    metrics.@: "@.data"   # take the query results from the MetricQuery step's output
    filter: true
    display:
      name: latency
      format: latencyMs
    summary: avg          # the overall latency is considered abnormal if it exceeds 500 ms
    threshold:
      value: 500
      operator: gt
    time_ranges_percentage_difference:
      1d:
        display:
          name: vs. yesterday
        summary: avg
        precondition:     # only compare with yesterday if the current average latency exceeds 200 ms
          value: 200
          operator: gt
        threshold:        # abnormal if the latency is more than 50% higher than yesterday's average
          value: 0.5
          operator: gt
    drill_downs:
      instances:
        display:
          name: single instance
        summary: max      # a single MySQL instance is considered abnormal if its latency exceeds 1 s
        threshold:
          value: 1000
          operator: gt
  • Group the results
    This workflow step is relatively simple: it takes the output of the metrics analysis function and groups and sorts it by specific tags. In this example we want to group by IDC (data center), so the JavaScript code implementing that logic is embedded in the workflow definition below.
GroupResult:
  type: Task
  end: true
  atomicOperationRef: jsrun
  resultSelector:
    mysqlLatency.@: "@.data"   # take the results from the MetricAnalysis step's output
  variables:
    data.@: "@"
    script: |                  # group the results by the IDC tag
      data = data.map(x => x.data).flat().groupBy(x => x.template_values?.idc)

      // sort the data and convert the format
      for (const key in data) {
        data[key] = data[key].
          sort((a, b) => a.original_value - b.original_value).
          reverse().
          map(x => ({
            ...x.tags,
            usage: {
              current: x.value,
              "1d": x.time_ranges_percentage_difference ? x.time_ranges_percentage_difference["1d"]?.value : "No data"
            },
            threshold: {
              current: x.threshold,
              "1d": x.time_ranges_percentage_difference ? x.time_ranges_percentage_difference["1d"]?.threshold : "No data",
              "instances": x.drill_downs?.instances
            }
          }))
      }
      data

Finally, after the three workflow steps above have run, the output below is obtained. The abnormal metrics have been grouped and filtered into a structured form with the diagnostic conclusions attached, and this can be used directly as the input of the chatbot message.

{
  "mysqlLatency": {
    "xx": [
      {
        "cluster": "xxxx",
        "idc": "xx",
        "threshold": {
          "1d": "Average latency increase vs. yesterday is greater than: 50%",
          "current": "Current average latency is greater than: 1s",
          "instances": [
            {
              "name": "mysql.latency.pct99{cluster=xxxx,dc=xx,host=xxx-xxx-xxx-001}",
              "original_value": 20600.546,
              "tags": {
                "cluster": "xxxx",
                "idc": "xx",
                "host": "xxx-xxx-xxx-001"
              },
              "threshold": "Maximum latency of a single instance is greater than: 1s",
              "value": "Single instance maximum latency: 20.6s"
            }
          ]
        },
        "usage": {
          "1d": "Average latency vs. the same time yesterday: 62%",
          "current": "Current average latency: 501.49ms"
        }
      }
    ]
  }
}

Diagnoses for application containers, such as CPU and memory, or for the application's own metrics, follow the same pattern when orchestrating the workflow; only the queried metrics and the diagnosis thresholds need to be replaced.

Benefits

With the automated analysis tools above, the video architecture team has benefited greatly in its daily emergency response. In one incident, the network of an IDC failed; as shown in the figure below, the errors and latency of one particular IP were exceptionally high. The diagnosis triggered automatically in the emergency response group surfaced the anomaly directly, so the abnormal instances could be handled immediately.

Figure 12: The automated tooling posts a summary of abnormal metrics in the incident group

After this automated process was fully rolled out, MTTR dropped noticeably. Calculated over two-week intervals from the beginning of October 2022 to the end of January 2023, MTTR fell from an initial 70 minutes to 17 minutes, an overall decrease of about 75.7%.

Summary

Faced with such a large number of microservices and such a wide variety of infrastructure dependencies, making fast decisions and taking emergency action during an incident requires more than reasonably complete monitoring: emergency response records should be reviewed regularly to identify high-frequency events and distill them into an automated troubleshooting process that shortens MTTR.
