Challenges

As Douyin's content ecosystem continues to grow, more and more large-scale events are being live-streamed on the platform; events such as the World Cup, the Spring Festival Gala, and the Asian Games have attracted huge audiences. During the Qatar World Cup, the stable, high-quality live feed provided by Douyin gave viewers an excellent experience, with peak concurrent users (PCU) for the final exceeding 37 million. How did we meet the challenges involved?

The image quality optimization link

A large-scale event broadcast involves a long link, and the details differ somewhat from event to event. The overall process can be simplified as shown in the figure below: the live signal is produced in the studio, transmitted to the CDN, and then distributed to viewers. From the perspective of image quality, the link divides into two parts: quality monitoring and quality optimization. For the portion of the link before the CDN, the focus is on monitoring — discovering problems, locating them, and driving the responsible teams to resolve them. Quality optimization is performed both on the CDN and on the client.

The following sections introduce the image quality optimization work. As event production technology improves, more and more large-scale events are recorded in 4K HDR; picture quality and clarity keep rising, and bandwidth pressure rises with them. At the same time, to accommodate the different viewing devices and bandwidth conditions on the consumer side, the server must transcode the source into multiple versions at different resolutions and bit rates for viewers to choose from. To ensure users get the best possible picture quality across bandwidths and devices, we carried out the optimizations described below.
The team improved event picture quality through self-developed adaptive ToneMapping, video noise reduction, ROI encoding, video frame interpolation, BAS sampling, and on-device super-resolution algorithms.

Adaptive ToneMapping: Most large-scale events today are recorded with HDR (high dynamic range) equipment. The team added HDR tiers for devices that support HDR playback, alongside tiers at a range of resolutions and frame rates. HDR sources offer a wider color gamut and a larger dynamic range; however, many terminal display devices cannot play HDR signals, so converting HDR to SDR (standard dynamic range) via a ToneMapping algorithm is essential.

Because HDR has a wider gamut and larger dynamic range than SDR, some information loss in the conversion is unavoidable. Common ToneMapping operators — Reinhard, Filmic, Hable — all convert HDR to SDR through fixed, hand-designed mapping curves while trying to preserve the HDR look. Live broadcast scenes, however, are highly variable, with a very large dynamic range: in World Cup stadiums, the stadium lights, the grass, and the players differ markedly in brightness, and brightness varies widely from shot to shot, while the CG graphics in Asian Games e-sports titles are comparatively stable. A fixed ToneMapping curve cannot deliver consistently good results across such changing scenes, and manually tuning conversion parameters for every match is impractical. To solve this, the team proposed a content-adaptive ToneMapping algorithm that analyzes the actual luminance statistics of the video content and adjusts the mapping dynamically, producing better results.
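To make the idea concrete, here is a minimal sketch of content-adaptive tone mapping in the spirit described: an extended Reinhard curve whose white point is derived from each frame's own luminance statistics rather than fixed in advance. This is an illustration under my own assumptions (the function name, the 0.18 mid-grey anchor, and the percentile-based white point are all hypothetical), not the team's actual algorithm.

```python
import numpy as np

def adaptive_tonemap(hdr_lum, percentile=99.0, eps=1e-6):
    """Map linear HDR luminance to [0, 1] SDR with a Reinhard-style curve
    whose exposure and white point adapt to the frame's brightness stats."""
    # Scene "key": log-average luminance, robust against a few hot pixels.
    key = np.exp(np.mean(np.log(hdr_lum + eps)))
    # Normalize so the scene key lands at mid-grey (0.18).
    scaled = 0.18 * hdr_lum / (key + eps)
    # Adaptive white point: the luminance that should map to pure white.
    white = np.percentile(scaled, percentile)
    # Extended Reinhard operator with the adaptive white point.
    sdr = scaled * (1.0 + scaled / (white * white + eps)) / (1.0 + scaled)
    return np.clip(sdr, 0.0, 1.0)

# Two synthetic "scenes": the same content shot dim and 40x brighter.
dim = np.random.default_rng(0).uniform(0.01, 0.5, size=(64, 64))
bright = dim * 40.0
out_dim, out_bright = adaptive_tonemap(dim), adaptive_tonemap(bright)
```

Because exposure and white point are derived from the frame itself, the same content rendered at very different absolute brightness maps to (nearly) the same SDR output — exactly the stability across changing shots that a fixed curve lacks.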
In subjective testing with the main audience panel, the content-adaptive ToneMapping algorithm scored far ahead of existing ToneMapping algorithms (the control in the comparison was the team's own earlier result).

BAS sampling: BAS (Byte AI Scaling) is a deep-learning-based image/video downsampling algorithm developed by ByteDance. In recent years, deep-learning-driven video processing has been widely deployed across VOD and live services, covering business lines such as Douyin and Xigua Video. In the actual streaming link, based on factors such as the user's network conditions and device capability, the source stream is delivered to the device through an adaptive bit-rate strategy to optimize the viewing experience. As part of this, the video is typically downsampled to several standard resolutions, such as Blu-ray (1080p), HD (720p), and SD (480p). As the audio/video industry and capture equipment advance, the proportion of high-resolution sources keeps growing, and most videos must be downsampled server-side to feed the adaptive bit-rate ladder — so the downsampling algorithm itself is key to QoE. Past industry practice has focused on algorithms that raise resolution (such as super-resolution) or preserve it (such as denoising), while largely neglecting methods that reduce resolution. Unlike fixed-operator algorithms such as bicubic downsampling, BAS trains a deep-learning model on high-precision data to alleviate the frequency-domain aliasing and truncation introduced by traditional methods, reducing jaggedness and detail loss.
As shown in the figure below, when downsampling a 4K ultra-high-definition source to 480p, the left image is the BAS result and the right image is the traditional bicubic result. The BAS output visibly reduces edge jaggies (lower left), eliminates moiré (lower right), and renders fine textures such as illuminated signage and the stands more clearly, for a better overall visual impression.
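The aliasing and moiré that BAS targets come from downsampling without adequately suppressing frequencies above the new Nyquist limit. The following 1-D numpy demonstration (my own toy, using a crude box pre-filter; it does not reproduce BAS's learned filter) shows how naive decimation folds high-frequency detail into a spurious low frequency, while pre-filtering suppresses it:

```python
import numpy as np

# A 1-D "image row" with fine detail above the target Nyquist limit.
n = 1024
x = np.arange(n)
row = np.sin(2 * np.pi * x * 200 / n)  # 200 cycles across the row

factor = 4  # downsample 4x: the target Nyquist is 1024/(2*4) = 128 cycles

# Naive decimation: keep every 4th sample.
# 200 cycles fold to |256 - 200| = 56 cycles — visible aliasing/moiré.
naive = row[::factor]

# Low-pass filter first (box filter as a crude anti-alias), then decimate.
kernel = np.ones(factor) / factor
filtered = np.convolve(row, kernel, mode="same")[::factor]

# Energy at the aliased frequency bin (56 cycles over 256 samples).
spec_naive = np.abs(np.fft.rfft(naive))
spec_filt = np.abs(np.fft.rfft(filtered))
alias_bin = 56
```

A fixed kernel like bicubic makes one static trade-off between aliasing suppression and detail retention; BAS's learned, content-aware filtering is what lets it do better on both at once.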
In a quantitative comparison against bicubic, BAS achieved a BD-Rate gain of -20.32% on the PSNR metric, meaning it can save more than 20% of the bit rate at the same reconstruction error level, or improve quality at the same bit rate. On VMAF, a metric that correlates better with human perception, BAS achieved a BD-Rate gain of -20.89%. Under common encoding conditions, the BAS algorithm reduces the average bit rate of UGC videos by 6.12% while improving a number of key subjective and objective quality metrics — cutting transmission bandwidth while improving picture quality, a win on both cost and experience.

Video frame interpolation: Douyin's large-scale events span a variety of production standards, including 1080p 25fps. Viewers are now accustomed to the smoothness of high-frame-rate video, and for low-frame-rate content they clearly perceive the reduced fluidity, which hurts the viewing experience. For low-frame-rate scenes we use intelligent frame interpolation: estimate the optical flow between the previous and next frames, warp the pixels of both frames to the intermediate time according to the flow, and then fuse them to synthesize the middle frame, raising the frame rate and reducing perceived stutter. For e-sports scenes with higher frame-rate requirements, we made the following additional optimizations.
The faster optical flow module and faster correction module use partial convolutions in place of ordinary convolutions, cutting convolution operations while preserving quality. When computing optical flow, content-adaptive downsampling shrinks the input used to estimate the flow, residual, and occlusion mask; the results are then upsampled back to the original resolution, where warping and fusion operate on the full-resolution input. Because the two most compute-heavy modules — flow estimation and correction — run at reduced resolution, total computation drops further. On the engineering side, the team reduced I/O and floating-point operations through operator fusion and half precision, making the pipeline more than twice as fast as before. Multi-GPU deployment then scaled intelligent interpolation up to higher-resolution (4K) scenarios.

E-sports scenes bring another difficulty. In games such as Honor of Kings, each hero carries the player's name overhead; these characters are small and follow the hero's complex motion. Optical flow estimation on such small, fast-moving text is often inaccurate, so interpolated frames tend to place the characters incorrectly, producing artifacts. We therefore added more small characters — moving randomly or holding still — to the training data, so the model learns to handle the complex motion of small text, achieving better interpolation results, as shown in the figure below (left: the optimized result).
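The flow-based midpoint synthesis described above — warp both neighbors halfway along the flow, then fuse — can be sketched in toy form. This is an illustrative stand-in, not the team's model: the "flow" here is a single known integer horizontal shift, and warping is a wrap-around roll rather than bilinear resampling along a dense per-pixel flow field with occlusion handling.

```python
import numpy as np

def warp_half(frame, flow_x):
    """Shift a frame by half the (integer, horizontal-only) flow — a toy
    stand-in for backward warping along a dense optical-flow field."""
    return np.roll(frame, shift=flow_x // 2, axis=1)

def interpolate_midframe(frame0, frame1, flow_x):
    # Warp frame0 forward and frame1 backward to the temporal midpoint,
    # then blend the two hypotheses equally.
    fwd = warp_half(frame0, flow_x)
    bwd = warp_half(frame1, -flow_x)
    return 0.5 * (fwd + bwd)

# A bright square moving 8 px to the right between two frames.
f0 = np.zeros((32, 32)); f0[12:20, 4:12] = 1.0
f1 = np.zeros((32, 32)); f1[12:20, 12:20] = 1.0
mid = interpolate_midframe(f0, f1, flow_x=8)  # square lands halfway, cols 8:16
```

When the estimated flow is wrong — as with the small player-name text discussed above — the two warped hypotheses disagree and the blend smears into ghosting artifacts, which is why flow accuracy on small moving objects matters so much.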
ROI encoding: To balance video bit rate against subjective quality, the team applied temporal ROI technology based on an LSTM (Long Short-Term Memory network), combining detection of visually salient regions with encoding so that bits are distributed across the frame more sensibly. Beyond model design, the other major difficulty in ROI is acquiring saliency (salient object detection) data: generic saliency datasets perform poorly on large-scale events. To address this, the team collected and labeled its own dedicated datasets for several major events. For the World Cup, for example, it built a football-specific saliency dataset by using an eye tracker to record where fans focus while watching matches, which substantially improved model accuracy. Given that football scenes contain many salient objects spread across scattered regions, the team also specially optimized the detection model, improving recall and robustness across scenes while keeping detection fast, and thereby achieving better subjective quality.
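One common way saliency feeds into encoding is as per-block QP offsets: salient blocks get a lower QP (more bits), background a higher one, with the mean kept near zero so the overall bit rate holds. The sketch below is a hypothetical illustration of that mapping only — the function name, block size, and offset range are my assumptions, and real encoders expose ROI/QP maps through their own interfaces.

```python
import numpy as np

def saliency_to_qp_offsets(saliency, block=16, max_delta=6):
    """Average a per-pixel saliency map over encoder blocks and convert it
    to QP offsets: salient blocks get negative offsets (finer quantization),
    background positive/zero, centered so the mean offset stays near zero."""
    h, w = saliency.shape
    bh, bw = h // block, w // block
    blocks = (saliency[:bh * block, :bw * block]
              .reshape(bh, block, bw, block).mean(axis=(1, 3)))
    centered = blocks - blocks.mean()           # rate-neutral on average
    scale = max(abs(centered.min()), abs(centered.max())) or 1.0
    return np.round(-max_delta * centered / scale).astype(int)

# Toy saliency map: one 16x16 region (e.g. the ball) is highly salient.
sal = np.zeros((64, 64)); sal[16:32, 16:32] = 1.0
qp = saliency_to_qp_offsets(sal)  # 4x4 grid of per-block QP offsets
```

The centering step is the key design choice: ROI encoding redistributes bits toward where the eye looks rather than simply spending more of them.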
At the same time, the team applies a video denoising algorithm that removes spatial and temporal noise, turning noisy video into clean, noise-free output. Because the noise is removed, the transmission bit rate drops even as video quality improves. On the client, network speed constraints mean playback may fall back to lower-resolution tiers such as 480p/720p; when that happens, an on-device super-resolution algorithm is triggered to improve clarity. Super-resolution here refers to machine-learning/deep-learning techniques that model the video spatially and temporally to reconstruct the details missing from a low-resolution stream, so viewers experience sharper picture quality even at low-resolution tiers.
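The article does not detail the team's denoiser, so as a generic sketch of the *temporal* half of the idea, here is a minimal motion-adaptive recursive filter: blend each frame with the running result where the content is static (averaging out noise), but fall back to the fresh input where the frame difference is large, to avoid ghosting on motion. All parameters are illustrative.

```python
import numpy as np

def temporal_denoise(frames, strength=0.7, motion_thresh=0.2):
    """Recursive temporal filter: per pixel, keep `strength` of the history
    where the frame difference is small (static content), and take the new
    frame unchanged where the difference is large (likely motion)."""
    out = frames[0].astype(float)
    results = [out.copy()]
    for frame in frames[1:]:
        diff = np.abs(frame - out)
        alpha = np.where(diff < motion_thresh, strength, 0.0)
        out = alpha * out + (1.0 - alpha) * frame
        results.append(out.copy())
    return results

# A static grey scene corrupted by independent Gaussian noise per frame.
rng = np.random.default_rng(42)
clean = np.full((16, 16), 0.5)
noisy = [clean + rng.normal(0.0, 0.05, clean.shape) for _ in range(10)]
denoised = temporal_denoise(noisy)
```

Averaging uncorrelated noise across frames is also why denoising saves bit rate: the encoder no longer spends bits reproducing random grain that changes every frame.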
In addition, the team delivers high resolution, high frame rate, and wide color gamut, and applies a variety of image quality enhancement technologies such as color enhancement and adaptive sharpening to present a more immersive ultra-high-definition picture.
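The core idea behind sharpening enhancements of this kind is unsharp masking: subtract a blurred copy to isolate detail, then add the detail back scaled up. The sketch below is a bare-bones illustration (a box blur stands in for a Gaussian, and the parameters are my own; "adaptive" production sharpeners additionally vary the amount with local content):

```python
import numpy as np

def unsharp_mask(img, amount=0.8, radius=1):
    """Sharpen by boosting the difference between the image and a blurred
    copy. Box blur (separable, size 2*radius+1) approximates a Gaussian."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    # Separable blur: filter rows, then columns.
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, blurred)
    return np.clip(img + amount * (img - blurred), 0.0, 1.0)

# A soft vertical edge: sharpening should increase contrast across it
# while leaving flat regions untouched.
img = np.full((16, 16), 0.2); img[:, 8:] = 0.8
sharp = unsharp_mask(img)
```

Flat regions pass through unchanged because there the image equals its own blur; only transitions — edges and texture — get boosted, which is what makes the picture read as crisper without a global contrast change.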