Challenges

As Douyin's content ecosystem continues to grow, more and more large-scale events are being live-streamed on the platform; events such as the World Cup, the Spring Festival Gala, and the Asian Games have attracted huge audiences. During the Qatar World Cup, the stable, high-quality live feed provided by Douyin gave viewers an excellent experience, with peak concurrent users (PCU) for the final exceeding 37 million. How did we meet the challenges involved?

The image quality optimization link

A large-scale event broadcast involves a long link, and the details differ somewhat from event to event. The overall process can be simplified as shown in the figure below: the live signal is produced in the studio, transmitted to the CDN, and then distributed to viewers. From the perspective of image quality, the link divides into two parts: quality monitoring and quality optimization. For the portion of the link before the CDN, the focus is on monitoring — discovering problems, locating them, and driving the responsible teams to resolve them. Quality optimization is performed both on the CDN and on the client.

The following sections introduce the image quality optimization work. As event production technology improves, more and more large-scale events are recorded in 4K HDR; picture quality and clarity keep rising, and bandwidth pressure rises with them. At the same time, to accommodate the different viewing devices and bandwidth conditions on the consumer side, the server must transcode the source into multiple versions at different resolutions and bit rates for viewers to choose from. To ensure users get the best possible picture quality across bandwidths and devices, we carried out the optimizations described below.
The team improved event picture quality through self-developed adaptive ToneMapping, video noise reduction, ROI encoding, video frame interpolation, BAS sampling, and on-device super-resolution algorithms.

Adaptive ToneMapping: Most large-scale events today are recorded with HDR (high dynamic range) equipment. The team added HDR tiers for devices that support HDR playback, alongside tiers at a range of resolutions and frame rates. HDR sources offer a wider color gamut and a larger dynamic range; however, many terminal display devices cannot play HDR signals, so converting HDR to SDR (standard dynamic range) via a ToneMapping algorithm is essential.

Because HDR has a wider gamut and larger dynamic range than SDR, some information loss in the conversion is unavoidable. Common ToneMapping operators — Reinhard, Filmic, Hable — all convert HDR to SDR through fixed, hand-designed mapping curves while trying to preserve the HDR look. Live broadcast scenes, however, are highly variable, with a very large dynamic range: in World Cup stadiums, the stadium lights, the grass, and the players differ markedly in brightness, and brightness varies widely from shot to shot, while the CG graphics in Asian Games e-sports titles are comparatively stable. A fixed ToneMapping curve cannot deliver consistently good results across such changing scenes, and manually tuning conversion parameters for every match is impractical. To solve this, the team proposed a content-adaptive ToneMapping algorithm that analyzes the actual luminance statistics of the video content and adjusts the mapping dynamically, producing better results.
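To make the idea concrete, here is a minimal sketch of content-adaptive tone mapping in the spirit described: an extended Reinhard curve whose white point is derived from each frame's own luminance statistics rather than fixed in advance. This is an illustration under my own assumptions (the function name, the 0.18 mid-grey anchor, and the percentile-based white point are all hypothetical), not the team's actual algorithm.

```python
import numpy as np

def adaptive_tonemap(hdr_lum, percentile=99.0, eps=1e-6):
    """Map linear HDR luminance to [0, 1] SDR with a Reinhard-style curve
    whose exposure and white point adapt to the frame's brightness stats."""
    # Scene "key": log-average luminance, robust against a few hot pixels.
    key = np.exp(np.mean(np.log(hdr_lum + eps)))
    # Normalize so the scene key lands at mid-grey (0.18).
    scaled = 0.18 * hdr_lum / (key + eps)
    # Adaptive white point: the luminance that should map to pure white.
    white = np.percentile(scaled, percentile)
    # Extended Reinhard operator with the adaptive white point.
    sdr = scaled * (1.0 + scaled / (white * white + eps)) / (1.0 + scaled)
    return np.clip(sdr, 0.0, 1.0)

# Two synthetic "scenes": the same content shot dim and 40x brighter.
dim = np.random.default_rng(0).uniform(0.01, 0.5, size=(64, 64))
bright = dim * 40.0
out_dim, out_bright = adaptive_tonemap(dim), adaptive_tonemap(bright)
```

Because exposure and white point are derived from the frame itself, the same content rendered at very different absolute brightness maps to (nearly) the same SDR output — exactly the stability across changing shots that a fixed curve lacks.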
In subjective testing with the main audience panel, the content-adaptive ToneMapping algorithm scored far ahead of existing ToneMapping algorithms (the control in the comparison was the team's own earlier result).

BAS sampling: BAS (Byte AI Scaling) is a deep-learning-based image/video downsampling algorithm developed by ByteDance. In recent years, deep-learning-driven video processing has been widely deployed across VOD and live services, covering business lines such as Douyin and Xigua Video. In the actual streaming link, based on factors such as the user's network conditions and device capability, the source stream is delivered to the device through an adaptive bit-rate strategy to optimize the viewing experience. As part of this, the video is typically downsampled to several standard resolutions, such as Blu-ray (1080p), HD (720p), and SD (480p). As the audio/video industry and capture equipment advance, the proportion of high-resolution sources keeps growing, and most videos must be downsampled server-side to feed the adaptive bit-rate ladder — so the downsampling algorithm itself is key to QoE. Past industry practice has focused on algorithms that raise resolution (such as super-resolution) or preserve it (such as denoising), while largely neglecting methods that reduce resolution. Unlike fixed-operator algorithms such as bicubic downsampling, BAS trains a deep-learning model on high-precision data to alleviate the frequency-domain aliasing and truncation introduced by traditional methods, reducing jaggedness and detail loss.
As shown in the figure below, when downsampling a 4K ultra-high-definition source to 480p, the left image is the BAS result and the right image is the traditional bicubic result. The BAS output visibly reduces edge jaggies (lower left), eliminates moiré (lower right), and renders fine textures such as illuminated signage and the stands more clearly, for a better overall visual impression.
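The aliasing and moiré that BAS targets come from downsampling without adequately suppressing frequencies above the new Nyquist limit. The following 1-D numpy demonstration (my own toy, using a crude box pre-filter; it does not reproduce BAS's learned filter) shows how naive decimation folds high-frequency detail into a spurious low frequency, while pre-filtering suppresses it:

```python
import numpy as np

# A 1-D "image row" with fine detail above the target Nyquist limit.
n = 1024
x = np.arange(n)
row = np.sin(2 * np.pi * x * 200 / n)  # 200 cycles across the row

factor = 4  # downsample 4x: the target Nyquist is 1024/(2*4) = 128 cycles

# Naive decimation: keep every 4th sample.
# 200 cycles fold to |256 - 200| = 56 cycles — visible aliasing/moiré.
naive = row[::factor]

# Low-pass filter first (box filter as a crude anti-alias), then decimate.
kernel = np.ones(factor) / factor
filtered = np.convolve(row, kernel, mode="same")[::factor]

# Energy at the aliased frequency bin (56 cycles over 256 samples).
spec_naive = np.abs(np.fft.rfft(naive))
spec_filt = np.abs(np.fft.rfft(filtered))
alias_bin = 56
```

A fixed kernel like bicubic makes one static trade-off between aliasing suppression and detail retention; BAS's learned, content-aware filtering is what lets it do better on both at once.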
In a quantitative comparison against bicubic, BAS achieved a BD-Rate gain of -20.32% on the PSNR metric, meaning it can save more than 20% of the bit rate at the same reconstruction error level, or improve quality at the same bit rate. On VMAF, a metric that correlates better with human perception, BAS achieved a BD-Rate gain of -20.89%. Under common encoding conditions, the BAS algorithm reduces the average bit rate of UGC videos by 6.12% while improving a number of key subjective and objective quality metrics — cutting transmission bandwidth while improving picture quality, a win on both cost and experience.

Video frame interpolation: Douyin's large-scale events span a variety of production standards, including 1080p 25fps. Viewers are now accustomed to the smoothness of high-frame-rate video, and for low-frame-rate content they clearly perceive the reduced fluidity, which hurts the viewing experience. For low-frame-rate scenes we use intelligent frame interpolation: estimate the optical flow between the previous and next frames, warp the pixels of both frames to the intermediate time according to the flow, and then fuse them to synthesize the middle frame, raising the frame rate and reducing perceived stutter. For e-sports scenes with higher frame-rate requirements, we made the following additional optimizations.
The faster optical flow module and faster correction module use partial convolutions in place of ordinary convolutions, cutting convolution operations while preserving quality. When computing optical flow, content-adaptive downsampling shrinks the input used to estimate the flow, residual, and occlusion mask; the results are then upsampled back to the original resolution, where warping and fusion operate on the full-resolution input. Because the two most compute-heavy modules — flow estimation and correction — run at reduced resolution, total computation drops further. On the engineering side, the team reduced I/O and floating-point operations through operator fusion and half precision, making the pipeline more than twice as fast as before. Multi-GPU deployment then scaled intelligent interpolation up to higher-resolution (4K) scenarios.

E-sports scenes bring another difficulty. In games such as Honor of Kings, each hero carries the player's name overhead; these characters are small and follow the hero's complex motion. Optical flow estimation on such small, fast-moving text is often inaccurate, so interpolated frames tend to place the characters incorrectly, producing artifacts. We therefore added more small characters — moving randomly or holding still — to the training data, so the model learns to handle the complex motion of small text, achieving better interpolation results, as shown in the figure below (left: the optimized result).
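The flow-based midpoint synthesis described above — warp both neighbors halfway along the flow, then fuse — can be sketched in toy form. This is an illustrative stand-in, not the team's model: the "flow" here is a single known integer horizontal shift, and warping is a wrap-around roll rather than bilinear resampling along a dense per-pixel flow field with occlusion handling.

```python
import numpy as np

def warp_half(frame, flow_x):
    """Shift a frame by half the (integer, horizontal-only) flow — a toy
    stand-in for backward warping along a dense optical-flow field."""
    return np.roll(frame, shift=flow_x // 2, axis=1)

def interpolate_midframe(frame0, frame1, flow_x):
    # Warp frame0 forward and frame1 backward to the temporal midpoint,
    # then blend the two hypotheses equally.
    fwd = warp_half(frame0, flow_x)
    bwd = warp_half(frame1, -flow_x)
    return 0.5 * (fwd + bwd)

# A bright square moving 8 px to the right between two frames.
f0 = np.zeros((32, 32)); f0[12:20, 4:12] = 1.0
f1 = np.zeros((32, 32)); f1[12:20, 12:20] = 1.0
mid = interpolate_midframe(f0, f1, flow_x=8)  # square lands halfway, cols 8:16
```

When the estimated flow is wrong — as with the small player-name text discussed above — the two warped hypotheses disagree and the blend smears into ghosting artifacts, which is why flow accuracy on small moving objects matters so much.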
ROI encoding: To balance video bit rate against subjective quality, the team applied temporal ROI technology based on an LSTM (Long Short-Term Memory network), combining detection of visually salient regions with encoding so that bits are distributed across the frame more sensibly. Beyond model design, the other major difficulty in ROI is acquiring saliency (salient object detection) data: generic saliency datasets perform poorly on large-scale events. To address this, the team collected and labeled its own dedicated datasets for several major events. For the World Cup, for example, it built a football-specific saliency dataset by using an eye tracker to record where fans focus while watching matches, which substantially improved model accuracy. Given that football scenes contain many salient objects spread across scattered regions, the team also specially optimized the detection model, improving recall and robustness across scenes while keeping detection fast, and thereby achieving better subjective quality.
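One common way saliency feeds into encoding is as per-block QP offsets: salient blocks get a lower QP (more bits), background a higher one, with the mean kept near zero so the overall bit rate holds. The sketch below is a hypothetical illustration of that mapping only — the function name, block size, and offset range are my assumptions, and real encoders expose ROI/QP maps through their own interfaces.

```python
import numpy as np

def saliency_to_qp_offsets(saliency, block=16, max_delta=6):
    """Average a per-pixel saliency map over encoder blocks and convert it
    to QP offsets: salient blocks get negative offsets (finer quantization),
    background positive/zero, centered so the mean offset stays near zero."""
    h, w = saliency.shape
    bh, bw = h // block, w // block
    blocks = (saliency[:bh * block, :bw * block]
              .reshape(bh, block, bw, block).mean(axis=(1, 3)))
    centered = blocks - blocks.mean()           # rate-neutral on average
    scale = max(abs(centered.min()), abs(centered.max())) or 1.0
    return np.round(-max_delta * centered / scale).astype(int)

# Toy saliency map: one 16x16 region (e.g. the ball) is highly salient.
sal = np.zeros((64, 64)); sal[16:32, 16:32] = 1.0
qp = saliency_to_qp_offsets(sal)  # 4x4 grid of per-block QP offsets
```

The centering step is the key design choice: ROI encoding redistributes bits toward where the eye looks rather than simply spending more of them.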
At the same time, the team applies a video denoising algorithm that removes spatial and temporal noise, turning noisy video into clean, noise-free output. Because the noise is removed, the transmission bit rate drops even as video quality improves. On the client, network speed constraints mean playback may fall back to lower-resolution tiers such as 480p/720p; when that happens, an on-device super-resolution algorithm is triggered to improve clarity. Super-resolution here refers to machine-learning/deep-learning techniques that model the video spatially and temporally to reconstruct the details missing from a low-resolution stream, so viewers experience sharper picture quality even at low-resolution tiers.
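The article does not detail the team's denoiser, so as a generic sketch of the *temporal* half of the idea, here is a minimal motion-adaptive recursive filter: blend each frame with the running result where the content is static (averaging out noise), but fall back to the fresh input where the frame difference is large, to avoid ghosting on motion. All parameters are illustrative.

```python
import numpy as np

def temporal_denoise(frames, strength=0.7, motion_thresh=0.2):
    """Recursive temporal filter: per pixel, keep `strength` of the history
    where the frame difference is small (static content), and take the new
    frame unchanged where the difference is large (likely motion)."""
    out = frames[0].astype(float)
    results = [out.copy()]
    for frame in frames[1:]:
        diff = np.abs(frame - out)
        alpha = np.where(diff < motion_thresh, strength, 0.0)
        out = alpha * out + (1.0 - alpha) * frame
        results.append(out.copy())
    return results

# A static grey scene corrupted by independent Gaussian noise per frame.
rng = np.random.default_rng(42)
clean = np.full((16, 16), 0.5)
noisy = [clean + rng.normal(0.0, 0.05, clean.shape) for _ in range(10)]
denoised = temporal_denoise(noisy)
```

Averaging uncorrelated noise across frames is also why denoising saves bit rate: the encoder no longer spends bits reproducing random grain that changes every frame.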
In addition, the team delivers high resolution, high frame rate, and wide color gamut, and applies a variety of image quality enhancement technologies such as color enhancement and adaptive sharpening to present a more immersive ultra-high-definition picture.
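The core idea behind sharpening enhancements of this kind is unsharp masking: subtract a blurred copy to isolate detail, then add the detail back scaled up. The sketch below is a bare-bones illustration (a box blur stands in for a Gaussian, and the parameters are my own; "adaptive" production sharpeners additionally vary the amount with local content):

```python
import numpy as np

def unsharp_mask(img, amount=0.8, radius=1):
    """Sharpen by boosting the difference between the image and a blurred
    copy. Box blur (separable, size 2*radius+1) approximates a Gaussian."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    # Separable blur: filter rows, then columns.
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, blurred)
    return np.clip(img + amount * (img - blurred), 0.0, 1.0)

# A soft vertical edge: sharpening should increase contrast across it
# while leaving flat regions untouched.
img = np.full((16, 16), 0.2); img[:, 8:] = 0.8
sharp = unsharp_mask(img)
```

Flat regions pass through unchanged because there the image equals its own blur; only transitions — edges and texture — get boosted, which is what makes the picture read as crisper without a global contrast change.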