Exploration and practice of intelligent film performance optimization

1. Introduction

Most editing tools now offer one-click film creation, which addresses the difficulty of editing and applying special-effects packaging in video creation. The mainstream industry approach is to identify and extract highlights from the footage a user uploads, apply template-based effects packaging in post-processing, and output the finished video. During this process the footage is cropped to fit the slots ("pits") defined by the template.

Bilibili began building its smart film-making feature in July 2022. The first version supported only "image to video": it added simple music packaging to the images the user selected and converted them into a video. The basic workflow:

In October 2022 we started the second version of smart film production, which supports video materials and expands the dimensions of effects packaging. Besides the industry-standard template effects, it adds smart soundtrack selection and automatically converts the user's audio into subtitles. The new smart film production workflow:

Together, the first and second versions completed the structural effects packaging of smart film production: the three basic intelligent elements of template, soundtrack, and ASR subtitles. However, under rapid business iteration, smart film production had only met the requirement of a fast launch (the 0-to-1 build-out); no core observable metrics were defined for its performance. As a result there were many basic experience problems, such as effect defects reported by internal users and long film-production times.

This article discusses the performance optimization practice of Bilibili's smart film production from two angles: efficiency and effect.

2. Observability Data Construction

Based on the overall business flow of smart film production, we sorted out the three core links and their sub-links, and first defined two key availability indicators.

Synthesis time: the total time from when the user selects materials and starts smart film production to when it finishes. We use the P90 value as the reference indicator.
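As a reference, P90 here means the 90th percentile of per-session synthesis times. A minimal nearest-rank sketch (the sample data is illustrative; the real metric is computed by the reporting pipeline):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(p/100 * n) via negated floor division, then 0-based index.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[int(rank) - 1]

# Synthesis times in seconds for ten smart-film sessions (made-up data).
times = [6.1, 7.4, 8.0, 8.8, 9.2, 9.9, 10.5, 12.3, 14.0, 20.2]
p90 = percentile(times, 90)  # 90% of sessions finish within this time
```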

Synthesis effect is more complex to define and measure; there are three dimensions that can be optimized:

● Material application success rate: improve the success rate with which each smart-film sub-link (base template, soundtrack, ASR subtitles) applies its material. The more sub-links apply their material successfully, the richer the final rendered effect.

This definition is somewhat idealized, but the per-sub-link application success rate is quantifiable in business terms.

● Richness of template atomic capabilities: complete the atomic capabilities of the effects-packaging template set itself, so that templates carry richer sub-elements.

This dimension supplements basic capabilities on the template side through business means and is not discussed further here.

● Accuracy of material recommendation: intelligently recommended templates, music, and other packaging effects should match the user-selected material with high accuracy.

Template and music recommendation depends on the image-label recognition rate of the AI model. That rate is evaluated mainly by hand and currently stands at 41% (68% for P0 image labels). Improving it depends on upgrading the image-recognition model itself and is not covered in detail in this article.

Of the above, we ultimately chose "material application success rate" as the main quantitative indicator of synthesis effect; this article focuses on optimizing this dimension.

With the basic indicators defined, we organized the observation data to be collected from the global perspective of smart film production.

3. Performance Optimization

Initial data showed a P90 smart-film-production time of 20 s (essentially at the timeout limit) and a material application success rate of 46%; overall usability was poor.

Based on these baseline numbers, we surveyed the three core links to identify optimization points and then worked through them.

3.1 Template link optimization

Initially, the template download success rate was only 91%, and the P90 time from frame extraction and recommendation to download completion was 19 s.

The template-recommendation business chain runs from the material-selection page to the smart-film synthesis page.

We optimized the following key points.

3.1.1 Duplicate resource downloads

There are two main types of duplicate resource downloads:

  • The smart-film workflow downloads music separately, yet the template itself also carries music sub-elements that the flow never uses (the template's music does not match the material). Our solution: the template-download business layer supports on-demand download of sub-elements, skipping unneeded materials.
  • For historical reasons, the subtitle fonts of the main Bilibili (Pink) app and of BiJian (Bilibili's editing app) are inconsistent, while UGC templates are all produced with BiJian. After downloading a template, the smart film had to re-index and download matching subtitle-font resources because the template's fonts did not match. The solution is similar: skip the fonts carried by the template and download only the fonts the individual smart film actually needs.

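The on-demand idea in both cases can be sketched as a filter over the template's sub-element manifest before the download queue is built (field names and kinds are illustrative, not the actual Bilibili API):

```python
# Sub-element kinds the smart-film flow never uses: music is skipped
# because the flow picks its own soundtrack, and template-carried fonts
# are skipped because matching fonts are fetched on demand instead.
SKIPPED_KINDS = {"music", "subtitle_font"}

def build_download_list(manifest):
    """Return only the sub-elements worth downloading.

    `manifest` is a list of dicts like {"kind": "sticker", "url": "..."}.
    """
    return [item for item in manifest if item["kind"] not in SKIPPED_KINDS]

manifest = [
    {"kind": "core_zip", "url": "t.zip"},
    {"kind": "music", "url": "m.aac"},         # carried but unused
    {"kind": "subtitle_font", "url": "f.ttf"}, # replaced by on-demand fonts
    {"kind": "sticker", "url": "s.png"},
]
to_download = build_download_list(manifest)  # keeps core_zip and sticker
```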

These are two typical duplicate-resource problems; trimming the unnecessary downloads saved download time and cut P90 by 2 s.

3.1.2 Special resource transcoding issues

We collected the traces of smart-film timeouts (the service treats anything over 20 s as a timeout), analyzed 80+ bad cases, and investigated each cause. Two scenarios accounted for most of the timeouts:

1) On iOS, subtitle-font downloads often timed out at 120 s. After repeated testing we found a bug in the business downloader: when downloading multiple subtitle fonts, the download chain got stuck until the task hit the 120 s timeout before returning a result.

2) Templates whose sub-elements contain GIF material timed out easily. Analysis showed that the third-party editing SDK used by the business defines a private material format: on the template consumer side, GIF material is transcoded into the SDK's custom CAF format, and this transcoding is slow and prone to timeout.

  • Problem 1: fix the business downloader's multi-font download flow.
  • Problem 2: two directions. The first is to render GIF directly when restoring the material effect, but because of Meishe's custom-format design no such API is exposed. The second is to convert GIF material into CAF when the template production side produces the material, so the consumer side reads CAF directly and the per-film transcoding cost disappears.


Pursuing the second direction required a series of changes along the chain:

  • Existing templates containing GIF materials had to be cleaned and converted to CAF. For new templates, the production side converts GIFs to CAF by default, which required modifying the template production tool and shipping a new build to internal and external template producers.
  • Our self-developed editing SDK ("montage") supports GIF rendering, but by then the earlier optimization had already converted materials to CAF. Two solutions address this cross-SDK material-compatibility issue:

  • Meishe supports converting CAF material back into GIF format; when the self-developed SDK finally replaces Meishe in production, materials are cleaned and converted from CAF to GIF. Meishe confirmed it can provide this API.
  • The self-developed editing SDK adds support for Meishe's proprietary CAF format, which requires reverse-engineering the format and then implementing decoding so CAF can be rendered. The internal SDK team confirmed this is also feasible.

From the perspective of business iteration and upstream template production and maintenance cost, "self-developed SDK supports CAF" is preferred; from the perspective of material-format standardization, "Meishe converts CAF back to GIF" is preferred. We ultimately chose "self-developed SDK supports CAF" as the lower-cost fix.

After the material-format conversion and the multi-subtitle-font bug were fixed, P90 time dropped significantly to 12 s.


3.1.3 Template size standardization and version compatibility

Smart-film templates are generally produced by in-house designers. Early on, materials entered the library without standardized compression, and templates had no size limit at production time; some templates were very large and slow to download. We optimized in two directions:

  • Define a template size baseline (20 MB). When a template is produced and submitted, its total size is computed in real time; if it exceeds the baseline, the per-element size breakdown is displayed so oversized sub-elements can be replaced.


  • Push the material center to compress materials in a standardized way: route new materials through the transcoding service, clean up stock materials, and deliver different material qualities for different business scenarios.

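The submission-time baseline check can be sketched as follows (the 20 MB figure is from the text; field names and the exact reporting format are assumptions):

```python
SIZE_BASELINE = 20 * 1024 * 1024  # 20 MB template volume baseline

def check_template_size(elements):
    """Sum sub-element sizes; if over baseline, report the largest offenders.

    `elements` maps sub-element name -> size in bytes. Returns (ok, report)
    where `report` lists (name, size) pairs largest-first so the designer
    can swap out the heaviest sub-elements.
    """
    total = sum(elements.values())
    if total <= SIZE_BASELINE:
        return True, []
    report = sorted(elements.items(), key=lambda kv: kv[1], reverse=True)
    return False, report

ok, report = check_template_size({
    "intro.mp4": 18 * 1024 * 1024,
    "sticker.caf": 5 * 1024 * 1024,
    "font.ttf": 1 * 1024 * 1024,
})  # 24 MB total -> over baseline, intro.mp4 flagged first
```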

Template version compatibility


A template is a set of special-effects packaging composed of multiple basic atomic capabilities (subtitles, fonts, transitions, filters, picture-in-picture, etc.) plus a standard restore protocol. The set of atomic capabilities grows with each version iteration. How should a version compatibility scheme be designed?

The naive approach is to version-gate the atomic capabilities each template supports. Its problems:

  • Template operators must know which atomic capabilities and app versions each template supports, which is hard to reason about.
  • Manually configuring the supported version numbers on the material platform is inefficient.

A more reasonable approach is for the app to maintain the list of atomic capabilities it can restore. The cloud matches that list against the atomic capabilities each template requires and sends down only the compatible templates. This solves template distribution, but some situations still require explicit version compatibility handling:

  • After an atomic capability is upgraded, its materials may not be backward compatible, so version isolation is needed: new versions get the new materials, old versions the old.
  • A performance optimization may change the material format on the production side (such as the GIF-to-CAF conversion above), which also requires isolating the new materials by version.
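The cloud-side matching described above can be sketched as a set comparison between the capabilities the app reports and those each template requires (the data shapes are assumptions for illustration):

```python
def match_templates(app_capabilities, templates):
    """Return only templates whose required atomic capabilities the app can restore."""
    caps = set(app_capabilities)
    return [t for t in templates if set(t["requires"]) <= caps]

templates = [
    {"id": "t1", "requires": {"subtitle", "filter"}},
    {"id": "t2", "requires": {"subtitle", "pip", "transition_v2"}},
]
# An older app build that has not shipped transition_v2 yet only
# receives t1; no manual per-version configuration is needed.
eligible = match_templates({"subtitle", "filter", "pip"}, templates)
```

The design point is that the app declares capabilities once per build, instead of operators hand-maintaining version ranges per template.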

The remaining problem with version isolation is that manual configuration is error-prone. Across historical releases, long gaps between versions, frequent personnel changes on the template production side, and incomplete context produced several misconfigured isolation entries, which caused template sub-element fetches to fail and dragged down the template download success rate.

We used the template download error reports to index the affected template sub-elements and calibrated their version information one by one. After the fix, the template download success rate rose to 96%.

3.1.4 Preloading and fallback for template resources

The standard industry practice is to use preloading plus a fallback to improve the success rate of downloading and applying materials. We added preloading logic in three places:

  • Pre-download a fallback template, so that when the recommended template fails to download or times out, basic packaging is still guaranteed.
  • Pre-download templates with high smart-film conversion rates, raising the cache hit rate for popular templates and reducing time consumption.
  • Shorten the template service chain by downloading the template's core zip resource and its indexed sub-elements concurrently.


We also optimized the template downloader itself: the legacy business downloader supported only serial downloads, so we integrated the platform's new download component into the business to enable concurrent downloads.
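Combining these points, the consumer-side logic is roughly "download the core zip and its sub-elements concurrently; on failure or timeout, fall back to the preloaded template". A sketch using Python's standard thread pool (`fetch` is a placeholder for the real HTTP download, not Bilibili's downloader API):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url, timeout=20):
    """Placeholder for the real HTTP download; returns the url on success."""
    if url.startswith("bad"):
        raise IOError("download failed: " + url)
    return url

def download_template(core_zip, sub_elements, fallback):
    """Concurrently fetch the core zip and its indexed sub-elements.

    If anything fails, return the preloaded fallback template instead,
    so basic packaging is still applied.
    """
    urls = [core_zip] + sub_elements
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            # Exceptions raised in workers propagate when results are read.
            return list(pool.map(fetch, urls))
    except IOError:
        return fallback

ok = download_template("core.zip", ["a.ttf", "b.png"], ["backup.zip"])
degraded = download_template("bad.zip", ["a.ttf"], ["backup.zip"])
```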

3.2 ASR Link Optimization

The second intelligent link of smart film production relies on the ASR service, which analyzes the audio data and outputs an audio classification: music, mixed, voice, or silence. The label depends on the proportion of each type of content in the track.
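The proportion-based classification can be sketched as follows; the exact thresholds used by the production ASR service are not stated in the text, so the numbers here are assumptions:

```python
def classify_audio(music_ratio, voice_ratio, silence_ratio):
    """Classify a track by the proportion of each content type.

    Ratios are fractions of total duration. The thresholds are
    illustrative, not the production ASR service's values.
    """
    if silence_ratio > 0.9:
        return "no_sound"
    if music_ratio > 0.6 and voice_ratio < 0.2:
        return "music"
    if voice_ratio > 0.6 and music_ratio < 0.2:
        return "voice"
    return "mixed"

label = classify_audio(music_ratio=0.1, voice_ratio=0.8, silence_ratio=0.1)
```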

Two optimizable points stand out in the ASR business chain.

Problem 1: the ASR service is slow. Measuring the ASR link on its own, P90 regularly exceeded 20 s, rendering it unusable.

Problem 2: the ASR link's pre-processing includes audio-file extraction and audio upload, and the upload can take a long time. The cause is historical: a business service and a file-storage service sit in the middle of the upload path purely to forward data, which adds latency and failure points.


Problem 1: working with the AI server team to reproduce extreme cases, we found the ASR interface was being flooded: the offending task pushed service QPS so high that the business's ASR requests queued for a long time. The fix was to blacklist the flooding task, after which ASR link P90 time dropped by 50%.


Problem 2: remove the middle layer from the upload path. The client calls the basic file-storage (BFS) service interface directly and returns the storage address to the AI service, shortening the chain.


3.3 Smart Music Link

The third link of smart film production is music recommendation.

AI music recommendation draws on three feature dimensions: user features, music features, and image features.

  • User features come from basic demographic stratification.
  • Music features record the music a user has matched in the past, updated once a day.
  • Image features are the image labels recognized from the materials the user selected for the smart film.

Music is recommended by weighting these three features; the image-feature dimension best matches the content of the film being produced at that moment.
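Weighted recommendation over the three dimensions can be sketched as a simple linear score; the weights are illustrative assumptions (the text says only that the image-feature dimension matters most for the current film):

```python
# Illustrative weights; the image-feature dimension gets the largest
# weight because it best reflects the film being produced right now.
WEIGHTS = {"image": 0.5, "music": 0.3, "user": 0.2}

def score(track):
    """`track` carries a 0..1 match score per feature dimension."""
    return sum(WEIGHTS[dim] * track[dim] for dim in WEIGHTS)

def recommend(tracks):
    """Pick the candidate with the highest weighted score."""
    return max(tracks, key=score)

best = recommend([
    {"id": "a", "image": 0.9, "music": 0.2, "user": 0.5},
    {"id": "b", "image": 0.3, "music": 0.9, "user": 0.9},
])
```

Note that when image features are missing (the degraded case below), the image term contributes nothing and the ranking collapses onto user and music features alone.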

During a component upgrade and replacement, the business side passed a wrong frame-extraction address to the AI service, so no image labels could be produced. The AI side silently fell back to a degraded strategy based only on user and music features (poor music-to-image matching and homogeneous results), and the business side was unaware of it.

The problem was discovered only because the AI team monitored and alarmed on the image-labeling success rate, which ran significantly below expectations for a period.

Issues fixed:

  • Roll back the online frame-extraction upload feature to the old component via configuration, stopping the loss for current users.
  • Fix the wrong frame address in a new client version and release it gradually behind the new business configuration.

Early warning: during delivery acceptance, how can business testers and developers tell whether the returned music recommendation is degraded? And after launch, how can the business side detect degradation quickly?

The AI server now returns an explicit error when image features are missing, and the client takes two actions on it:

  • The client logs the music-recommendation error (including whether the response lacked image features), making degraded cases easy to screen during acceptance.
  • The client adds real-time tracking-point reporting, counts the success rate of recommendations that include image features, sets a threshold based on observed values, and alarms on it in real time.
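The two actions can be sketched as follows; the response field `has_image_feature`, the event name, and the in-memory event list standing in for the real tracking SDK are all hypothetical:

```python
import logging

logger = logging.getLogger("smart_film.music")
events = []  # stand-in for the real tracking/reporting SDK

def on_music_recommendation(resp):
    """Handle a music-recommendation response from the AI server.

    `resp` is assumed to carry `has_image_feature`, set False when the
    server fell back to the user/music-feature-only strategy.
    """
    degraded = not resp.get("has_image_feature", False)
    if degraded:
        # Visible in client logs during delivery acceptance.
        logger.warning("music recommendation degraded: no image feature")
    # Real-time tracking point; an alarm fires when the rate of degraded
    # events crosses the configured threshold on the monitoring side.
    events.append({"event": "music_rec", "degraded": degraded})
    return degraded

on_music_recommendation({"has_image_feature": True})
on_music_recommendation({"has_image_feature": False})
```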

Through this series of optimizations, smart-film P90 time dropped to about 10 s and the material synthesis success rate rose above 90%.

4. Preventing Metric Degradation

The previous sections explained the performance-optimization work; this section covers client-side monitoring and alerting on the achieved metrics to prevent regression. We built the overall monitoring and alerting process along the following dimensions.

  • Define metrics: select the key node metrics of the three core links of smart film production, and their sub-links, as alerting targets.


  • Set thresholds: set a reasonable threshold for each selected metric, including its statistical window and trigger conditions.


  • Configure alerts: configure alert rules for each key node metric on the Fawkes platform.
  • Note: Fawkes is Bilibili's enterprise mobile SaaS platform, providing an end-to-end solution for mobile application development and deployment that improves development efficiency and application quality while reducing cost and risk.


  • Alert response: when an alert fires, the on-duty engineer responds. We follow a basic 5-10-20 principle: discover within 5 minutes, respond within 10, and locate the problem within 20.

Two questions arise here: how do we notify the on-duty engineer quickly once an alert fires, and how do we let them find the relevant error information quickly?


We configure a custom Webhook on the Fawkes alerting platform. When an alert fires, the standard webhook payload is parsed to extract the alert's key log lines, which are packaged together with that day's on-duty roster via the custom Webhook and pushed to the alert-handling group.

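The custom-webhook step can be sketched as: parse the platform's alert payload, filter the key log lines, attach today's on-duty engineer, and push the result. Field names and the duty-roster shape are assumptions; the real Fawkes payload is not shown here:

```python
def build_alert_message(alert, roster, today):
    """Assemble the message pushed to the alert-handling group.

    `alert` is the parsed webhook body; `roster` maps date -> on-duty name.
    Only log lines flagged as errors are kept, so the responder sees the
    relevant context immediately.
    """
    key_logs = [line for line in alert["logs"] if "ERROR" in line]
    return {
        "metric": alert["metric"],
        "on_duty": roster.get(today, "unassigned"),
        "key_logs": key_logs,
    }

msg = build_alert_message(
    {"metric": "template_download_success_rate",
     "logs": ["INFO start", "ERROR font fetch 404", "INFO end"]},
    {"2023-06-01": "alice"},
    "2023-06-01",
)
```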

  • Alert review: the on-duty engineer records every alert message. We review the duty records regularly and continuously calibrate alert granularity and thresholds.

The above constitutes a real-time monitoring and alerting SOP that inspects the three main links of smart film production daily. Regularly collecting, analyzing, and tuning alert information makes alerts more accurate and improves day-to-day duty efficiency.

5. Summary and Outlook

5.1 Summary

We first defined the core availability indicators of smart film production and, based on them, refined the observable data at key link nodes. Guided by that data, we optimized the time consumption and success rate of the template, ASR subtitle, and music links. Finally, we established a real-time monitoring, alerting, and duty mechanism for the core links to prevent metric regression. Data, optimization, and alerting will keep evolving:

Data: finer granularity, with continuous calibration.

Optimization: monitor material size on the template production side, standardize template material storage, and improve image-recognition accuracy.

Monitoring and alerting: complete strategy alerts (smart-music matching strategy) and smart-film time-consumption alerts, refine alert granularity, and align the alert differences between the two client platforms.

5.2 Future Directions

Smart Film 1.0 covers effects packaging with the three basic elements (template, ASR subtitles, music) and does not process the user's footage itself (Before).

Smart Film 2.0 brings the product up to its industry competitors: it uses image recognition to extract highlights intelligently and perform smart editing (in progress).

Smart Film 3.0 builds on AIGV large models to generate video content with AI and produce a film in one click (Future).


Author of this issue

Xu Huiyu, Senior Development Engineer at Bilibili


