Douyin Android Image Optimization Practice


Background

Why does Douyin need to continuously optimize its image capabilities?

Image capability is one of Douyin's most fundamental capabilities and serves all of Douyin's businesses. As image-heavy businesses such as text posts, e-commerce, and IM grow, the volume of images loaded keeps increasing, and so do the corresponding bandwidth costs. To reduce image costs and improve the user's image browsing experience, we need to continuously explore and optimize image capabilities: improving loading speed and lowering overall cost while preserving display quality, so that images are "good, fast, and economical".

About BDFresco

BDFresco is a general-purpose network image framework for Android, expanded and optimized by Volcano Engine's veImageX team on top of open-source Fresco. It mainly provides capabilities such as network image loading, image decoding, basic image processing and transformation, image service quality monitoring and reporting, self-developed HEIF software decoding, memory caching strategies, and cloud-controlled configuration delivery. It currently covers almost all ByteDance apps.

The following sections describe, from Douyin's perspective, the image optimizations Douyin has made on top of BDFresco.

Optimization ideas

The complete loading process of a network image is as follows:

The client fetches business data over the network, and the response includes the corresponding image data. Handing the image URL to BDFresco starts the image loading process. BDFresco first checks whether the image is in the memory cache or disk cache; if it is, the corresponding decode or render step runs directly. If not, the image is downloaded to the device through veImageX-CDN and then decoded and rendered.

The image loading process not only occupies client memory, storage, CPU and other resources, but also consumes network traffic and server resources.

The image loading process is essentially multi-level cache logic and can be divided into four core stages: memory cache, image decoding, disk cache, and network loading. Combined with the indicator monitoring system, each stage can be optimized separately:

  • Memory cache optimization: The current Android memory cache hit rate is as high as 50%. The memory cache lets us access images quickly at the cost of valuable app memory. However, the memory cache does not by itself cause serious OOM or jank; on the contrary, a cache configuration tuned to specific scenarios reduces repeated decoding and memory allocation, and can even improve OOM and ANR metrics.
  • Image decoding optimization: When the memory cache misses, the image file is decoded into a bitmap in memory. Currently the average decoded bitmap is 800 KB, the 90th percentile is 5 MB, and the 99th percentile is an exaggerated 11 MB. Decoding requires frequent memory allocation, and more than 15% of images are decoded at more than twice the size actually displayed, which badly hurts client performance. Reducing memory allocation in the decode stage is therefore a key problem to solve.
  • Disk cache optimization: Although the disk cache hit rate is only 10%, far below the memory cache hit rate, in theory every bitmap in memory has a corresponding original file on disk. To raise the overall cache hit rate, we focus on disk cache optimization and use storage space more efficiently through reasonable disk configuration.
  • Network loading optimization: The reported failure rate in the network stage is as high as 2.5%, but after investigation and fixes the real failure rate is below 0.1%, leaving little room for optimization there. However, network loading is the longest part of the whole process, accounting for nearly 90% of the time, and the main cost is long loading times caused by large files. The focus is therefore on avoiding oversized files and shortening network loading time.

Optimization process

Indicator Construction

Before optimizing images, it is necessary to take stock of overall image quality data. Indicator construction is a crucial step: by establishing an indicator system, we can understand the current state of images, determine optimization directions, and evaluate the effect of each optimization.

BDFresco provides log reporting capabilities. The reported image logs are cleaned by veImageX cloud data, and finally the image quality related indicators can be viewed in the veImageX cloud console. A complete data monitoring system has been established from triggering image loading to memory, decoding, disk, and network stages, covering hundreds of indicators such as loading time, success rate, client and CDN cache hit rate, file size, memory usage, and large image abnormality monitoring at each stage.

Specific measures

1 Memory cache optimization

1.1 Memory lookup optimization

Memory cache principle

BDFresco implements the image loading pipeline with Producer/Consumer interfaces for tasks such as network fetching, cache reads, and image decoding. Each stage is handled by a different Producer implementation; Producers are nested layer by layer, and the results are consumed by Consumers. A simplified view of the image memory cache logic is as follows:

Memory and disk cache reads are matched by cache key, and the cache key is derived from the Uri; it can be roughly understood as cacheKey == uri. Douyin previously ran an experiment to optimize the cache key: for different domain names serving the same resource, the host and query parameters are removed, i.e. the cacheKey is simplified to scheme://path/name.
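The normalization can be sketched in a few lines (illustrative only, not BDFresco's actual code; the URLs are hypothetical):

```java
import java.net.URI;

// Illustrative sketch: normalize a CDN URL to scheme + path so the same
// resource served from different hosts maps to one cache key.
public class CacheKeyDemo {
    static String simplifiedCacheKey(String url) {
        URI uri = URI.create(url);
        // Dropping the host leaves keys like "https:///img/cover.heic";
        // any stable string works as long as all mirrors share the same key.
        return uri.getScheme() + "://" + uri.getPath();
    }

    public static void main(String[] args) {
        String a = simplifiedCacheKey("https://cdn-a.example.com/img/cover.heic?x-expires=111");
        String b = simplifiedCacheKey("https://cdn-b.example.com/img/cover.heic?x-expires=222");
        // Host and query are stripped, so both URLs map to the same cache key.
        System.out.println(a.equals(b));
    }
}
```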

Optimization plan

When a business loads an image, BDFresco supports passing in an array of Uris that all point to the same resource on different veImageX-CDN addresses. Internally, this batch of Uris (A, B, C) is recognized as the same cache key.

As shown in the figure below, the three Uris A, B, and C are not processed strictly as [full-process lookup of A -> full-process lookup of B -> full-process lookup of C]. Instead, a memory cache lookup is first performed for each of A, B, and C, and only then is the full-process lookup performed for them in order.

Since A, B, and C are the same resource under different domains, the cache keys generated on the client are identical, so the repeated memory cache lookups are wasted work. This step runs on the UI thread, and Douyin has many multi-image scenes where a single swipe triggers several image loads, so some scenes suffered jank and dropped frames.

By removing the unnecessary memory lookups, the overall frame rate improved significantly.
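A minimal sketch of the idea, assuming the simplified cache key described earlier (this is not the actual BDFresco code): collapse the multi-CDN URL list to its distinct cache keys so the UI-thread memory lookup runs once, not once per mirror.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch (assumed logic): one memory lookup per distinct cache key,
// not one per mirror URL.
public class MemoryLookupDemo {
    static String cacheKey(String url) {
        URI u = URI.create(url);
        return u.getScheme() + "://" + u.getPath();
    }

    static int memoryLookups(List<String> mirrorUrls) {
        Set<String> distinctKeys = new LinkedHashSet<>();
        for (String url : mirrorUrls) distinctKeys.add(cacheKey(url));
        return distinctKeys.size(); // one lookup per distinct key
    }

    public static void main(String[] args) {
        List<String> abc = List.of(
                "https://cdn-a.example.com/img/1.heic",
                "https://cdn-b.example.com/img/1.heic",
                "https://cdn-c.example.com/img/1.heic");
        System.out.println(memoryLookups(abc)); // 3 URLs, 1 lookup
    }
}
```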

1.2 Splitting the animated and static image caches

The memory cache size for Douyin images was configured from the Java heap size, defaulting to 1/8 of the heap, i.e. 32 MB or 64 MB. Since Android 8.0, bitmap pixel data is stored on the native heap rather than the Java heap, so sizing the image memory cache from the Java heap is no longer reasonable. We therefore experimented with doubling the memory cache size, hoping to reduce decoding and memory allocation and improve OOM and ANR metrics.

The stability indicators after the experiment showed that although OOM decreased, the problem shifted to native crashes and ANR, which deteriorated significantly. The experiment did not meet expectations.

The cache hit rate of an image is positively correlated with the cache size. The larger the cache size, the higher the hit rate. However, as the cache size increases, the room for improving the hit rate becomes smaller and smaller.

Based on the experimental results, simply increasing the cache size will cause the memory water level to rise, causing ANR and native crash problems, and this solution is not feasible.

Currently, animated and static images share the same memory cache block, and BDFresco's cache management uses LRU eviction. When many animation frames are played, static images are easily evicted; switching back to a static image then requires re-decoding, which costs performance and degrades the user experience. Douyin has many such scenes, for example IM and personal pages that mix animated and static images.

At the same time, since simply enlarging the memory cache leaves little room to improve the hit rate, we tried isolating the animated and static image caches. Giving animated and static images one memory cache each effectively improves the hit rate and reduces decoding operations.
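The split can be modeled with two small LRU maps (toy sizes; BDFresco's real cache is more elaborate). Animation frames now only churn their own block, so static bitmaps survive playback:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the split-cache idea: animated frames get their own LRU block,
// so heavy playback can no longer evict static bitmaps.
public class SplitCacheDemo {
    static <K, V> Map<K, V> lru(int maxEntries) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> staticCache = lru(2);
        Map<String, String> animatedCache = lru(2);

        staticCache.put("avatar", "bitmap");
        // Playing many animation frames only churns the animated block.
        for (int i = 0; i < 100; i++) animatedCache.put("frame-" + i, "bitmap");

        System.out.println(staticCache.containsKey("avatar")); // survives playback
    }
}
```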

Final experimental benefits:

  • By splitting the static and animated image caches while keeping each individual cache at its original size (so the overall cache size grew), Douyin saw a significant increase in daily activity, a significant reduction in OOM, and a significant improvement in overall frame rate.
  • By splitting the caches while halving each individual cache (so the overall cache size was unchanged), Douyin Lite saw a significant increase in daily activity and average usage time per user, a significant reduction in OOM, and a significant improvement in overall frame rate.

2 Image decoding optimization

2.1 Decoding format optimization

Memory size = image width × image height × bytes per pixel

The number of bytes per pixel is determined by the color mode Bitmap.Config, i.e. the ARGB color channels, of which there are mainly 6 types:

  • ALPHA_8: There is only one alpha channel, 8 bits, and each pixel occupies 1 Byte;
  • ARGB_4444: contains 4 channels of red, green, blue and alpha, each channel is 4 bits, and each pixel occupies 2 bytes;
  • ARGB_8888: contains 4 channels: red, green, blue and alpha, each channel is 8 bits, and each pixel occupies 4 bytes;
  • RGB_565: contains 3 channels of red, green and blue, of which red occupies 5 bits, green occupies 6 bits, blue occupies 5 bits, and each pixel occupies 2 bytes;
  • RGBA_F16: contains 4 channels of red, green, blue and alpha, each channel is 16 bits (a half-precision float), and each pixel occupies 8 bytes;
  • HARDWARE: Special configuration of ARGB_8888, the Bitmap will be stored directly in the video memory.
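A quick check of the formula above with a full-screen 1080×1920 image shows why RGB_565 halves the cost:

```java
// memory = width × height × bytes per pixel
public class BitmapMemoryDemo {
    static long bitmapBytes(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }

    public static void main(String[] args) {
        System.out.println(bitmapBytes(1080, 1920, 4)); // ARGB_8888: 8294400 bytes (~7.9 MB)
        System.out.println(bitmapBytes(1080, 1920, 2)); // RGB_565: 4147200 bytes, half the cost
    }
}
```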

Currently, Douyin mainly uses two configurations, ARGB_8888 and RGB_565. ARGB_8888 supports transparent channels and has higher color quality. RGB_565 does not support transparent channels, but the overall memory usage is reduced by half. The optimization ideas of Douyin are as follows:

  • Low-end machines use RGB_565 for decoding by default to reduce memory usage.
  • Some images on Douyin do not carry transparent channels, such as all heic images, but the business specifies ARGB_8888, which causes invalid occupation of the transparent channel and waste of memory. Therefore, images without transparent channels can be forcibly downgraded to RGB_565 during the decoding stage, reducing memory usage and decoding performance loss by nearly half at the expense of a certain degree of color quality.
  • Because some bitmap operations such as rounded corners and Gaussian blur rely on the alpha channel for rendering, forcibly downgrading alpha-free images to RGB_565 can break display for some businesses. Such businesses therefore need to be whitelisted and excluded from the downgrade.

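The downgrade rule can be sketched as follows (the helper name and flags are hypothetical, not BDFresco API):

```java
// Sketch of the decode-config choice: prefer RGB_565 when the image carries
// no alpha channel or the device is low-end, unless the business is
// whitelisted (rounded corners / blur need the alpha channel).
public class DecodeConfigDemo {
    enum Config { ARGB_8888, RGB_565 }

    static Config chooseConfig(boolean hasAlpha, boolean lowEndDevice, boolean whitelisted) {
        if (whitelisted) return Config.ARGB_8888;
        if (!hasAlpha || lowEndDevice) return Config.RGB_565;
        return Config.ARGB_8888;
    }

    public static void main(String[] args) {
        System.out.println(chooseConfig(false, false, false)); // heic without alpha -> RGB_565
        System.out.println(chooseConfig(true, false, false));  // alpha needed -> ARGB_8888
    }
}
```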
2.2 heif decoding memory optimization

Optimization principle:

The original heic decoding logic in BDFresco calls the decoder through JNI, returns the decoded pixel data to the Java layer, and converts it there into a Bitmap object for display. This logic creates very large temporary objects, causing Java memory overhead and GC pressure. After optimization, large-object creation is reduced and the Bitmap is constructed directly in the native layer, which is expected to reduce heif decoding time and improve fluency to some degree.

Change the original heif image decoding process from:

Optimize to process:

Before the fix : Two large arrays are used when decoding each heic image:

  • The original data of the image, the size is the image file size, usually between 40K-700K
  • Image decoded data: the size is width × height × 4 bytes, usually between 1 MB and 11 MB

After the fix: no large arrays are used in the Java layer; only one 40 KB-700 KB native-layer DirectByteBuffer is used. This eliminates the creation of two large Java arrays, lowers the probability of GC and OOM problems caused by large allocations, and thus brings fluency and ANR benefits.
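The key change can be illustrated with plain NIO: a DirectByteBuffer keeps the encoded bytes in native memory rather than as a Java heap byte[] that pressures the GC (a sketch; the real decode path passes this buffer to the decoder over JNI):

```java
import java.nio.ByteBuffer;

// Holding the encoded bytes off the Java heap: the buffer is backed by
// native memory, so no multi-hundred-KB Java array is created per decode.
public class DirectBufferDemo {
    public static void main(String[] args) {
        int encodedSize = 700 * 1024; // e.g. a 700 KB heic file
        ByteBuffer encoded = ByteBuffer.allocateDirect(encodedSize);
        System.out.println(encoded.isDirect());   // backed by native memory
        System.out.println(encoded.capacity());
    }
}
```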

Experiments on Douyin showed significant improvements in performance indicators: Java memory usage dropped, HEIC decoding time dropped, and Android ANR decreased, which significantly increased image-and-text consumption and overall usage time.

2.3 Adaptive Control Decoding

Earlier we mentioned that more than 15% of images are decoded at more than twice the required size, which demands a large amount of memory in the decode stage, even though the control never displays such a large bitmap. By resizing to the control size during decoding and producing a small-resolution bitmap, we can minimize the memory allocated during decoding.
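A minimal sketch of decoding at control size, using a power-of-two sample factor in the spirit of Android's BitmapFactory.Options.inSampleSize (the helper below is illustrative, not BDFresco's implementation):

```java
// Pick the largest power-of-two downsampling factor such that the decoded
// bitmap is still at least as large as the target view.
public class SampleSizeDemo {
    static int computeInSampleSize(int srcW, int srcH, int reqW, int reqH) {
        int sample = 1;
        while (srcW / (sample * 2) >= reqW && srcH / (sample * 2) >= reqH) {
            sample *= 2;
        }
        return sample;
    }

    public static void main(String[] args) {
        // A 4000x3000 photo shown in a 1000x750 view decodes at 1/4 scale,
        // i.e. 16x fewer pixels in memory than decoding at full size.
        System.out.println(computeInSampleSize(4000, 3000, 1000, 750)); // 4
    }
}
```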

However, since image waste mainly comes from oversized images sent by the server, limiting the size only at the decode stage does not address the large files in the network stage; bandwidth waste and long network loading times remain. We therefore moved this optimization forward into the network loading stage; see the on-demand scaling solution in Section 4.2 for details.

3 Disk cache optimization

By optimizing the client's disk cache configuration to improve the cache hit rate and reduce the number of image requests, the image bandwidth cost can be reduced while increasing the image loading speed.

There are three types of disk cache: main disk, small disk, and independent disk. Each disk has an upper limit and uses an LRU replacement algorithm. Currently, Douyin mainly uses main disks and independent disks. The overall process is as follows: Pictures are stored on the main disk by default, and the probability of pictures being replaced is high. If the business specifies an independent disk cacheName, the specified picture will use a separate disk, and the probability of being replaced is low.

  • Increased main disk storage space: Douyin's Android image disk cache limit was 40 MB, the Fresco default, a value originally chosen for the device storage sizes of that era. For devices with more storage space, the image storage configuration can be increased to improve the disk cache hit rate.
  • Experiments show that as the storage space grows, the disk cache hit rate rises significantly, which further reduces image downloads. Raising the image storage limit to 80 MB reduced image download size on Android by 5%.
  • Independent disk promotion: For image scenarios with high reuse rates, it is recommended to connect to an independent disk cache, which can reduce the chance of being replaced by other business images LRU and improve the disk cache hit rate of images.
  • Taking IM stickers as an example, we pulled disk cache hit rate data for the IM business and found that the hit rate for sticker images was only 7%. Compared with 28% for ordinary IM images and 31% for the user's own profile page (both of which already use independent disks), the sticker disk hit rate was clearly low.
  • After moving IM stickers onto an independent disk, the number of sticker requests decreased by 27%.
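The routing rule above can be sketched as follows (names are hypothetical; the real disks are LRU-bounded stores, not plain maps):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: images go to the shared main disk cache unless the business
// supplies an independent cacheName, which maps to its own store with its
// own eviction budget.
public class DiskRoutingDemo {
    static final Map<String, Map<String, byte[]>> stores = new HashMap<>();

    static Map<String, byte[]> storeFor(String cacheName) {
        String name = (cacheName == null) ? "main" : cacheName;
        return stores.computeIfAbsent(name, k -> new HashMap<>());
    }

    public static void main(String[] args) {
        storeFor(null).put("feed-cover", new byte[0]);        // main disk
        storeFor("im_sticker").put("sticker-1", new byte[0]); // independent disk
        System.out.println(stores.size()); // two separate stores
    }
}
```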

4 Network Loading Optimization

4.1 Image format optimization
Common image formats
  • image: original image, not compressed by veImageX.
  • JPEG: The full name is Joint Photographic Experts Group, released in 1992. It is a lossy compressed raster image file format. The higher the compression rate, the worse the image quality. It also does not support transparent channels.
  • PNG: The full name is Portable Network Graphics. It was published as informational RFC 2083 in March 1997 and became an ISO/IEC standard in 2004. PNG is also a raster format, but it supports lossless compression and can carry an alpha channel.
  • WebP: an image format developed by Google and released in 2010. It supports both lossy and lossless compression, offering higher compression rates and faster loading. Compared with JPEG and PNG, file size can be reduced by 30%+ at the same image quality. WebP also supports alpha channels and animation. All Douyin Android versions currently support WebP.
  • HEIC (BVC1): images encapsulated with Volcano Engine's self-developed BVC encoder (winner of 17 first prizes; the encoder took multiple championships in the MSU competition [https://www.toutiao.com/article/6951287905268843011/?upstream_biz=doubao&source=m_redirect]), usually with the .heic file suffix. Compared with WebP, file size can be reduced by another 30%+ at the same quality, so the bandwidth benefit is more pronounced. The trade-off is that the efficient encoding slightly increases decoding cost; however, the smaller size also shortens network time, so total loading time stays roughly flat or slightly lower. Douyin's Android client now decodes heic entirely with the self-developed BVC software decoder.
  • vvic: ByteDance's self-developed image format based on the BVC2 algorithm, which uses the VVC image encoding format, also known as the BVC2 encoding format, which has a higher compression rate than HEIC's BVC1.
heic format promotion

Currently, the format best supported by the veImageX platform is heic. However, at the beginning of 2022, heic coverage on Douyin's Android client was below 50%. Directly increasing the proportion of heic in the business can significantly reduce bandwidth costs and increase image loading speed.

  • JPEG -> heic: bandwidth costs reduced by more than 80%, loading speed increased by more than 30%
  • WebP -> heif: the average file size of personal-page animated images decreased by 25.33%, and loading speed increased by more than 30%

When promoting the heif animated picture experiment, we found that the frame rate of the personal page UI had deteriorated significantly, with a frame rate drop of 6-8 frames on both high-end and low-end devices. The experiment could not be launched. To address this issue, we optimized the decoding cache logic of the heif animated picture and proposed an independent cache optimization solution for the heif animated picture.

heif dynamic image independent cache

Principle of dynamic image

After the image file is downloaded and parsed into a byte stream, BDFresco pre-decodes it before the animation starts playing. During playback, Bitmaps are rendered to the screen in the order given by the animation scheduler, and the next frame is pre-decoded while the current one plays: for example, while the 5th frame is playing, the 6th frame is decoded. Pre-decoding runs on a background thread.

The core difference between schedulers is what happens when background pre-decoding is too slow and the next frame's Bitmap does not yet exist: keep showing the current frame and wait for the background thread, or advance to the next frame and decode it directly on the main thread.

  • SmoothSlidingFrameScheduler: The default scheduler. When the pre-decoding speed of the child thread cannot keep up with the playback speed, the playback speed of the animated image will be reduced. For example, the current frame will be played repeatedly to ensure that decoding is not performed on the main thread. This will cause the playback of the animated image to be uneven, but it is very good for page performance and will not cause lag.
  • DropFramesFrameScheduler: Play strictly according to the time standard of the picture. If the pre-decoding speed is too slow, decode directly in the main thread to ensure that the corresponding frame can be decoded and rendered to the screen within the corresponding time. The disadvantage is that decoding will be performed in the main thread, which may cause page jams.
  • Custom scheduler: The business customizes the getFrameNumberToRender interface to support special logic such as reverse playback and frame skipping playback.
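A toy model of the two built-in strategies, reduced to the single decision they differ on (the interface below mirrors the idea of getFrameNumberToRender but is a simplified sketch):

```java
// Given the current frame and whether the next frame's bitmap is already
// decoded, decide which frame to render.
public class SchedulerDemo {
    interface FrameScheduler {
        int frameToRender(int currentFrame, boolean nextFrameReady);
    }

    // Repeats the current frame rather than decoding on the main thread.
    static final FrameScheduler SMOOTH = (current, ready) -> ready ? current + 1 : current;
    // Always advances; the caller must decode on the main thread if not ready.
    static final FrameScheduler DROP_FRAMES = (current, ready) -> current + 1;

    public static void main(String[] args) {
        System.out.println(SMOOTH.frameToRender(5, false));      // 5: repeat, stay smooth
        System.out.println(DROP_FRAMES.frameToRender(5, false)); // 6: advance, may jank
    }
}
```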

Independent Cache

Investigating the heif frame-drop problem revealed that heif animations used a new scheduling logic, FixedSlidingHeifFrameScheduler, with no pre-decoding at all: each frame is decoded directly on the main thread at the moment it needs to be played (play one frame, decode one frame). As a result, heif animation playback consumes a large amount of main-thread CPU for decoding.

Why must heif animations be decoded in the main thread?

Compared with other animated images that support arbitrary frame decoding, heif animated images use inter-frame compression and introduce the concept of I frame and P frame. I frame is a key frame that contains complete information about the current image and can be decoded independently. P frame is a difference frame that does not have complete picture data, but only data that is different from the previous frame. It cannot be decoded independently and decoding depends on the previous frame data.

Since BDFresco's memory cache on Android uses LRU eviction, a cached Bitmap may be recycled at any time. Heif animation frames must therefore be decoded strictly in playback order; otherwise playback can show corrupted frames, such as garbled or green screens.

Solution thinking:

  • Solve the problem at the source by changing heif animation encoding and decoding. However, the current heif frame structure determines the decoder's logic; supporting decoding of arbitrary frames would require changing the heif encoding format, which is not feasible.
  • Instead of decoding on the main thread, dedicate a background thread to heif animation decoding: when the main thread needs a frame, it requests decoding on that thread and renders once notified. However, this significantly changes BDFresco's decoding pipeline and does not support the memory cache, so the option was shelved.
  • Douyin shares the same decoder on Android and iOS, yet the iOS experiment showed no frame rate degradation, because the iOS image memory cache is fully controlled and never releases entries unexpectedly. The Android side can borrow this idea.
  • Open a separate memory cache block for heif animations and hold strong references to the decoded Bitmaps, so they are never released passively or evicted by other images via LRU. This solution perfectly reuses the old decoding logic and still supports background pre-decoding; it only requires caching the Bitmaps separately.
  • Since the Bitmaps are strongly referenced and the cache block has no upper limit, memory could grow without bound, so there must be an active release point that reduces memory without disturbing decode order. We therefore hook the view's detach callback: when animated image controls scroll quickly, the Bitmaps of views that are no longer visible are actively released.
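The independent cache idea reduces to a strong-reference map plus an explicit release point (a simplified sketch, not the production code; byte[] stands in for Bitmap):

```java
import java.util.HashMap;
import java.util.Map;

// Strong references keep decoded frames from being evicted mid-playback
// (which would break P-frame decoding); detaching the view releases them
// so memory cannot grow without bound.
public class HeifFrameCacheDemo {
    static class HeifFrameCache {
        private final Map<Integer, byte[]> frames = new HashMap<>();
        void put(int index, byte[] bitmap) { frames.put(index, bitmap); }
        boolean has(int index) { return frames.containsKey(index); }
        void onViewDetached() { frames.clear(); } // active release point
    }

    public static void main(String[] args) {
        HeifFrameCache cache = new HeifFrameCache();
        for (int i = 0; i < 10; i++) cache.put(i, new byte[0]);
        System.out.println(cache.has(3));  // frames stay resident during playback
        cache.onViewDetached();
        System.out.println(cache.has(3));  // released once the view scrolls away
    }
}
```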

After experiments, we finally adopted an independent cache solution, which achieved bandwidth benefits while maintaining no significant degradation in the frame rate of individual pages.

4.2 Scaling on Demand

Background

The image loading process ultimately renders the decoded bitmap on a control. When the bitmap is larger than the control, the extra pixels do not improve what the user perceives, since the displayed pixels cannot exceed the control's area. When the image size is far larger than the control size:

  • Causes a certain degree of bandwidth waste;
  • The image is too large, causing serious performance loss on the client side.
  • Different businesses crop the same image to different sizes without coordination, fragmenting image sizes, which significantly lowers the veImageX-CDN cache hit rate and ultimately makes back-to-origin costs surge.

Solution

When displaying an image, the corresponding bitmap and control size are reported. From the reported data, it can be seen that the image size of a large number of business requests is much larger than the control. Therefore, a general solution is needed. Under the premise of ensuring image quality, the client provides a set of control specifications to converge the image to a fixed size according to the control size, ensuring that the image size is basically consistent with the display control, while reducing the image fragmentation problem.
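Size convergence can be sketched as snapping the requested width up to a small set of tiers, so many slightly different controls request the same CDN rendition (the bucket values below are hypothetical):

```java
// Snap a control width up to the nearest size tier to reduce fragmentation.
public class SizeBucketDemo {
    static final int[] WIDTH_BUCKETS = {240, 360, 480, 720, 1080}; // hypothetical tiers

    static int bucketWidth(int controlWidthPx) {
        for (int bucket : WIDTH_BUCKETS) {
            if (bucket >= controlWidthPx) return bucket;
        }
        return WIDTH_BUCKETS[WIDTH_BUCKETS.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(bucketWidth(300));  // 360
        System.out.println(bucketWidth(333));  // 360: same rendition, better CDN hit rate
        System.out.println(bucketWidth(2000)); // 1080: never exceed the largest tier
    }
}
```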

Double-column cover scenarios exist in multiple businesses such as personal pages, local pages, and recommendations. Here we take a double-column cover as an example:

Benefits
  • Visual search scene: file size reduced by 83.39%, memory size reduced by 66.57%
  • veImageX-CDN cache hit rate increased by 6.99 percentage points, and back-to-origin requests decreased by 23.79%

5 Abnormal recovery

Although we have made a series of optimizations to the image loading process, Douyin simply has a huge number of images, and some businesses such as e-commerce and IM demand high image clarity and support operations such as zooming and long-image display, so they load very large images directly into memory; a single image can occupy 100 MB+ of memory. Large allocations occur in the disk IO stage, the decode stage, and the Bitmap copy process alike, ultimately causing jank, ANR, and even OOM crashes. A backup mechanism is therefore needed to resolve frequent image OOMs and improve loading reliability.

When system memory approaches its ceiling, Douyin relieves pressure by releasing image memory: it listens for the system's memory warning callbacks and releases image memory caches of different sizes according to the warning level, reducing the probability of OOM and ANR. However, because very large images still exist, a large number of OOMs remained.

OOM backup

Memory is a global indicator, so the cause of an exception cannot be determined from the OOM stack alone: when OOM occurs the memory level may already be high, and even a small allocation can trigger the exception. However, most of the top 5 crash stacks were image-related, so it was reasonable to suspect the app's frequent large image memory requests.

Therefore, for the high-frequency image decoding and memory copy logic, a backup logic is added. When the code encounters OOM, it is actively caught and some memory is released by clearing the memory cache occupied by the image to reduce the memory level:

  • Clear two levels of memory cache, decoded memory cache + undecoded memory cache
  • Clear the animated preview frames cached in the access layer
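The backup logic above can be sketched as a catch-and-retry wrapper (a simplified model; the real code wraps the high-frequency decode and copy call sites):

```java
import java.util.function.Supplier;

// Catch OutOfMemoryError at the decode/copy call site, clear the image
// memory caches to lower the memory level, then retry once.
public class OomFallbackDemo {
    static <T> T decodeWithFallback(Supplier<T> decode, Runnable clearImageCaches) {
        try {
            return decode.get();
        } catch (OutOfMemoryError oom) {
            clearImageCaches.run(); // drop decoded + encoded caches and preview frames
            return decode.get();    // retry once at a lower memory level
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        String result = decodeWithFallback(() -> {
            if (attempts[0]++ == 0) throw new OutOfMemoryError("simulated");
            return "bitmap";
        }, () -> System.out.println("caches cleared"));
        System.out.println(result);
    }
}
```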

The experimental results show that although some OOMs are converted into native crashes, the overall impact on users is greatly reduced, and the experiment is in line with expectations.

Summary

Overall, after building full-link monitoring of images, Douyin has made a lot of optimizations to the image loading process based on data analysis.

  1. Improved image loading speed and performance
  2. Reduced the total cost of the image

From the perspective of benefits, it can be roughly divided into two aspects: cost optimization and client experience optimization. The cost benefit is mainly the reduction of image bandwidth costs, and the experience benefit is reflected in the daily activity and OOM indicators. As various optimization solutions are promoted to more business lines, the benefits are also increasing continuously.

This article briefly introduces Douyin's best practices, experience, and business benefits for image optimization based on BDFresco. Due to space limitations, this article omits details such as the exploration process and specific implementation, but still hopes to provide some inspiration or reference for colleagues in the industry. Currently, BDFresco has been integrated into the veImageX product of Volcano Engine and is open to the industry. If you want to experience the same image optimization capabilities of Douyin, you can apply for it on the official website of Volcano Engine veImageX.

Reference: Volcano Engine veImageX provides an end-to-end one-stop overall image solution, including image and material hosting, image processing and compression, distribution, client encoding and decoding, and image loading SDK full-link capabilities. Official website address: https://www.volcengine.com/product/imagex
