Technical challenges and practical summary behind the hundreds of billions of visits to WeChat Moments

1. Introduction

WeChat Moments comprises two business architectures: pictures and videos. Pictures generate a large number of requests and consume substantial computing resources, while videos mainly consume bandwidth.

Moments data is stored permanently, so as the business has grown rapidly, consumption of storage capacity, bandwidth and equipment has increased significantly. Heavier usage during major holidays amplifies this consumption and puts tremendous pressure on operations and maintenance staff.

Technical assurance during holidays covers three aspects:

  • 1) Software assurance: reducing load through optimization and evaluation at the program and business-logic level;
  • 2) Hardware assurance: assessing and expanding bandwidth and machine capacity;
  • 3) Flexible measures: adjusting the business to cut the resources spent on less important features so that key features keep running normally.

2. Software architecture assurance

The overall architecture of Moments is shown in the figure below:

The Moments architecture consists mainly of two types of sites, OC and IDC:

  • IDC refers to the data center, which is where the data is ultimately stored;
  • OC refers to an independent machine room with external network access, and SOC refers to a larger OC.

Each IDC has a complete set of interface machines, logic-layer devices and storage devices to support user uploads, downloads and file storage.

The main function of an OC point is to provide external network access and carry users' download traffic. The devices in each OC together form a cache pool. When a user downloads a file that is not in the local OC's cache, the OC goes back to the IDC to pull it. All OCs have the same function, and users generally download from the nearest OC point. When a single OC point fails, the download is retried or switched to another OC point so that it still succeeds.
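
The cache-miss path described above can be sketched roughly as follows. This is a minimal illustration only: the `IDC` and `OC` classes, their methods and the in-memory dictionaries are hypothetical stand-ins, not the actual Moments components.

```python
class IDC:
    """Hypothetical origin site: the data center that holds every file."""

    def __init__(self, files):
        self.files = dict(files)
        self.pulls = 0  # how many back-to-source requests were served

    def fetch(self, file_id):
        self.pulls += 1
        return self.files[file_id]


class OC:
    """Hypothetical edge site: serve from the cache pool, else pull from the IDC."""

    def __init__(self, idc):
        self.idc = idc
        self.cache = {}

    def download(self, file_id):
        if file_id not in self.cache:
            # Cache miss: go back to the IDC and keep a copy locally.
            self.cache[file_id] = self.idc.fetch(file_id)
        return self.cache[file_id]


idc = IDC({"img-1": b"jpeg-bytes"})
oc = OC(idc)
first = oc.download("img-1")   # miss, pulled from the IDC
second = oc.download("img-1")  # hit, served from the OC cache
```

Only the first download reaches the IDC; repeated downloads of the same file are absorbed by the OC.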

3. Disaster recovery and retry mechanism

Module-level disaster recovery in Moments mainly aims to remove failed single machines automatically. The master management server detects abnormal devices in its IP list through heartbeat detection and similar methods, blocks the faulty IPs, and stops returning them to the front end.
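
A minimal sketch of heartbeat-based removal, assuming a hypothetical `Master` class and an illustrative 3-second timeout (neither comes from the original system):

```python
class Master:
    """Hypothetical master that only hands out IPs with a recent heartbeat."""

    def __init__(self, ips, timeout_s=3.0):
        self.timeout_s = timeout_s  # assumed value, for illustration only
        self.last_seen = {ip: 0.0 for ip in ips}

    def heartbeat(self, ip, now):
        self.last_seen[ip] = now

    def healthy_ips(self, now):
        # IPs whose last heartbeat is too old are blocked from the list.
        return sorted(ip for ip, seen in self.last_seen.items()
                      if now - seen <= self.timeout_s)


master = Master(["10.0.0.1", "10.0.0.2"])
master.heartbeat("10.0.0.1", now=10.0)
master.heartbeat("10.0.0.2", now=10.0)
master.heartbeat("10.0.0.1", now=14.0)  # 10.0.0.2 has gone silent
available = master.healthy_ips(now=15.0)
```

Passing the clock in as `now` keeps the sketch deterministic; a real service would read a monotonic clock instead.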

Take single-machine removal at the Front layer as an example:

If an entire OC or IDC point fails, the change involved is large, so recovery generally relies on manual switching by operations and maintenance staff, or on the retry mechanism between modules.

Download retries in Moments:

Both for downloads from the user to an OC and for back-to-source requests from an OC to an IDC, a failed request is retried twice by default, and each retry always selects a remote access point so that it does not hit the failed node again. To make this possible, the master at each layer returns at least two groups of IP lists to the front end and ensures that these groups are remote nodes, so the front end can retry remotely when a request fails.
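
One way to read the remote-retry rule is sketched below. The function name, the `fetch` callback and the IP addresses are all hypothetical, and the sketch assumes that each retry uses a different remote group returned by the master.

```python
def download_with_retry(file_id, primary_ip, remote_groups, fetch, max_retries=2):
    """Try the primary node once, then retry against remote groups only."""
    tried = [primary_ip]
    try:
        return fetch(primary_ip, file_id), tried
    except ConnectionError:
        pass
    # Each retry goes to a different remote group, never back to the failed node.
    for group in remote_groups[:max_retries]:
        ip = group[0]
        tried.append(ip)
        try:
            return fetch(ip, file_id), tried
        except ConnectionError:
            continue
    raise ConnectionError(f"all {len(tried)} attempts failed for {file_id}")


def fake_fetch(ip, file_id):
    # Simulated transport: every node in the 10.1.x.x site is down.
    if ip.startswith("10.1."):
        raise ConnectionError("local site is down")
    return f"{file_id}@{ip}"


data, tried = download_with_retry(
    "img-1", "10.1.0.1", [["10.2.0.1"], ["10.3.0.1"]], fake_fetch)
```

With two remote groups and two retries, a request touches at most three nodes, which is also why retries multiply load during a wide failure.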

However, retries are a double-edged sword because they increase the number of requests. During holiday peaks the request volume is already much higher, so retries are more likely to cause problems and need to be adjusted:

  • 1) Turn off retries through routing configuration pushed down by the master. This is done for festivals such as New Year's Day and the Spring Festival, when the number of requests increases several times over;
  • 2) Have on-duty personnel monitor closely and turn off retries manually if the IDC failure rate exceeds 20%. This is done for festivals such as the Mid-Autumn Festival and National Day, when growth is modest.
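
The two rules above can be expressed as a small decision function. This is only a sketch: the function names and the 2.0 growth threshold are assumptions (the article says only "several times"), while the 20% failure-rate limit comes from the text. The second helper estimates request amplification under the simplifying assumption that every attempt fails independently at the same rate.

```python
def retries_allowed(holiday_growth_multiple, idc_failure_rate,
                    high_growth_threshold=2.0, failure_rate_limit=0.20):
    """Return True if retries should stay enabled."""
    # Rule 1: on high-growth festivals retries are switched off up front.
    if holiday_growth_multiple >= high_growth_threshold:
        return False
    # Rule 2: otherwise retries stay on until the failure rate passes the limit.
    return idc_failure_rate <= failure_rate_limit


def request_amplification(failure_rate, retries=2):
    """Expected requests per original request: 1 + f + f^2 + ... (assumed model)."""
    return sum(failure_rate ** k for k in range(retries + 1))


spring_festival = retries_allowed(5.0, 0.01)
national_day_ok = retries_allowed(1.3, 0.05)
national_day_bad = retries_allowed(1.3, 0.25)
amplification = request_amplification(0.20)
```

Under this model, a 20% failure rate with two retries adds roughly a quarter more requests on top of the holiday peak.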

Retry control interface of the Front module:

4. Hardware support

4.1 Capacity Assessment and Equipment Expansion

Before important holidays, operations and maintenance staff work with the resource group to expand the equipment capacity of each machine room and module according to the business budget, expected business growth and actual load. If requests grow beyond the budgeted capacity, the excess is reduced or rejected through flexible measures or overload protection.

Evaluation Methodology:

  • 1) The capacity of the computer room is mainly evaluated based on the upper limit of the switch bandwidth;
  • 2) The capacity of the access layer equipment is mainly evaluated based on the CPU, memory load ratio, and network card traffic/packet ratio;
  • 3) The capacity of the storage layer device is mainly evaluated based on the CPU, memory load ratio, and disk IO read and write times.

4.2 Spring Festival Moments upload load

For the Spring Festival, the business side requires support for 9 times growth in uploads and 1 times growth in downloads; requests beyond these ratios may be rejected. However, even after expanding within the budget to the level shown in the figure above, some modules still cannot support this growth, the compression module in particular. Each additional multiple of growth requires expanding this module by a large number of virtual machines, which the budget cannot cover. In this case, a flexible strategy is needed to solve the problem.

5. Introduction to flexible strategy

The flexible strategy of Moments has two layers:

  • The first layer is coarse flexibility: upload and download requests are limited directly, by proportion and by business. Restricted requests are returned to the user as failures, the same as in WeChat C2C. This is generally used to restore service quickly when load exceeds the system's estimated capacity and causes failures;
  • The second layer is flexibility based on business characteristics: system load is reduced at the business level by lowering the clarity of pictures and videos, delaying user updates, and so on. The rest of this article describes business flexibility in detail.
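
The first layer, limiting by proportion and by business, might look roughly like the sketch below. The `ProportionalLimiter` class and the 30% figure are hypothetical, and a deterministic counter is used instead of random sampling so that the behavior is easy to check.

```python
class ProportionalLimiter:
    """Hypothetical limiter that rejects a fixed share of requests per business."""

    def __init__(self, reject_percent):
        self.reject_percent = dict(reject_percent)  # e.g. {"upload": 30}
        self.seen = {}

    def allow(self, business):
        n = self.seen.get(business, 0)
        self.seen[business] = n + 1
        # Out of every 100 consecutive requests, reject the first `percent`.
        return (n % 100) >= self.reject_percent.get(business, 0)


limiter = ProportionalLimiter({"upload": 30})
upload_results = [limiter.allow("upload") for _ in range(1000)]
download_results = [limiter.allow("download") for _ in range(1000)]
```

Rejecting the first 30 of every 100 requests is bursty; a production limiter would more likely spread rejections evenly or sample at random.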

Main growth and bottlenecks of the Moments business: according to the equipment load assessment figure above, within the budget the access layer and the logic layer can only support a 5-fold increase, while the compression module can only support a 1-fold increase.
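
The comparison behind this assessment can be written out as a short calculation. The growth figures come from the text (9 times for uploads; 5, 5 and 1 for the three layers); the dictionary keys and function name are made up for illustration, and the sketch assumes all figures are expressed in the same unit.

```python
required_growth = 9  # upload growth requested for the Spring Festival
supported_growth = {"access": 5, "logic": 5, "compress": 1}


def modules_needing_flexibility(required, supported):
    """List the modules whose budgeted capacity falls short of the requirement."""
    return sorted(name for name, cap in supported.items() if cap < required)


# The module with the least headroom is the first bottleneck.
bottleneck = min(supported_growth, key=supported_growth.get)
short = modules_needing_flexibility(required_growth, supported_growth)
```

Every layer falls short of the requested growth, but the compression module has by far the least headroom, which is why the flexible measures start there.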

6. Flexible practice: compress flexibility

The Compress module compresses the original images uploaded by the client into the formats and sizes required by specific business scenarios, saving storage space and bandwidth. As compression technology has developed, more advanced formats have been adopted. For images of the same clarity, the higher the compression ratio, the more computing resources compression consumes.

So if we go the other way and replace the currently used HEVC format with JPEG, we can save compression resources. In measurements, the CPU load of Compress can be reduced to 20%, which means the module can support a 5-fold increase. However, the average image size also grows, which increases download traffic.

Therefore, the compromise adopted is to lower the image quality (clarity) setting from 70 to 50 when switching uploads back to JPEG, which reduces the average file size and offsets the extra traffic caused by JPEG. Tests show that users do not noticeably perceive the lower clarity, so enabling it briefly during holidays does not affect the user experience.
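
As a rough sketch, the switch can be modeled as choosing between two encoding profiles. The `EncodeProfile` type and `pick_profile` function are hypothetical, and treating 70 as the normal HEVC setting is an assumption; the article only states that the value goes from 70 to 50. The helper shows the arithmetic behind "20% load supports a 5-fold increase".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodeProfile:
    fmt: str
    quality: int


NORMAL = EncodeProfile("hevc", 70)   # assumed everyday profile
HOLIDAY = EncodeProfile("jpeg", 50)  # cheaper to encode, lower quality setting


def pick_profile(flexible_enabled):
    """Select the encoding profile for newly uploaded images."""
    return HOLIDAY if flexible_enabled else NORMAL


def growth_headroom(cpu_load_fraction):
    """If a module runs at this fraction of capacity, how many times can load grow?"""
    return 1.0 / cpu_load_fraction


profile = pick_profile(flexible_enabled=True)
headroom = growth_headroom(0.20)
```

The cheaper format buys CPU headroom, and the lower quality setting pays back the bandwidth that JPEG would otherwise cost.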

7. Flexible practice: flexible bit rate of small videos

The bandwidth of short videos usually exceeds 1TB and grows significantly during holidays. Traffic is reduced in a way similar to pictures: the bit rate of uploaded videos is lowered so that the smaller average file size saves bandwidth.

Flexible setting: small video bit rate 1800 -> 1200, average file size 2.1MB -> 1.3MB

Tests show that the lower bit rate has essentially no effect on the user experience. However, because it applies only to newly uploaded videos, there is a considerable delay before download bandwidth drops, and it takes about 4 hours to take full effect. This measure therefore has to be enabled before the holiday begins and cannot be used to handle emergencies.
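
The numbers above work out as follows. The reduction figures are computed from the values in the text; the linear ramp over 4 hours is purely an assumed model of how the saving phases in, since the article gives only the total delay.

```python
def reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


bitrate_cut = reduction(1800, 1200)  # one third lower bit rate
size_cut = reduction(2.1, 1.3)       # roughly 38% smaller average file


def effective_saving(hours_since_enabled, full_saving, ramp_hours=4.0):
    """Assumed linear ramp: the saving grows until old videos stop dominating."""
    progress = min(max(hours_since_enabled / ramp_hours, 0.0), 1.0)
    return full_saving * progress


after_two_hours = effective_saving(2.0, size_cut)
after_ten_hours = effective_saving(10.0, size_cut)
```

Whatever the real shape of the ramp, the delay is the reason this measure is switched on ahead of the peak rather than during it.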

Traffic changes during the period of bit rate reduction:

8. Flexible practice: Upload TSSD buffer pool flexibility

Because the preupload interface machines on the upload path and the logic modules behind them cannot support a 10-fold increase, two TSSD buffer pools are built into the architecture. A buffer pool temporarily stores newly uploaded files and supports both reads and writes. As shown in the figure above, buffer pool 1 is attached to the zone module and buffer pool 2 is attached to upload preupload. The two buffer pools serve different purposes:

When the zone module is overloaded, upload requests shed by overload protection are not failed directly but are written to buffer pool 1. Files in buffer pool 1 cannot be downloaded; they are forwarded at a slower rate and written to the backend modules. The main function of buffer pool 1 is therefore to smooth out a burst of upload requests over a short period rather than to absorb them entirely.
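
The behavior of buffer pool 1 resembles a queue that is drained slowly. The sketch below is a simplified model with hypothetical names (`ZoneWithBufferPool`, `tick`) and made-up capacities; it assumes that drained files count against the same per-tick backend capacity as fresh uploads.

```python
from collections import deque


class ZoneWithBufferPool:
    """Hypothetical zone module that parks overflow uploads instead of failing them."""

    def __init__(self, capacity_per_tick, drain_per_tick):
        self.capacity_per_tick = capacity_per_tick
        self.drain_per_tick = drain_per_tick
        self.buffer_pool_1 = deque()
        self.backend = []

    def tick(self, uploads):
        budget = self.capacity_per_tick
        # Forward a few parked files first, at a deliberately slow rate.
        drain = min(self.drain_per_tick, len(self.buffer_pool_1), budget)
        for _ in range(drain):
            self.backend.append(self.buffer_pool_1.popleft())
        budget -= drain
        for file_id in uploads:
            if budget > 0:
                self.backend.append(file_id)
                budget -= 1
            else:
                self.buffer_pool_1.append(file_id)  # overloaded: park, do not fail

    def downloadable(self, file_id):
        # Files still sitting in buffer pool 1 cannot be downloaded.
        return file_id in self.backend


zone = ZoneWithBufferPool(capacity_per_tick=3, drain_per_tick=1)
zone.tick(["a", "b", "c", "d", "e"])  # burst of five, capacity for three
parked = list(zone.buffer_pool_1)
d_visible_early = zone.downloadable("d")
zone.tick([])
zone.tick([])
```

The burst is spread over three ticks: nothing is rejected, but the parked files only become downloadable once they reach the backend.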

Buffer pool 2 is attached to the preupload module, which limits the number of write requests sent to the TFS storage. If upload requests exceed what TFS can handle, preupload writes them to buffer pool 2 instead. When a user downloads a file, the file identifier shows where it is stored; if the file is in buffer pool 2 rather than TFS, it is read from buffer pool 2. Buffer pool 2 can therefore stand in for TFS and protect the underlying module. When buffer pool 2 is taken out of service, the files in it must be written to TFS manually.
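
A minimal sketch of this write-limit-and-fallback pattern is shown below. The `Preupload` class, the `tfs:` and `buf2:` identifier prefixes and the simple write counter are all assumptions made for illustration; the article does not describe the real identifier format or how the limit is measured.

```python
class Preupload:
    """Hypothetical preupload module with a TFS write limit and a fallback pool."""

    def __init__(self, tfs_write_limit):
        self.tfs_write_limit = tfs_write_limit  # writes allowed in this window
        self.tfs_writes = 0
        self.tfs = {}
        self.buffer_pool_2 = {}

    def upload(self, name, data):
        if self.tfs_writes < self.tfs_write_limit:
            self.tfs_writes += 1
            self.tfs[name] = data
            return "tfs:" + name
        # TFS is at its limit: store in buffer pool 2 and mark the identifier.
        self.buffer_pool_2[name] = data
        return "buf2:" + name

    def download(self, file_id):
        store, name = file_id.split(":", 1)
        if store == "buf2" and name in self.buffer_pool_2:
            return self.buffer_pool_2[name]
        return self.tfs[name]  # also covers files migrated out of the pool

    def migrate_buffer_to_tfs(self):
        # Stand-in for the manual step before the pool is taken out of service.
        self.tfs.update(self.buffer_pool_2)
        self.buffer_pool_2.clear()


pre = Preupload(tfs_write_limit=2)
ids = [pre.upload(n, n.encode()) for n in ("a", "b", "c")]
from_pool = pre.download(ids[2])
pre.migrate_buffer_to_tfs()
after_migration = pre.download(ids[2])
```

Unlike buffer pool 1, files here stay downloadable the whole time, which is what lets the pool stand in for TFS.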

9. Flexible practice: proportional flexibility of the Moments timeline

Timeline refers to the timestamp of Moments updates. This measure works by caching the timestamps that notify users of their friends' Moments updates instead of sending them to the user's WeChat client right away. The updated content then does not appear in WeChat, no requests to download pictures or videos are generated, and download traffic is reduced directly.

After timeline flexibility is enabled, updates are not shown here:

But there are a few things to note:

  • 1) It easily leads to user complaints, because users notice that there is less content in their Moments;
  • 2) If timestamps have been cached for a long time, the cached notifications must be released very slowly, otherwise they cause a further surge in download traffic.
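
The hold-and-slow-release idea can be sketched as follows. The `TimelineGate` class, the 50% hold ratio and the batch size of 10 are hypothetical; a deterministic counter decides which notifications are held so that the proportion is easy to verify.

```python
from collections import deque


class TimelineGate:
    """Hypothetical gate that holds back a share of timeline update notifications."""

    def __init__(self, hold_percent, release_per_tick):
        self.hold_percent = hold_percent
        self.release_per_tick = release_per_tick
        self.held = deque()
        self.seen = 0

    def notify(self, user_id, timestamp):
        """Return True if the notification is delivered immediately."""
        n = self.seen
        self.seen += 1
        if n % 100 < self.hold_percent:
            self.held.append((user_id, timestamp))  # cached, not sent to the client
            return False
        return True

    def release(self):
        """Release a small batch so the backlog does not cause a download surge."""
        batch = []
        while self.held and len(batch) < self.release_per_tick:
            batch.append(self.held.popleft())
        return batch


gate = TimelineGate(hold_percent=50, release_per_tick=10)
delivered = sum(gate.notify(f"user-{i}", 1000 + i) for i in range(200))
first_batch = gate.release()
```

Keeping `release_per_tick` small is what prevents the backlog from turning into a second traffic peak when the measure is lifted.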
