Cooling down high concurrency: Meituan’s high-performance, high-reliability four-layer load balancing MGW optimization practice


[51CTO.com original article] Load balancing is one of the core technologies behind large-scale network access. Its job is to spread massive incoming traffic relatively evenly across N network nodes or servers that can all perform the same task, so that no single node is overloaded. As the entry point for application traffic, load balancing directly affects the performance and reliability of the application.

Recently, the 15th 51CTO Technology Salon, themed "Tech Neo", was held in Beijing. Wang Wei, product development manager of Meituan Cloud's load balancing gateway MGW, was invited to speak. He has many years of front-line experience implementing load balancing systems and has led the technical selection and performance optimization of MGW, Meituan's four-layer load balancer. This article walks through how MGW, which carries tens of Gbps of traffic and tens of millions of concurrent connections for Meituan Dianping, was optimized for high performance and high reliability.

The role and classification of load balancing

In the early days of the Internet, business logic was simple and traffic was small, so a single server could meet day-to-day needs. As the Internet grew, traffic exploded, business logic became more and more complex, and reliability requirements rose. Multiple servers then became necessary to overcome the performance and single-point-of-failure limits of one machine, scaling performance horizontally and providing disaster recovery. The question becomes: how can client traffic smoothly reach so many different servers?

One option is to use DNS as the load balancing mechanism, resolving the same domain name to different IP addresses so that client traffic reaches different application servers directly. However, this scheduling strategy is crude and cannot meet business needs well: when the scheduling plan changes, DNS caches at various levels do not take effect on clients in time, causing serious delays. A dedicated load balancer avoids this, as shown in the following figure:

As shown in the figure, the client traffic first reaches the load balancing server, and then the load balancing server uses some scheduling algorithms to distribute the traffic to different application servers. At the same time, the load balancing server performs periodic health checks on the application servers. Once a faulty node is found, it is directly removed to ensure high availability of the application.
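
To make the figure's description concrete, here is a minimal sketch in C (purely illustrative, not MGW code) of a scheduler that picks backends round-robin while skipping nodes that a periodic health check has marked as faulty; the structure and function names are invented for the example.

```c
/* Minimal sketch: round-robin scheduling that skips unhealthy backends.
 * Illustrative only -- not MGW's actual scheduler or data structures. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct backend {
    const char *ip;
    bool healthy;                 /* updated by a periodic health-check task */
};

/* Return the next healthy backend in round-robin order, or NULL if none. */
static struct backend *pick_backend(struct backend *pool, size_t n, size_t *cursor)
{
    for (size_t i = 0; i < n; i++) {
        struct backend *b = &pool[(*cursor + i) % n];
        if (b->healthy) {
            *cursor = (*cursor + i + 1) % n;
            return b;
        }
    }
    return NULL;                  /* every backend failed its health check */
}

int main(void)
{
    struct backend pool[] = {
        { "10.0.0.1", true }, { "10.0.0.2", false }, { "10.0.0.3", true },
    };
    size_t cursor = 0;
    for (int i = 0; i < 4; i++) {
        struct backend *b = pick_backend(pool, 3, &cursor);
        printf("request %d -> %s\n", i, b ? b->ip : "no backend available");
    }
    return 0;
}
```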

Load balancing falls into two categories, Layer 4 and Layer 7, as shown below:

The core of Layer 4 load balancing is forwarding: after receiving client traffic, it modifies the address information of the data packet and forwards it to the application server. The entire process is completed at the transport layer of the OSI model.

The core of Layer 7 load balancing is proxying: after receiving the client's traffic, it parses the application-layer traffic through a complete TCP/IP protocol stack, selects an appropriate application server based on the scheduling algorithm, establishes a connection to that server, and sends the request. The entire process is completed at the application layer of the OSI model.

Layer 4 load balancing forwarding mode

Forwarding is the main work of layer 4 load balancing. Currently, there are four main forwarding modes: DR mode, NAT mode, TUNNEL mode, and FULLNAT mode. The following is a comparison chart of the advantages and disadvantages of the four forwarding modes:

The figure shows the advantages and disadvantages of the four forwarding modes. DR mode (triangular transmission) works by modifying the destination MAC address of the packet; NAT mode works by modifying the destination IP address of the packet; TUNNEL mode works the same way as DR, but the application server must support tunneling.

Meituan Dianping ultimately chose FULLNAT as MGW's forwarding mode. It adds a source address translation (SNAT) on top of NAT mode. The advantage is that response traffic can return to the load balancer through normal Layer 3 routing, so the load balancer does not need to sit in the network as the servers' gateway, which lowers the requirements on the network environment. The disadvantage is that after SNAT the application server no longer sees the client's real IP address.

The following figure shows the specific implementation process of the FULLNAT forwarding mode:

As shown in the figure, a localip pool is deployed on the load balancer, and the SNAT source IP is taken from this pool. When client traffic reaches the load balancer, it selects an application server from the server pool according to the scheduling policy and rewrites the packet's destination IP to that server's IP; at the same time it picks a localip from the pool and rewrites the packet's source IP to that localip. When the application server responds, the destination IP of the response is the localip, which is a real IP address on the load balancer, so the response can reach the load balancer through normal Layer 3 routing.
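
The address rewriting described above can be modeled in a few lines of C. This is a structural sketch only (plain structs instead of real packets, with invented field and function names), meant to show which addresses FULLNAT touches in each direction.

```c
/* Simplified model of FULLNAT forwarding (illustrative only, not MGW code):
 * the request packet gets both DNAT (VIP -> RS IP) and SNAT (client IP -> localip),
 * so the RS replies to the localip and the response routes back to the load balancer. */
#include <stdint.h>
#include <stdio.h>

struct four_tuple {
    uint32_t sip, dip;
    uint16_t sport, dport;
};

/* Forward direction: client -> VIP becomes localip -> RS. */
static void fullnat_request(struct four_tuple *pkt,
                            uint32_t localip, uint16_t lport,
                            uint32_t rs_ip, uint16_t rs_port)
{
    pkt->sip = localip;  pkt->sport = lport;   /* SNAT: hide the client address  */
    pkt->dip = rs_ip;    pkt->dport = rs_port; /* DNAT: steer to the chosen RS   */
}

/* Reverse direction: RS -> localip becomes VIP -> client (restored from the session). */
static void fullnat_response(struct four_tuple *pkt,
                             uint32_t vip, uint16_t vport,
                             uint32_t cip, uint16_t cport)
{
    pkt->sip = vip;  pkt->sport = vport;
    pkt->dip = cip;  pkt->dport = cport;
}

int main(void)
{
    /* client 192.168.0.10:12345 -> VIP 10.4.0.1:80 */
    struct four_tuple pkt = { 0xC0A8000A, 0x0A040001, 12345, 80 };
    fullnat_request(&pkt, 0x0A010001, 40000, 0x0A020001, 8080);
    printf("to RS:     %08x:%u -> %08x:%u\n", pkt.sip, pkt.sport, pkt.dip, pkt.dport);

    /* the RS reply (addresses swapped) is rewritten back for the client */
    struct four_tuple rsp = { 0x0A020001, 0x0A010001, 8080, 40000 };
    fullnat_response(&rsp, 0x0A040001, 80, 0xC0A8000A, 12345);
    printf("to client: %08x:%u -> %08x:%u\n", rsp.sip, rsp.sport, rsp.dip, rsp.dport);
    return 0;
}
```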

Compared with NAT mode, FULLNAT performs one extra SNAT, which also involves a port-selection step, so its performance is slightly lower than NAT's; in exchange, it adapts much better to the network environment.

Reasons for using high-performance, high-reliability four-layer load balancing

Why did Meituan Dianping choose its own four-layer load balancing? There are two main reasons:

  • As business traffic increased, LVS hit performance bottlenecks and operation and maintenance costs rose.
  • With hardware load balancers, cost is the main barrier.

In terms of LVS, Meituan Dianping initially used LVS for Layer 4 load balancing and Nginx for Layer 7 load balancing. As shown in the figure above, with the rapid growth of traffic, LVS encountered performance bottlenecks and the failure rate increased.

In terms of hardware load balancing, three issues stand out: abnormally high hardware costs, labor costs, and time costs.

Based on these two reasons, Meituan-Dianping decided to develop its own four-layer load balancer, MGW, to replace LVS. MGW had to deliver both high performance and high reliability. So how were these achieved?

Four major problems and solutions for achieving high performance in MGW

Comparing against the performance bottlenecks of LVS gives an intuitive view of how MGW achieves high performance. This comes down to four major problems encountered by LVS: interrupts, an overly long protocol stack path, locks, and context switches. MGW's corresponding solutions are a PMD (poll-mode) driver, kernel bypass, a lock-free design, and CPU binding and isolation.

Interrupts

LVS is implemented in the kernel, and the kernel is designed to be universal, so it uses interrupts to sense traffic reception. Interrupts have a significant impact on LVS performance. If 5 million packets are processed per second, and a hardware interrupt is generated for every 5 packets, then 1 million hardware interrupts will be generated per second. Each hardware interrupt will interrupt the load balancing program that is performing intensive calculations, resulting in a large number of cache misses, which has a significant impact on performance.

Overly long protocol stack path

LVS is built on the kernel's netfilter, a hook framework that sits on the kernel protocol stack, so a packet has to traverse a long protocol-stack path before it reaches LVS. In reality, after receiving traffic the load balancer only needs to modify the MAC address, IP address, and port before sending the packet back out; the extra traversal adds needless overhead.

The solution to both problems, interrupts and the long protocol stack path, is to use the user-space PMD driver provided by DPDK and bypass the kernel. DPDK makes heavy use of hardware features such as NUMA awareness, memory channels, and DDIO, which greatly improves performance; it also provides network libraries that lower development difficulty and improve efficiency. These are the reasons Meituan Dianping chose DPDK as MGW's development framework.
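
For illustration, the following is a minimal sketch of a DPDK poll-mode receive/transmit loop, showing how a data-plane core busy-polls a NIC queue instead of taking interrupts and never hands packets to the kernel stack. It is not MGW's code: error handling, multi-queue/RSS setup, and the actual FULLNAT rewriting are omitted, and the constants are arbitrary.

```c
/* Minimal DPDK poll-mode loop: the data-plane core busy-polls the NIC queue
 * instead of waiting for interrupts, and packets never enter the kernel stack. */
#include <stdint.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define TX_RING_SIZE 1024
#define BURST_SIZE   32

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbuf_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    uint16_t port = 0;                        /* first NIC port; checks omitted */
    struct rte_eth_conf conf = { 0 };         /* default port configuration     */
    rte_eth_dev_configure(port, 1, 1, &conf);
    rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE, rte_socket_id(), NULL, pool);
    rte_eth_tx_queue_setup(port, 0, TX_RING_SIZE, rte_socket_id(), NULL);
    rte_eth_dev_start(port);

    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        /* Busy-poll: no interrupt, no context switch, no kernel protocol stack. */
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        if (n == 0)
            continue;
        /* ... a real load balancer would rewrite addresses (FULLNAT) here ...   */
        uint16_t sent = rte_eth_tx_burst(port, 0, bufs, n);
        for (uint16_t i = sent; i < n; i++)
            rte_pktmbuf_free(bufs[i]);        /* drop whatever the TX queue refused */
    }
    return 0;
}
```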

Lock

The kernel is general-purpose software, and its design adds locks to accommodate many kinds of hardware and workloads. Self-developed software, by contrast, only needs to be tailored to specific scenarios; it can drop that generality, remove the locks, and reach a fully lock-free state, which also makes later scaling easier.

Let's first look at where locks would be needed. RSS (Receive Side Scaling) is a NIC feature that hashes packets, based on their tuple information, into different NIC queues; different CPUs then read from their corresponding queues, making full use of CPU resources.

MGW uses FULLNAT mode, which rewrites all of the packet's tuple information. As a result, the request-direction and response-direction packets of the same connection may be hashed by RSS to different NIC queues and therefore processed by different CPUs. The session structure would then have to be protected with a lock.

To solve this problem, there are generally two solutions:

  • When selecting the SNAT port, choose an lport0 such that RSS(cip0, cport0, vip0, vport0) = RSS(dip0, dport0, lip0, lport0), so both directions hash to the same queue.
  • Assign a localip to each CPU. When selecting the SNAT IP, each CPU uses its own localip; when the response comes back, packets with that destination IP are sent to the corresponding queue via the localip-to-CPU mapping.

Since the second method is supported by the NIC's flow director feature, MGW chose it to remove the lock on the session structure.

Flow director can steer specified packets to a specified NIC queue according to configured rules, and its priority in the NIC is higher than RSS. Therefore a localip is assigned to each CPU during initialization:

  • For example, lip0 is assigned to cpu0, lip1 to cpu1, lip2 to cpu2, and lip3 to cpu3. When a request packet (cip0, cport0, vip0, vport0) reaches the load balancer, it is hashed to queue 0 by RSS and processed by cpu0.
  • When cpu0 performs FULLNAT on it, it selects its own localip lip0 and sends the packet (lip0, lport0, dip0, dport0) to the application server. After the application server responds, the response packet (dip0, dport0, lip0, lport0) is sent back to the load balancer.

In this way, a rule can be added to the flow director to send packets with destination IP lip0 to queue 0, so that the reply packets will be sent to queue 0 for processing by cpu0. At this time, when the CPU processes packets in both directions of the same connection, it is a completely serial operation, and there is no need to lock the session structure for protection.
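
A simplified model of this lock-free arrangement is sketched below in plain C: each worker CPU owns one localip for SNAT, and the response direction is steered back by destination localip rather than by RSS. The table lookup here only simulates what the NIC's flow director rules do in hardware; the names and addresses are invented for the example.

```c
/* Simplified model of the lock-free design: each worker CPU owns one localip,
 * and response packets are steered back by destination localip instead of RSS.
 * (Illustrative only; the real steering is done by NIC flow director rules.) */
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 4

/* localip assigned to each data-plane CPU at initialization, e.g. lip0 -> cpu0. */
static const uint32_t cpu_localip[NR_CPUS] = {
    0x0A010001, 0x0A010002, 0x0A010003, 0x0A010004   /* 10.1.0.1 .. 10.1.0.4 */
};

/* SNAT source selection: a CPU always uses its own localip, so no lock is needed. */
static uint32_t pick_localip(unsigned cpu) { return cpu_localip[cpu]; }

/* "Flow director" lookup for the response direction: the destination IP of the
 * reply is a localip, which maps 1:1 back to the queue/CPU that owns the session. */
static int queue_for_response(uint32_t dst_ip)
{
    for (unsigned cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_localip[cpu] == dst_ip)
            return (int)cpu;
    return -1;  /* not a localip: fall back to normal RSS queue selection */
}

int main(void)
{
    unsigned cpu = 2;                          /* request was RSS-hashed to cpu2 */
    uint32_t lip = pick_localip(cpu);          /* SNAT uses cpu2's own lip2      */
    printf("request handled on cpu%u, SNAT src = 0x%08x\n", cpu, lip);
    printf("response to 0x%08x steered to queue %d\n", lip, queue_for_response(lip));
    return 0;
}
```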

Context Switching

To address context switches, MGW groups the CPUs to completely separate the control plane from the data plane, which prevents the data plane from being disturbed while it is processing packets, as shown in the following figure:

At the same time, the data-plane CPUs are isolated so that control-plane processes are never scheduled onto them, and each data-plane thread is bound to its own CPU so that it occupies that CPU exclusively. All control-plane programs, such as Linux kernel tasks and SSH, run on the control-plane CPUs.
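
As a rough illustration of CPU binding (DPDK's EAL normally pins its worker cores itself, and CPU isolation is typically achieved with the isolcpus= kernel boot parameter), the sketch below pins a data-plane thread to one core with pthread_setaffinity_np; the core number is arbitrary.

```c
/* Sketch: bind a data-plane thread to one dedicated CPU so the scheduler never
 * moves it, while control-plane work stays on other (non-isolated) CPUs. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *data_plane_loop(void *arg)
{
    (void)arg;
    for (;;) {
        /* busy-poll NIC queues here; nothing else runs on this core */
    }
    return NULL;
}

int main(void)
{
    pthread_t worker;
    pthread_create(&worker, NULL, data_plane_loop, NULL);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                       /* pin the worker to CPU 2 (example) */
    if (pthread_setaffinity_np(worker, sizeof(set), &set) != 0)
        fprintf(stderr, "failed to set CPU affinity\n");

    pthread_join(worker, NULL);             /* never returns in this sketch */
    return 0;
}
```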

How MGW achieves high reliability

Here we mainly cover how high reliability is achieved at three levels: the MGW cluster, a single MGW machine, and the application servers.

MGW Cluster

Cluster-level high reliability mainly involves three parts: session synchronization, failover, and fault recovery and capacity expansion. As shown in the figure below, MGW uses ECMP + OSPF to form a cluster:

ECMP hashes packets across the nodes in the cluster, while OSPF ensures that when a machine fails its route is withdrawn dynamically, so ECMP no longer distributes traffic to it. This provides dynamic failover.

Session synchronization

Traditional ECMP has a serious algorithmic problem: when the number of nodes in the load balancing cluster changes, the paths of most flows change. When redirected traffic arrives at another MGW node, that node cannot find the corresponding session structure, so large numbers of connections break, which seriously affects the business. During a cluster upgrade, every node is taken offline in turn, which makes the problem worse.

The traditional solution is to use a switch that supports consistent hashing, as shown in the following figure.

With consistent hashing on the switch, when a node changes, only the connections on that node are affected and the rest stay normal. However, few switches support this algorithm, and it still does not deliver full high availability, so session synchronization within the cluster is needed, as shown in the following figure:

Each node in the cluster fully synchronizes its own sessions to the others, so every node maintains a global session table. Then, no matter how traffic paths shift when nodes change, the traffic can always find its session structure.
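
As a rough illustration, a synchronized session entry presumably has to carry the full FULLNAT mapping so that any node can forward either direction of the connection. The layout below is an assumption made for illustration, not MGW's actual session structure or wire format.

```c
/* Assumed shape of a FULLNAT session entry broadcast to every cluster node:
 * with the full client / VIP / localip / RS mapping, any MGW node that receives
 * either direction of the connection can forward it correctly.
 * (Field layout is an illustration only.) */
#include <stdint.h>
#include <stdio.h>

struct mgw_session {
    uint32_t cip;   uint16_t cport;   /* client side, as seen on the request  */
    uint32_t vip;   uint16_t vport;   /* virtual service the client hit       */
    uint32_t lip;   uint16_t lport;   /* localip/port chosen during SNAT      */
    uint32_t rsip;  uint16_t rsport;  /* real server chosen by the scheduler  */
    uint8_t  proto;                   /* TCP / UDP                            */
    uint8_t  state;                   /* e.g. established, closing            */
    uint32_t expire;                  /* timeout for garbage collection       */
};

int main(void)
{
    printf("session entry size: %zu bytes\n", sizeof(struct mgw_session));
    return 0;
}
```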

Next, failover and fault recovery/capacity expansion are the two major issues to consider in cluster session synchronization.

Failover

For failover, when a machine fails, the switch must immediately switch traffic to the other machines to avoid massive packet loss. After investigation and testing, MGW adopted the approach shown in the figure below, which achieves zero packet loss during upgrades and zero packet loss when the main program fails. Other failures (such as a broken network cable) cause at most 500 ms of packet loss, because such failures have to be detected by the self-check program, whose cycle is 500 ms.

It specifically includes the following four aspects:

The switch side uses only physical interfaces, not virtual ones, so when the server powers off its port, the switch instantly switches traffic to the other machines. Tests that send two packets every 100 ms (one each from the client and the server) show that this approach achieves zero packet loss.

The machine performs a health self-check every half second. Failover mainly relies on the switch's perception; if the server has a fault that the switch cannot perceive, the switch cannot fail over. A health self-check program is therefore needed: it runs every half second, and when it finds a fault on the server it powers off the server's network port, cutting off traffic immediately.

When a fault is detected, the network port is powered off automatically. Failover depends on this power-off operation, but the NIC driver runs inside the main program, so once the main program dies the port can no longer be powered off from there.

The main process captures fatal signals. To solve the problem that the port cannot be powered off after the main program crashes, the main process catches abnormal signals, powers off the NIC when one is raised, and then lets the signal continue to the system for normal processing.
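
The signal-capture behavior can be sketched as follows. nic_port_power_off() is a hypothetical stand-in for the driver call that powers off the physical port; the rest shows the pattern of catching a fatal signal, acting on it, then restoring the default handler and re-raising so the system still processes the signal normally.

```c
/* Sketch of "catch fatal signals, power off the NIC port, then let the default
 * handler run". nic_port_power_off() is hypothetical, not a real driver API. */
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void nic_port_power_off(void)
{
    /* hypothetical: tell the NIC driver to power off the port so the switch
     * instantly steers traffic to the other cluster members */
    static const char msg[] = "powering off NIC port before dying\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
}

static void fatal_signal_handler(int sig)
{
    nic_port_power_off();

    /* restore the default action and re-raise, so the signal is still
     * delivered to the system for normal processing (core dump etc.) */
    signal(sig, SIG_DFL);
    raise(sig);
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = fatal_signal_handler;
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGABRT, &sa, NULL);
    sigaction(SIGBUS,  &sa, NULL);

    /* ... the main forwarding loop would run here ... */
    abort();                    /* simulate a crash to exercise the handler */
}
```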

Fault recovery and capacity expansion

When the system performs fault recovery or capacity expansion, the number of cluster nodes changes, which in turn changes traffic paths.

When redirected traffic reaches an existing node, that node holds the global session table and can forward it normally; but when traffic reaches a newly added machine that does not yet have the global session table, that traffic is dropped.

To address this issue, a newly added MGW node first enters a pre-online intermediate state. In this state it does not announce itself to the switch, so the switch does not send it any traffic.

The process works as follows: the new node first sends a batch synchronization request to the other nodes in the cluster; after receiving the request, the other nodes synchronize their full session tables to it. Only after receiving all the sessions does the new node announce itself to the switch, at which point the switch cuts traffic over to it and forwarding proceeds normally.

Two points are worth noting here:

First, since there is no master node in the cluster to maintain global state, if the synchronization request or the session-sync data is lost, the newly added node cannot build a complete global session table. However, because all nodes maintain a global session table, they all hold roughly the same number of sessions. Each node therefore sends a finish message at the end of its batch synchronization, carrying its session count. When the new node receives the finish messages, it compares its own session count with the reported counts; only when the requirement is met does it bring itself online. Otherwise, after a timeout, it performs batch synchronization again until the requirement is met.

Second, new connections that appear during batch synchronization are not covered by batch sync, so they never reach the new machine that way; if there are many new connections, the new machine may never meet the count requirement. A machine in the pre-online state must therefore also receive incremental synchronization data, since new connections can be propagated through incremental sync. With incremental plus batch synchronization, the newly added machine is guaranteed to eventually obtain the full global session table.
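
The "finish message" check can be sketched roughly as below. The function names and the 95% acceptance threshold are illustrative assumptions; the article only says the new node compares its own session count against the count carried in the finish message and retries batch synchronization after a timeout if the requirement is not met.

```c
/* Sketch of the pre-online check: after batch synchronization, a node only
 * announces itself to the switch once its own session count is close enough
 * to the count reported in a peer's "finish" message. Threshold is invented. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SYNC_OK_RATIO 0.95   /* tolerate a small loss of sync packets (assumed) */

/* Called when a finish message arrives from a peer that completed batch sync. */
static bool ready_to_go_online(uint64_t local_sessions, uint64_t peer_sessions)
{
    if (peer_sessions == 0)
        return true;                         /* empty cluster: nothing to sync */
    return (double)local_sessions >= SYNC_OK_RATIO * (double)peer_sessions;
}

int main(void)
{
    /* example: a peer reports 1,000,000 sessions, we received 998,500 of them */
    if (ready_to_go_online(998500, 1000000))
        printf("announce route to the switch (go online)\n");
    else
        printf("wait for timeout and request batch sync again\n");
    return 0;
}
```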

Single-machine MGW

Automated testing platform

For single-machine reliability, MGW built an automated testing platform that judges whether a test case passes based on connectivity and configuration correctness, and notifies testers of failed cases by email. After each new feature iteration, the feature's test cases are added to the platform, so an automated regression runs before every release, largely avoiding problems introduced by changes.

Previously, manual regression testing was required before every release; it was time-consuming and easy to miss cases, yet indispensable for avoiding regressions. The automated testing platform improves both the efficiency and the reliability of regression testing.

Application Server

For application server reliability, MGW provides two major functions: smooth node offlining and a consistent source-IP hash scheduler.

Smooth node offlining

Smooth node offlining solves the following problem: when users need to upgrade an RS (real server), taking it offline directly breaks every connection on it and hurts the business. With MGW's smooth-offline function, existing connections on that RS keep working normally, but no new connections are scheduled to it.

When all existing connections have ended, MGW reports a finished status; the user can then upgrade the RS and call the online interface to put it back into service. If the user's platform supports automated application deployment, it can invoke the smooth-offline function through the cloud platform to achieve a fully automated upgrade with no impact on the business.

Consistent Source IP Hash Scheduler

The source-IP hash scheduler ensures that connections from the same client are scheduled to the same application server, that is, a one-to-one mapping is established between client and application server. With an ordinary source-IP hash scheduler, the mapping changes whenever the application server set changes, which affects the business.

The consistent source IP Hash scheduler ensures that when the application server cluster changes, only the mapping relationship between the changed application server and the client changes, and the others remain unchanged.

To keep traffic balanced, a fixed number of virtual nodes are first allocated on the hash ring, and the virtual nodes are then evenly distributed among the physical nodes. The distribution algorithm must ensure the following (see the sketch after this list):

  1. When the physical nodes change, only a small number of virtual-node mappings change; this is the basic guarantee of consistent hashing.
  2. Because MGW runs as a cluster, when multiple application servers go online or offline, different MGW nodes may learn of these events in different orders. Whatever the order, the final mapping must be identical on every node; otherwise the same client's connections would be scheduled to different application servers by different MGW nodes, defeating the purpose of the source-IP hash scheduler.
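
One way to satisfy both requirements is sketched below: a fixed number of virtual slots, with each slot assigned to a real server by highest-random-weight (rendezvous) hashing. Because the assignment depends only on the current server set, every MGW node converges to the same mapping regardless of the order in which it learns of changes, and removing a server only remaps the slots that server owned. This is an illustrative technique choice, not necessarily MGW's actual algorithm.

```c
/* Sketch: consistent source-IP hashing over a fixed number of virtual slots,
 * using rendezvous (highest-random-weight) hashing to assign slots to servers. */
#include <stdint.h>
#include <stdio.h>

#define VSLOTS 256           /* fixed number of virtual nodes on the ring */

/* 64-bit mixer (splitmix64 finalizer) used as a stand-in hash function. */
static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Pick the owner of one virtual slot from the current RS set: the result
 * depends only on the set itself, not on the order servers were added. */
static int slot_owner(int slot, const uint32_t *rs_ips, int n_rs)
{
    uint64_t best = 0;
    int owner = -1;
    for (int i = 0; i < n_rs; i++) {
        uint64_t w = mix64(((uint64_t)rs_ips[i] << 16) | (uint64_t)slot);
        if (owner < 0 || w > best) { best = w; owner = i; }
    }
    return owner;
}

/* Schedule a client: source IP -> virtual slot -> real server. */
static int schedule(uint32_t client_ip, const uint32_t *rs_ips, int n_rs)
{
    int slot = (int)(mix64(client_ip) % VSLOTS);
    return slot_owner(slot, rs_ips, n_rs);
}

int main(void)
{
    uint32_t rs[] = { 0x0A000001, 0x0A000002, 0x0A000003 };   /* 10.0.0.1..3   */
    uint32_t client = 0xC0A80164;                             /* 192.168.1.100 */

    int before = schedule(client, rs, 3);
    int after  = schedule(client, rs, 2);   /* 10.0.0.3 taken offline */
    printf("client -> rs[%d] with 3 servers, rs[%d] with 2 servers\n", before, after);
    return 0;
}
```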

Technology Outlook

In the future, Meituan Dianping will focus on three areas: upgrade automation, centralized configuration management, and lower operation and maintenance costs.

Upgrade automation. The original purpose of self-developing MGW was to solve LVS's performance problems. Once those were solved, the rapid growth of the business meant the number of IDCs and load balancing clusters kept increasing; for example, launching a new IDC generally means launching two clusters, one for the IDC entrance and one for internal business.

This raises a bigger problem: rolling out a new feature across all clusters takes a very long time, so automated upgrades are needed, along with better monitoring to catch anomalies.

Centralized configuration management. Currently, business configuration has been centralized, and we hope that machine configuration can also be centralized in the future.

Lower operation and maintenance costs. The fundamental purpose of achieving automated upgrades and centralized configuration is to reduce operation and maintenance costs.

【Guest Introduction】


Wang Wei, a technical expert at Meituan Cloud, joined Meituan Cloud in 2015 and is currently the product development manager for Meituan Cloud's load balancing gateway MGW. With many years of frontline experience in load balancing system implementation, he has led and promoted the technical selection and performance optimization of MGW.


[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
