The technical secrets behind Alipay's Double 11 sales over the years

Since the launch of the Double 11 e-commerce festival, the career trajectories of many technical people have changed. This classic case of annual high concurrency, high traffic, and complex business scenarios has posed challenge after challenge to engineers and product people. Today we will look back at the development history of Alipay on Double 11.

Just like in the past 10 years, Tmall's Double 11 in 2019 set a new record. Behind that number are generations of Alipay engineers who have worked hard and kept overcoming technical difficulties. Today, colleagues at Alipay provided MacTalk with the documentary "One Heart, One Battle", a record of 11 technical colleagues who lived through Double 11, telling the behind-the-scenes story of Alipay's technological evolution. Many details are disclosed for the first time. I was excited watching it and recommend it to everyone.

For engineers, keeping the system stable and smooth around the clock during Double 11 is certainly not easy, but the most challenging moment comes just after midnight, when people pick up their phones, refresh the shopping carts they filled in advance, and tap to pay.

Over these 11 years, what little-known technical explorations lie behind the increasingly smooth experience of Double 11 shopping at midnight? How did Alipay manage to support such a huge volume of transactions? Let's take a look at this first-published written account.

1. Starting from the external bottleneck

Things didn't seem to be going well from the beginning.

During the peak of the 2011 Double 11 shopping festival, a small number of users were unable to pay. Investigation showed that the online banking systems of a few banks had failed under the pressure. In the early days of Alipay, after clicking the pay button users paid through the interface between Alipay and the bank. In those years the performance of this interface was poor, supporting only dozens to hundreds of transactions per second, and its stability was also weak: once traffic rose, it was prone to failure.

If this problem were not solved, payments would fail during every future major promotion, badly hurting the user experience. Yet it was hard to solve with technology alone: banks have their own roadmaps for evolving their online banking systems, and Alipay cannot interfere with them.

However, the operations team came up with a clever workaround. For Double 11 in 2012, Alipay ran a campaign encouraging users to top up first and pay later: users were asked to top up their Alipay balance in advance, and on Double 11 the amount was deducted directly from that balance. In this way the external bottleneck was turned into an internal one. The effect was very significant, and payment failures were greatly alleviated.

Still, the external bottleneck remained. With traffic peaks doubling every year, payment's external dependencies were a hidden danger that could blow up at any time.

The best way to solve this is to let funds circulate inside the internal system without going through online banking, which is exactly the principle behind topping up first and paying later. So, is there a way to encourage users to keep their money in Alipay? In June 2013, Alipay launched Yu'ebao, which incidentally solved this problem. By the end of 2014, Yu'ebao had attracted 185 million users. On Double 11 in 2013 and 2014, the transaction peaks grew 4x and 3x respectively.

In May 2018, Alipay connected to the China UnionPay clearing platform. Meanwhile, banks have also invested heavily in their own systems in recent years; the online banking systems of medium and large banks now support more than 20,000 transactions per second, and the external problem has been largely resolved.

With the external bottleneck out of the way, how high the payment peak can go depends on how Alipay's own systems handle traffic peaks that grow fiercer year by year.

2. Capacity planning: Food and fodder are ready before the troops move

In fact, the first problem in supporting peak transaction volumes is not designing an architecture that scales out perfectly, but accurately estimating the likely traffic peak and then arranging the corresponding machines and resources. Without an estimate, two things can happen: too many resources are prepared and the architecture is over-designed, wasting resources; or too few are prepared, the promotion cannot be fully supported, and some payments queue up or fail. In the run-up to every Double 11, the decision-making team responsible for the promotion sets a target transaction figure based on historical data and promotion goals, then breaks that figure down into the traffic each system must handle, which is the basis for capacity planning.

The scenario indicators for a Double 11 promotion generally include the number of orders created, the number of cashier page views, and the number of payments. With the total payment target known, operations staff derive each application's machine requirements from the simple formula of total TPS divided by single-machine TPS, where an application's single-machine TPS on each scenario link is estimated from historical activity data.
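
As a rough illustration of that formula (not Alipay's actual tooling; the numbers and headroom factor below are invented), the per-application machine count can be estimated like this:

```python
# Minimal capacity-estimation sketch, assuming hypothetical numbers.
# machines_needed = ceil(total_tps_for_app / single_machine_tps), with headroom.
import math

def machines_needed(total_tps: float, single_machine_tps: float, headroom: float = 0.8) -> int:
    """Estimate machines for one application on one scenario link.

    `headroom` keeps each machine below 100% of its measured single-machine TPS.
    """
    return math.ceil(total_tps / (single_machine_tps * headroom))

# Hypothetical example: a payment link must absorb 300,000 TPS in total,
# and historical stress data shows one machine of this app sustains 500 TPS.
print(machines_needed(300_000, 500))  # -> 750 machines
```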

However, this approach involves a lot of manual work, and the capacity estimate for each application is relatively coarse-grained. Alipay later built a capacity analysis platform that performs automated, fine-grained capacity analysis.

The principle is this: if a link is understood as a business, then the root node of the link is the business's source traffic, and for every node on the link (applications, DBs, Tair caches, and so on) we can compute the coefficient of that node's call count relative to the root node's traffic. Once the QPS at the business source is fixed, the QPS of every node can be derived from the link data.
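
A minimal sketch of that idea, with invented link names and coefficients: each node's expected QPS is the source QPS multiplied by the node's call coefficient relative to the root.

```python
# Sketch of link-based capacity analysis (the link and coefficients are illustrative only).
# Each node's expected QPS = root (source) QPS x its call coefficient,
# where the coefficient is measured as node_calls / root_requests on that link.

link_coefficients = {          # hypothetical "create payment" link
    "gateway":        1.0,     # every source request hits the gateway once
    "payment-core":   1.2,     # some requests retry, so more than 1 call per request
    "account-db":     2.0,     # debit + credit per payment
    "tair-cache":     3.5,     # several cache lookups per payment
}

def node_qps(root_qps: float) -> dict:
    return {node: root_qps * k for node, k in link_coefficients.items()}

print(node_qps(100_000))  # expected QPS per node when the source runs at 100k QPS
```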

For Double 11 in 2018, Alipay also built an intelligent capacity model that not only estimates capacity from business traffic but also generates application resource deployment plans automatically, so that under a given plan the capacity level of each deployment unit stays within the target range when carrying the given business traffic.

The intelligent capacity model is part of Alipay's exploration of AIOps and of applying data technology and artificial intelligence to its systems, which remains one of the current directions of Alipay's technical exploration.

3. LDC and flexible architecture: the most powerful weapon for big promotions

After estimating the traffic and making reasonable capacity planning, the next step is to see whether our architecture can support the traffic peak.

First, it should be noted that a traffic peak touches every part of the system. The Alipay system as a whole is extremely complex, serving many businesses on both the consumer (toC) and business (toB) sides. Even if we focus only on the core payment system, it still includes subsystems such as payment clearing, accounting, and ledger keeping.

Some components of the system are supported by general-purpose middleware, such as the load-balancing middleware LVS/Spanner and Alibaba's distributed cache middleware Tair, while the rest is supported by SOFAStack, Alipay's self-developed financial-grade distributed middleware.

In essence, a payment peak is a high-concurrency problem. Internet companies handle high concurrency by scaling out and splitting horizontally, using a distributed approach to absorb traffic peaks, and Alipay is no exception. Alipay completed service-oriented decomposition and horizontal splitting of its core databases very early, and this carried it through the earlier Double Elevens.

Distributed system diagram

The problem with this architecture is that every sub-application needs connections to every database shard, and database connections are limited. At the time, the connections of mainstream commercial databases could not be shared, meaning a transaction had to hold a connection exclusively. Connections are a precious database resource and cannot be increased indefinitely. Alipay was reaching the point where the application cluster could not be expanded: every additional machine required new connections to every database shard, and the connection counts of several core databases had already hit their ceilings. If the application tier could not grow, Alipay's capacity was frozen and no further business growth was possible. Never mind the big promotion; before long even daily business might not be supportable.
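
To see why scaling stalled, here is a back-of-the-envelope sketch (all numbers are invented): the connections a database shard must hold grow with the number of application instances, while the shard's connection ceiling is fixed.

```python
# Back-of-the-envelope sketch of the connection-explosion problem (numbers invented).
# Every application instance keeps a small pool of connections to EVERY database shard,
# so each shard sees app_instances * pool_size connections in total.

def connections_per_shard(app_instances: int, pool_size: int) -> int:
    return app_instances * pool_size

DB_CONNECTION_LIMIT = 10_000   # hypothetical per-shard ceiling

for instances in (500, 1_000, 2_000):
    used = connections_per_shard(instances, pool_size=10)
    print(instances, used, "OK" if used <= DB_CONNECTION_LIMIT else "over the limit")
# Beyond roughly 1,000 instances the shard's connection limit is exceeded,
# so the application tier can no longer be scaled out.
```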

The problem was pressing. Starting in 2013, Alipay began a new round of architectural transformation and implemented the unitized LDC (Logical Data Center) architecture, which ultimately withstood the Double 11 traffic peaks.

A unit is a scaled-down but fully functional copy of the entire site. It is complete in the sense that every application is deployed in it, yet it is not full-scale because it operates on only a portion of the data. As long as the data is partitioned and more units are added, the processing ceiling of the whole system can be raised.
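
A minimal sketch of the unit idea, under the common assumption that data is partitioned by a user-ID shard (the shard rule and unit count here are invented): each unit owns a slice of the shards and runs the full application stack, so adding units raises the whole site's ceiling.

```python
# Unitization sketch (shard rule and unit count are assumptions for illustration).
# Data is split into 100 shards by the last two digits of the user ID;
# each unit owns a contiguous range of shards and runs the full application stack.

UNITS = {                       # hypothetical deployment
    "RZ00": range(0, 25),
    "RZ01": range(25, 50),
    "RZ02": range(50, 75),
    "RZ03": range(75, 100),
}

def route_to_unit(user_id: int) -> str:
    shard = user_id % 100       # 1% of users per shard
    for unit, shards in UNITS.items():
        if shard in shards:
            return unit
    raise ValueError("unrouted shard")

print(route_to_unit(1234567889))  # shard 89 -> "RZ03"
# Adding a fifth unit and re-assigning shard ranges raises total capacity
# without any single component having to handle all of the traffic.
```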

Unitization diagram

However, not all data can be split. Some underlying data is global and must be accessible to applications in every unit, and after nearly ten years of construction, parts of Alipay's architecture simply could not be unitized. Under these constraints, Alipay designed the CRG unitized architecture, which keeps the advantages of unitization while accommodating the existing architecture.

1. RZone (Region Zone): the zone that best matches the theoretical definition of a unit. Each RZone is self-contained, holds its own data, and can complete all services on its own.

2. GZone (Global Zone): deploys data and services that cannot be split and that RZones may depend on. There is only one GZone group globally, holding a single copy of the data.

3. CZone (City Zone): also deploys data and services that cannot be split and that RZones depend on. Unlike GZone, the data and services in a CZone are accessed frequently by RZones, with every transaction touching them at least once, whereas RZones access GZone far less often. CZone exists specifically to solve the problem of cross-city access latency.

CRG architecture diagram

For more information about Alipay's unitization and LDC, see this article.

With LDC in place, system capacity could be expanded horizontally, successfully supporting the Double 11 traffic peaks from 2013 onward. The system was no longer constrained by single points of failure, and with further improvements it achieved multi-site active-active deployment, ultimately forming a financial-grade architecture with three sites and five centers.

In theory, as long as LDC's computing resources are expanded without limit, unlimited traffic can be handled. But doing so means most machines are used only during major promotions and sit idle the rest of the time, wasting resources. The ideal is to support everyday traffic with a modest amount of resources, and during major promotions, after capacity planning, bring idle or third-party resources online in advance to handle the peak. This is where the elastic architecture comes from.

In 2016, Alipay began its elastic-architecture transformation for major promotions. The elastic architecture is organized around business links: since only some links see a traffic surge during a promotion, only the key promotion links need to be expanded elastically.

The transformation spans several layers, starting with elastic data centers and elastic units. The LDC logical data center architecture is sliced further along the business dimension, so that a single slice of business can be deployed in an independent logical unit, stay connected to the non-elastic units, and be scaled out and reclaimed at any time.

The second layer is elastic storage, covering both flow-type data and state-type data. Flow-type data, such as payment orders, is made elastic by introducing elastic bits and elastic UIDs; the router then assigns orders to elastic units based on the elastic UID. State-type storage, such as user account balances, is scaled out as a whole: at the DB layer, traffic is switched between primary and replica databases to divert pressure from the primary onto the replicas.
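
The description above is sparse, but one plausible reading is sketched below (the ID layout and flag position are assumptions, not Alipay's actual format): an elastic marker embedded in the order's routing ID tells the router to send the order to an elastic unit instead of a regular one.

```python
# Hedged sketch of elastic routing for flow-type data (ID layout is an assumption).
# The idea: during a big promotion, newly generated order routing IDs carry an
# "elastic" marker, and the router sends those orders to elastic units.

ELASTIC_FLAG = "E"

def make_routing_id(order_seq: int, shard: int, elastic: bool) -> str:
    # e.g. "E-42-0000098765": elastic flag, data shard, order sequence number
    flag = ELASTIC_FLAG if elastic else "N"
    return f"{flag}-{shard:02d}-{order_seq:010d}"

def route(routing_id: str) -> str:
    flag, shard, _ = routing_id.split("-")
    if flag == ELASTIC_FLAG:
        return f"elastic-unit-for-shard-{shard}"   # capacity added just for the promotion
    return f"regular-unit-for-shard-{shard}"

rid = make_routing_id(98765, shard=42, elastic=True)
print(rid, "->", route(rid))
```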

Then comes the middleware layer, including routing, RPC, message queues, traffic management, and so on. The application layer also needs corresponding changes: since each elastic unit is deployed as an independent logical unit, services and data have to be sorted out and separated, and elastic logic such as elastic-ID handling has to be added.

In addition to these, the operation and maintenance platform and stress testing tools also need to be modified accordingly.

After going live in 2016, the elastic architecture successfully supported that year's Double 11, met the promotion requirements and the predetermined goals, saved physical data center resources, and became the most powerful weapon for handling promotion traffic peaks.

The elastic units in the elastic architecture are all newly added clusters, but resource utilization can be pushed further still. The method is offline/online co-location: some clusters are used for offline big-data analysis, but they do not run at full capacity around the clock, and their utilization is extremely low when no tasks are running. If offline workloads and online business applications are deployed together, those resources can be borrowed during the promotion peak, reducing the resources that have to be purchased for the promotion and cutting costs further. Co-location relies on time-sharing scheduling by the operations platform, which allocates resources to different applications in different time windows.
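
A toy sketch of time-sharing scheduling, purely illustrative (the time windows and quotas are invented): the same co-located cluster lends most of its capacity to online payment applications during the promotion peak and returns it to offline analysis jobs afterwards.

```python
# Toy time-sharing schedule for a co-located cluster (windows and quotas are invented).
# During the promotion peak the cluster's capacity is lent to online payment apps;
# outside the peak it goes back to offline big-data jobs.

SCHEDULE = [
    # (start_hour, end_hour, online_share, offline_share)
    (0, 2,  0.9, 0.1),   # Double 11 midnight peak: almost everything goes to online apps
    (2, 9,  0.3, 0.7),   # early morning: offline batch jobs take over
    (9, 24, 0.5, 0.5),   # daytime: balanced
]

def quota_at(hour: int) -> tuple:
    for start, end, online, offline in SCHEDULE:
        if start <= hour < end:
            return online, offline
    raise ValueError("hour out of range")

print(quota_at(0))   # (0.9, 0.1) at the moment the promotion starts
print(quota_at(5))   # (0.3, 0.7) when offline analysis runs
```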

Since 2017, Alipay has been experimenting with offline/online co-location and time-sharing scheduling, borrowing the clusters used by offline workloads during major promotions and greatly improving cluster resource utilization.

4. Million Payments: Solving the Database Expansion Bottleneck

During Double 11 in 2016, payments peaked at 120,000 transactions per second, and the battle against high concurrency continued. We have covered many of the techniques used for big promotions, but we have skipped one of the most important pieces: the database. At peak traffic, the database is under the greatest pressure, because what looks like a single successful transaction at the front end may, once broken down, generate hundreds or even thousands of requests, so the pressure on the database is far greater than the numbers we can see.

From the very beginning, the database has been one of the bottlenecks of the Alipay system. Many database upgrades had already been made alongside the architectural transformations. Beyond the elastic transformation described above, the following upgrades were also made:

1. The original combined transaction-and-account database was separated into a transaction database and an account database, with distributed transactions solving the data consistency problem.

2. The database was split horizontally: all users were divided into 100 shards at a granularity of 1%, combined with unitized logical isolation.

3. Read-write separation, multi-point writes, and data replication greatly improved database performance.

There were limits to how far the commercial database Alipay used in its early years could be improved, and for cost reasons it was out of the question to buy extra database licenses and hardware for a promotion that lasts only a few days each year.

As early as Double 11 in 2014, Alipay's self-developed database OceanBase began to carry 10% of the core transaction traffic, and it gradually took over 100% of the traffic of core systems such as transactions, payments, and accounting, withstanding rigorous testing under extreme conditions.

OceanBase was planned from day one as a distributed relational database, so it naturally supports large-scale, high-concurrency scenarios. Even so, Alipay's user base is enormous and the pressure during Double 11 was immense: by Double 11 in 2017, even with additional elastic databases in use, database CPU usage was close to its ceiling and had become the bottleneck for further scaling.

For Double 11 in 2018, Alipay proposed a million-payment architecture internally, meaning an architecture that can sustain system pressure on the order of a million transactions per second. Its core is the OceanBase 2.0 distributed partitioning solution.

Previously, the database was scaled by adding new DB clusters and adding data sources to the applications, because a single DB machine has limited capacity and one UID shard mapped to at most one machine. This brought a series of problems: application memory usage grew, scaling out and scaling back across multiple data sources was time-consuming and labor-intensive, and maintaining multiple DB clusters day to day was expensive. The goal, therefore, was to let the database scale dynamically the way applications do: break the limit of one UID shard per machine, so that applications and the database can scale together and new capacity can be added without new DB clusters.

Building on the database's partitioning capability, DB scalability was greatly enhanced, avoiding the awkwardness of having to add whole clusters just to expand capacity. The applications were upgraded and modified accordingly, including an upgrade of the site-wide sequence-number architecture, changes to a series of middleware, and rework of task-retrieval scenarios.

OceanBase Partition Architecture

The elastic approach of a traditional database physically splits data across different machines, which makes data access, development, later maintenance, and supporting data facilities very cumbersome; it is also hard to reclaim resources quickly after a split, and splitting and re-aggregating data cannot be done without affecting the business. By contrast, the OceanBase 2.0 architecture is completely non-intrusive to the business. It achieves self-organization and load balancing of data shards through partitioning, automatic routing through generated columns and partitioning rules, and removes the performance overhead of distributed transactions by grouping partitions (partition_group), thereby achieving lossless linear scaling. In addition, the share-nothing design between data shards isolates shard faults and eliminates single points of failure, giving a high-availability architecture.
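
A hedged sketch of why partitioning avoids adding clusters (the partition count, the routing rule, and the idea of deriving a routing value from the UID are illustrative assumptions, not OceanBase internals): the application keeps talking to one logical table, while partitions are spread and rebalanced over however many servers the cluster currently has.

```python
# Illustrative contrast between "add a cluster" scaling and partition-based scaling.
# The numbers and the partition rule (hash of uid into 128 partitions) are assumptions.

PARTITIONS = 128

def partition_of(uid: int) -> int:
    # A routing value derived from the uid, mapped to a partition by a fixed rule.
    return uid % PARTITIONS

# Old approach: each uid shard is pinned to one machine / cluster, so more capacity
# means more clusters and more data sources configured in every application.
# New approach: the application always talks to one logical table; partitions are
# rebalanced across however many servers the cluster currently has.
servers = 8

def server_of(uid: int) -> int:
    return partition_of(uid) % servers   # load balancing of partitions over servers

print(partition_of(1234567889), server_of(1234567889))
# Doubling `servers` to 16 spreads the same 128 partitions over more machines,
# with no new cluster and no application-side data-source changes.
```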

On November 11, 2018, OceanBase 2.0 went live and carried all transaction and payment traffic. Because the architecture is built on the OceanBase 2.0 partitioning solution, it can be scaled out easily to support million-level transaction volumes, and the battle against traffic peaks has, for the time being, come to a pause.

5. Technical support: promoting technical standardization

Double Eleven is a test bed for new technologies, so how can we be sure that these technologies can effectively support the peak traffic? Especially in Alipay, where people’s financial security is involved, the consequences of any problems are extremely serious, so we must be extremely cautious.

In 2014, Alipay launched full-link stress testing, which became a powerful tool for systematic technology verification. Since 2017, Alipay has been building an automated, intelligent technical risk prevention and control system. On Double 11 in 2018, the big-promotion central control system went live, and promotion-related technology began to be standardized.

Schematic diagram of the central control of the big promotion

The big-promotion central control is a one-stop promotion assurance solution. Its purpose is to accumulate the experience of past promotions, turn it into routines and standards, and eventually move toward unattended operation, so that engineers no longer need to stay up all night for promotions.

With the central control in place, automated, lossless stress testing can be performed: online stress tests yield the desired results without affecting live business. Its core technical capabilities are isolation of environments, machines, and threads, plus intelligent circuit breaking when a stress test goes wrong.
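
One common way to realize this kind of lossless online stress testing, sketched below as an assumption rather than Alipay's actual implementation, is to tag stress-test requests and isolate their side effects (for example, by writing to shadow tables), with a circuit breaker that aborts the test when error rates spike.

```python
# Hedged sketch of lossless online stress testing (not Alipay's actual implementation).
# Stress-test requests carry a marker; their writes go to shadow tables so real
# business data is untouched, and a simple circuit breaker aborts the test on errors.

class StressTestContext:
    def __init__(self, error_threshold: float = 0.01):
        self.errors = 0
        self.total = 0
        self.error_threshold = error_threshold

    def table_for(self, table: str, is_stress_traffic: bool) -> str:
        # Stress-test traffic is isolated into a shadow table.
        return f"{table}__shadow" if is_stress_traffic else table

    def record(self, ok: bool) -> None:
        # Circuit breaker: stop the test if the error rate exceeds the threshold.
        self.total += 1
        self.errors += 0 if ok else 1
        if self.total >= 100 and self.errors / self.total > self.error_threshold:
            raise RuntimeError("circuit breaker: abort stress test")

ctx = StressTestContext()
print(ctx.table_for("payment_order", is_stress_traffic=True))   # payment_order__shadow
print(ctx.table_for("payment_order", is_stress_traffic=False))  # payment_order
```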

Stress testing is not a panacea; some problems are hard to expose through it. Since 2018, Alipay has also run red-blue attack-and-defense drills to check whether emergency strategies, organizational safeguards, and response speed are in place when anomalies occur at the promotion peak, and to verify that the stability of new technologies is up to standard.

To keep funds safe during big promotions, Alipay developed its own real-time funds verification system, which verifies fund security in real time at peak and confirms that every sum of money is accurate.

Having the technology ready is not enough. Before each promotion there are still many configurations to switch over, and a single mistake can have serious consequences. Alipay therefore built a final-state technical risk inspection capability: the day before the promotion, hundreds of automated configuration checks confirm that every system has entered its promotion state and nothing is amiss.
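
A tiny sketch of what such an automated pre-promotion check run might look like; the check names and expected values are invented for illustration.

```python
# Toy pre-promotion configuration checker (check names and expected values are invented).
EXPECTED = {
    "rate_limit_profile": "double11",
    "elastic_units_enabled": "true",
    "log_level": "WARN",
}

def run_checks(actual_config: dict) -> list:
    """Return a list of (key, expected, actual) mismatches; empty means all clear."""
    return [(k, v, actual_config.get(k)) for k, v in EXPECTED.items()
            if actual_config.get(k) != v]

problems = run_checks({"rate_limit_profile": "daily",
                       "elastic_units_enabled": "true",
                       "log_level": "WARN"})
print(problems or "all systems in promotion state")
```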

As the clock ticks toward midnight, the promotion is about to begin.

6. The future is promising, and we will walk it together

In summary, the Double 11 traffic peak tests the scalability of the architecture, the carrying capacity of the database, the scheduling power of operations, and the completeness of technical assurance. The technical preparations needed to carry a promotion smoothly go far beyond what this article covers; there are many more behind-the-scenes heroes, such as full-link stress testing, that space does not allow us to introduce one by one.

Alipay also keeps updating its technical arsenal. During this year's Double 11, several new capabilities were put to the test: OceanBase 2.2 went live, the version that took first place in the TPC-C benchmark, and smoothly supported the promotion; and the self-developed Service Mesh took the promotion stage for the first time, now covering 100% of Alipay's core payment links and forming the industry's largest Service Mesh cluster.

With the rollout of inclusive finance and the growth of the Internet of Everything, the traffic pressure on payment platforms will only increase. Today's peaks may be tomorrow's routine, and future peaks may be orders of magnitude higher than today's. The battle over payment peaks will continue, the technology behind it will keep evolving, and the technology contest on Double 11 will be even more exciting.

Double Eleven is not only a shopping festival, but also a driving force for the development of Internet technology. Looking forward to 2020.
