How does QQ construct its membership activity operation platform?

The QQ Member Activity Operation Platform (AMS) is one of the main carriers of QQ's member value-added operation business, and a web system responsible for massive activity operations. Over the past four years, the number of daily AMS requests has grown from 2-5 million to 300-500 million, with daily CGI requests peaking at 800 million. Along the way, the AMS architecture has undergone major adjustments, and we have gone through a memorable technical journey.

This article shares the architectural design practice of the QQ member activity operation platform, in the hope that it will be useful to fellow engineers.

1. Challenges of operating massive activities and our solutions

The growth of a product business is inseparable from "operation", and many forms of operation show up as activity requirements. The more operation-oriented a product business is, the more activity development work it usually generates.

When we talk about "activities", many people's first reaction is that they involve little technical difficulty. If we only run 1-2 activities, that is indeed true. However, once the scale grows to 1,000 activities or more, it becomes a technical problem.

1. Challenges and problems in event operation business

(1) Tencent’s SNG value-added services face challenges in operating and developing massive activities

Tencent's value-added product department needs to run continuous, high-intensity operational activities across businesses such as the QQ membership system, game operation and personalization in order to drive new user acquisition, engagement and retention. This alone generates a lot of operational requirements. Moreover, since 2014, as the mobile Internet has matured, the demand for mobile game operations on the QQ platform has exploded, and the number of activities that need to go online each month has increased several times over.

(2) Complexity of activity development

Developing an activity requires a certain amount of work in itself. Large-scale promotional activities in particular place higher demands on functionality and performance. A typical large-scale activity usually has tens of millions of participating users, so the performance requirements are relatively high. If high-concurrency functions such as "flash sales" or "rush purchases" are involved, they pose a serious challenge to the underlying support systems.

Activities offer many functions, including gift packages, lucky draws, sharing, invitations, redemption, rankings, payments and so on. These different forms of participation and presentation also involve more back-end interface communication and joint debugging. For example, our game operation business covers hundreds of games, and different games correspond to different service interfaces. There are thousands of game-related communication interfaces.

Another very important issue is the safety and reliability of activity operations. Most of our activities involve distributing valuable prizes, such as iPhones, iPads and other high-value gift packages, which impose very high security requirements.

(3) Human resource issues in event operation and development

In the traditional manual development model, even an ordinary activity requires a 1-week development cycle, and a typical large-scale activity requires 1-2 weeks, with a heavy development and testing workload. In addition, many activities are tied to specific holidays and usually have strict launch deadlines. Faced with urgent and rapidly growing operational requirements, manpower is very limited.

Currently, there are more than 4,300 activities online throughout the year.

2. The nature of the activity and our methodology

By analyzing and abstracting the activity patterns of different businesses, we found that most activities can be expressed as a set of "conditions" and "actions", which can be encapsulated as common "condition" (Rule) and "action" (Operation) activity components. Different combinations of conditions and actions make up the implementation of an activity's logic. We therefore wanted to unify these components through platform-based, framework-driven development. At the same time, the framework and platform provide a runtime environment for the activity components with high reliability, high performance, overload protection and horizontal scalability.

The activity component only needs to encapsulate its own business logic, and the core functional framework automatically supports it, thereby achieving complete automation of activity operation and development.
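
The condition/action abstraction described above can be illustrated with a minimal sketch. The class names follow the article's Rule and Operation terms, but the fields, the `Activity` class and the sample rules are assumptions made for illustration, not the actual AMS components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

User = Dict[str, object]


@dataclass
class Rule:
    """A reusable participation condition (hypothetical shape)."""
    name: str
    check: Callable[[User], bool]


@dataclass
class Operation:
    """A reusable action such as sending a gift package (hypothetical shape)."""
    name: str
    run: Callable[[User], str]


@dataclass
class Activity:
    """An activity is just a combination of rules and operations."""
    rules: List[Rule]
    operations: List[Operation]

    def participate(self, user: User) -> List[str]:
        # Collect every rule the user fails; run the actions only if none fail.
        failed = [r.name for r in self.rules if not r.check(user)]
        if failed:
            return [f"rejected: {name}" for name in failed]
        return [op.run(user) for op in self.operations]


# Illustrative components; real ones would call back-end services.
is_member = Rule("is_member", lambda u: bool(u.get("member")))
level_at_least_10 = Rule("level_at_least_10", lambda u: int(u.get("level", 0)) >= 10)
send_gift = Operation("send_gift", lambda u: f"gift sent to {u['uin']}")

activity = Activity(rules=[is_member, level_at_least_10], operations=[send_gift])
```

With this shape, a new activity is mostly a new combination of existing components, which is the property the platform relies on.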

The task that AMS needs to undertake is to realize this plan. There are three main issues that need to be addressed:

(1) Build an efficient activity development model (operational development automation).

(2) Build an operation support platform with high reliability and high availability.

(3) Ensure the safety of event operations.

2. Build an efficient activity operation and development model

In early 2012, before AMS existed, the activity development model was fairly ad hoc. There was no strict, complete framework support, and component reuse was low, so it often took us more than a week to develop an activity. One characteristic of activity development at that time was that everyone did their own thing. Each operations developer produced their own batch of front-end and back-end components, and the CGI layer accumulated many entry points with different rules. These components were messy, unsystematic and hard to maintain. Most importantly, they were complex to use and rarely reused across activities, which kept development efficiency relatively low.

At that time, there was a certain degree of accumulation of demand for event operations, and many demands lacked manpower support. Our product colleagues also felt that we were slow in launching events.

1. System architecture layering and unification

Based on this problem, the first solution we thought of was to integrate the front-end and back-end components and rebuild a clear and unified system. Based on the principles of layering, reusing and simplifying the system's interfaces, a complete system is gradually built. Moreover, from our development perspective, the most important goal is to reduce the workload of activity development, liberate developers, and improve R&D efficiency.

Our front-end components are integrated through a framework called Zero. Each front-end function is a component that is maintained and reused in a unified way. The CGI layer was refactored and made framework-driven, bringing every piece of business logic under a single entry point and a unified system. The core functional framework provides support automatically, and existing activity components can be configured and used directly. If no new function is needed, operations developers only have to configure a simple set of parameters to complete the back-end logic, without writing code. Basic support services are managed as a platform, with unified access and maintenance.

After restructuring the system, we were finally able to control the combination of front-end and back-end components through a single activity configuration. Conditions and actions such as gift delivery can be combined dynamically and freely. Participation conditions can be combined with "and", "or", "not" and so on, and then select the corresponding action, which together implement the activity's logic.

Since then, activity development has become much simpler. The amount of code that needs to be written has dropped sharply, and the job has basically become one of "filling in parameters". The code for an activity project has shrunk from 1,000-2,000 lines to fewer than 100.
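
As a rough illustration of configuration-driven logic, the sketch below evaluates a nested "and" / "or" / "not" expression over named conditions and then picks an action. The configuration format, the condition names and the `handle` function are all hypothetical; the article does not show the real AMS configuration schema.

```python
from typing import Any, Callable, Dict

# Registry of reusable conditions, keyed by the name used in configurations.
CONDITIONS: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "is_member": lambda u: u.get("member", False),
    "played_game": lambda u: u.get("played_game", False),
    "already_received": lambda u: u.get("received", False),
}


def evaluate(expr: Any, user: Dict[str, Any]) -> bool:
    """Evaluate a condition name or a nested [operator, operand...] list."""
    if isinstance(expr, str):
        return bool(CONDITIONS[expr](user))
    op, *args = expr
    if op == "and":
        return all(evaluate(a, user) for a in args)
    if op == "or":
        return any(evaluate(a, user) for a in args)
    if op == "not":
        return not evaluate(args[0], user)
    raise ValueError(f"unknown operator: {op}")


# The whole activity logic is data: (member OR played the game) AND NOT received.
ACTIVITY_CONFIG = {
    "condition": ["and", ["or", "is_member", "played_game"], ["not", "already_received"]],
    "action": "send_gift_package",
}


def handle(user: Dict[str, Any]) -> str:
    """Return the configured action name if the user qualifies."""
    if evaluate(ACTIVITY_CONFIG["condition"], user):
        return ACTIVITY_CONFIG["action"]
    return "not_eligible"
```

Because the logic lives in data rather than code, changing an activity means editing `ACTIVITY_CONFIG`, not redeploying a program.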

For example, as shown in the figure below, the gift package that originally required writing a lot of logic code has become just one line of parameters on the front end.

The clear structure improves the maintainability of the system, and more importantly, the efficiency of activity development is also greatly improved.

With the development manpower remaining unchanged, we have achieved a significant improvement in the efficiency of event development and the backlog of product demand has been effectively alleviated.

2. Highly visual development mode (automated operation)

However, in 2014, with the rapid development and gradual maturing of the "mobile Internet", we also entered the era of the "mobile game explosion". Because mobile games have a shorter development cycle, many new ones are launched almost every month, and the demand for mobile game activity operations soon exploded. The number of activities undertaken by AMS quickly rose from just over 60 per month to 200. Against this background, development manpower was once again stretched thin, and the backlog of requirements grew worse.

Since we are talking about development manpower, we need to introduce our activity project model at the time. Tencent is a mature Internet company, and every link in the R&D process (design, reconstruction, development, experience/testing, and release) is handled by a separate, independent role. Under the fastest and most ideal conditions, an ordinary mobile activity project takes 1 day for design, 1 day for reconstruction, 2 days for development, and 1 day for experience/testing. That is at least 5 working days, so the R&D cycle is at least 1 week. Ideals are beautiful, but reality is cruel. In actual projects, resource coordination and external factors usually make such perfect handoffs impossible, so the R&D cycle of an ordinary activity often exceeds 1 week.

Suddenly adding more than 100 new requirements is a huge pressure for any team.

Therefore, we have to adopt another way of thinking to look at event operations. Can we try not to invest in development manpower? We call it "automated operation". The essence of automation is to build a sufficiently powerful platform and tool support to allow operations staff to complete activity development themselves.

Earlier, we mentioned that when developing ordinary activities, each functional point has become a simple configuration, and the work of activity development is to fill in the activity parameters of this configuration into the page button. If we implement a visual tool and turn the work of filling in the configuration into the function of dragging buttons, we can completely say goodbye to the work of "writing code".

The final result is a visual, drag-and-drop activity template system. Operations staff only need some training to learn how to use it. First, the operations team uploads the activity design drawing, and the template system automatically slices it (completing the reconstruction work). Then the activity functions are configured and placed on the page by dragging button components (essentially transparent div overlays). Finally, they click Experience and Release to launch the activity. Because our functional components are rigorously tested before they are offered to operations staff, activities built from them usually do not need to be tested by technical testers.

From then on, operations colleagues began to take over development, reconstruction and testing work on a large scale. However, they do not understand the technical details, which quietly increases the risk of launching an activity. Therefore, in addition to the activity template system, we built a series of supporting platforms and tools based on the characteristics of the AMS platform.

In short, "human errors" cannot be prevented by the people who make them; they have to be guarded against and detected by platforms and programs. Therefore, we built a powerful and intelligent system for configuration checking and activity data monitoring. For example, if there are 100 gift certificates in the resource pool but the operations staff mistakenly configure 200, the platform detects this and warns them that the configuration is incorrect.
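
The gift certificate example can be sketched as a simple pre-launch check. The function name and the data shapes are assumptions for illustration; the article does not describe how the real checker is implemented.

```python
from typing import Dict, List


def check_gift_config(resource_pool: Dict[str, int], configured: Dict[str, int]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the config passes."""
    errors = []
    for gift_id, quantity in configured.items():
        available = resource_pool.get(gift_id)
        if available is None:
            # The activity refers to a gift that was never provisioned.
            errors.append(f"{gift_id}: not found in resource pool")
        elif quantity > available:
            # The activity promises more gifts than the pool can deliver.
            errors.append(f"{gift_id}: configured {quantity}, but only {available} available")
    return errors


# The case from the text: 100 certificates in the pool, 200 configured.
problems = check_gift_config({"gift_certificate": 100}, {"gift_certificate": 200})
```

A release step could refuse to publish the activity whenever the returned list is not empty.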

Automated operation has optimized our R&D process itself. We have removed the reconstruction, development and testing steps from the activity R&D process, which has greatly shortened the project cycle and brought a qualitative leap in activity R&D efficiency. The backlog of mobile game operation requirements has been solved at its root.

The completion of our efficient event development model has also led to the rapid growth of our AMS platform business scale. In October 2015, we launched more than 400 activity projects a month, and more than 80% of them were template activities "developed" by our operations colleagues.

3. Reliability and performance support construction

By building an efficient event development model, we have enabled the business scale and traffic scale of our AMS operating platform to grow 100 times over the past three years, with more than 1,000 events online at the same time. At the same time, the reliability and stability of the AMS platform has also become one of the most important indicators. If there is a problem with the platform, the impact will be very wide.

The architecture of the AMS platform is divided into four layers: the entry layer, the business logic layer, the service layer and the storage layer (CKV's NoSQL storage), plus an offline service and monitoring system.

1. Reliability

The event operation business is very sensitive to the reliability of the platform, because it involves the distribution of many high-value gift packages, and some of them also involve payment links. Stability is paramount.

To ensure availability, we do the following:

  • Stateless routing (L5) for all services and storage. The main purpose is to avoid single points of failure, that is, to prevent the whole service from being paralyzed when one node fails. In fact, even access services with a master-slave setup (switching to the backup machine when the master fails) are not reliable enough. After all, there are only two machines, and both can still fail. Our back-end services are usually provided by a group of machines that share no state, so requests can be distributed among them at random.
  • Supports parallel expansion. In case of high traffic, capacity expansion can be achieved by adding machines.
  • Automatically remove abnormal machines. In our routing service, when a service machine is found to be abnormal, it can be automatically removed and then added back after it recovers.
  • Monitoring alarms. The success rate is counted in real time. If the success rate is lower than a certain threshold, an alarm will be issued to notify the service manager.
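
The routing behaviour in the list above (random distribution across stateless nodes, automatic removal of abnormal machines, and re-admission after recovery) can be sketched as follows. This is not the L5 implementation; the class, the sliding-window size and the threshold are assumptions chosen for illustration.

```python
import random
from collections import deque
from typing import Deque, Dict, List


class StatelessRouter:
    """Pick any healthy node at random; health is a recent success rate."""

    def __init__(self, nodes: List[str], window: int = 20, min_success_rate: float = 0.5):
        self.min_success_rate = min_success_rate
        # Keep only the last `window` call results per node.
        self.results: Dict[str, Deque[bool]] = {n: deque(maxlen=window) for n in nodes}

    def _success_rate(self, node: str) -> float:
        history = self.results[node]
        return sum(history) / len(history) if history else 1.0

    def healthy_nodes(self) -> List[str]:
        return [n for n in self.results if self._success_rate(n) >= self.min_success_rate]

    def pick(self) -> str:
        # If every node looks unhealthy, fall back to all nodes rather than fail outright.
        candidates = self.healthy_nodes() or list(self.results)
        return random.choice(candidates)

    def report(self, node: str, ok: bool) -> None:
        """Record the outcome of a real request or of a background health probe."""
        self.results[node].append(ok)
```

In this sketch a removed node only comes back if something keeps reporting on it, so a background probe that calls `report` for excluded nodes is assumed. The same success-rate figure could also feed the alarm threshold mentioned in the last bullet.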

The AMS platform is especially rigorous about alarms and monitoring. We provide multi-channel alarms (RTX, WeChat, email, SMS) and multi-dimensional monitoring (L5, inter-module calls, automated test cases, AMS business monitoring dimensions, etc.). Even if some monitoring dimensions fail, we can still detect problems promptly. Of course, we also tune the alarm interval and algorithm to minimize noise while still catching real system problems.

Another reliability challenge is overload protection. No matter how many machines the system has, there is always a risk of overload in certain scenarios, such as "flash sales" and promotions that start at a scheduled time. AMS currently has more than 1,000 activities online at the same time, which is a very large number. Among them there are occasionally high-traffic promotions that the business side does not even tell us about in advance. Whatever the scenario, we must ensure that the AMS platform itself does not collapse. If the whole cluster goes down, all users are affected, whereas overload protection only discards some user requests and most users still get normal service.

In terms of overload protection, we have taken some simple measures:

  • Overload protection at the platform entrance. Based on business characteristics and operational experience with the machines, set the maximum number of Apache service processes/threads to a value at which neither the machine nor the service will crash.
  • Protection of the many back-end delivery interfaces. Behind the AMS platform there are hundreds of back-end service interfaces, and their performance varies. Weak back-end interfaces are protected by rate limiting inside the AMS platform, because if they are overwhelmed, whole categories of interfaces become completely unavailable.
  • Service degradation. Bypass non-critical paths, such as data reporting services.
  • Core services are deployed independently and physically isolated. Don't put all your eggs in one basket. The purpose of physical isolation is to avoid mutual impact between businesses and protect the normal operation of other services.
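
One common way to rate limit calls to weak back-end interfaces is a token bucket per interface. The article does not say which algorithm AMS uses, so the sketch below is only one possible approach; the class names, the interface name and the limits are hypothetical, and the code is not thread-safe as written.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Allow `rate_per_sec` calls on average, with bursts of up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int, clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class BackendLimiter:
    """One bucket per back-end interface, so a weak interface gets a low limit."""

    def __init__(self, limits: Dict[str, Tuple[float, int]], clock: Callable[[], float] = time.monotonic):
        self.buckets = {name: TokenBucket(rate, burst, clock) for name, (rate, burst) in limits.items()}

    def call(self, interface: str, fn: Callable[[], str]) -> str:
        if not self.buckets[interface].allow():
            # Reject early instead of letting the request overwhelm the back end.
            return "rejected: backend protected"
        return fn()
```

The `clock` parameter exists only so the limiter can be exercised deterministically in tests; production code would keep the default and add locking around `allow`.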

2. Business protection in flash sales scenarios

Flash sales are a common form of participation in event operations. In addition to the problem of traffic impact, they also bring challenges such as business logic security issues under high concurrency. At this time, we must introduce appropriate locking mechanisms to avoid these problems. It's the same type of problem as thread safety.

The first is the user's session lock, that is, in the same sub-activity function, the same user is prohibited from making a second request before the previous delivery request is completed. The reason for doing this is that if the same user initiates two concurrent requests, multiple gift packages may be sent within a critical time.

For example, consider user A in the figure below. Before the first request has written the "participation succeeded" flag, the second request can still pass the condition check and enter the delivery stage. In this case, user A may get 2 gift packages.

The second kind of lock protects a flash sale against multiple users. The scenario is similar to the session lock, except that many users concurrently request the same gift package. Likewise, in the critical window while the remaining gift package count is being checked, "over-issuance" (issuing more gift packages than exist) may occur.
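
The session lock can be sketched as a non-blocking, per-user and per-sub-activity lock around the "check condition, deliver, write flag" sequence. The sketch keeps state in process memory for simplicity; a multi-machine deployment would need a shared store, and every name here is an assumption for illustration.

```python
import threading
from typing import Set, Tuple

Key = Tuple[str, str]


class SessionLock:
    """Reject a second request from the same user while the first is in flight."""

    def __init__(self) -> None:
        self._held: Set[Key] = set()
        self._mutex = threading.Lock()

    def try_acquire(self, user_id: str, sub_activity_id: str) -> bool:
        key = (user_id, sub_activity_id)
        with self._mutex:
            if key in self._held:
                return False  # a delivery for this user is still running
            self._held.add(key)
            return True

    def release(self, user_id: str, sub_activity_id: str) -> None:
        with self._mutex:
            self._held.discard((user_id, sub_activity_id))


def deliver_once(lock: SessionLock, participated: Set[Key], user_id: str, sub_activity_id: str) -> str:
    if not lock.try_acquire(user_id, sub_activity_id):
        return "busy"
    try:
        # The condition check and the flag write now happen under the lock,
        # so a concurrent request cannot slip in between them.
        if (user_id, sub_activity_id) in participated:
            return "already_received"
        # ... the call to the delivery interface would go here ...
        participated.add((user_id, sub_activity_id))
        return "delivered"
    finally:
        lock.release(user_id, sub_activity_id)
```

Returning "busy" immediately, rather than waiting, keeps the second request from piling up behind the first.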

The problem is obvious, and it can certainly be solved by using locks, but what kind of lock mechanism to use is another question worth thinking about. Because the business scenarios are different, the solutions chosen are naturally different. We will discuss the implementation mechanism of flash sales from three different perspectives.

  • Queue service. This is a relatively simple idea. We put flash sale requests directly into a queue and execute them one by one, as if forcing multi-threaded work into a single thread. However, under high concurrency there are so many requests that the queue's memory may be exhausted in an instant, leaving the system in an abnormal state. Designing a very large in-memory queue is an option, but the rate at which the system drains the queue usually cannot keep up with the rate at which requests pour in. In other words, the queue keeps growing, requests near the back wait a long time for a response, and real-time feedback to users is impossible.
  • Pessimistic locking. Pessimistic locks are strongly exclusive. When a user request comes in, it must first try to obtain the lock; only the request holding the lock can execute, and the others wait and try again. Under high concurrency there are many such competing requests, and the number keeps increasing, so requests back up on the server. The vast majority never manage to grab the lock and just keep waiting (similar to thread starvation). Users again cannot get real-time responses.
  • Optimistic locking. Compared with pessimistic locking, this uses a looser mechanism based on version numbers. Every request is allowed to attempt the modification (to execute the delivery process), but each one first obtains the version number of the data. Only a request whose version number still matches can update successfully; a mismatch means another request has already modified the data. The other requests immediately return a purchase failure, so users still get real-time feedback. The advantage is that locking is achieved without causing a backlog of user requests.

Our business lock is implemented optimistically, because a single delivery process usually takes more than 100ms. With a queue or a pessimistic lock, requests would easily back up under high concurrency, and we could not provide real-time feedback. Our implementation ensures that users get feedback within 500ms whether or not their flash sale request succeeds. We use it widely in flash sales and rush purchases. It has supported 50,000 flash sale requests per second and has run smoothly and safely.
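
The version-number idea can be sketched as a compare-and-set on the remaining stock. The `GiftStock` class stands in for a storage service that offers an atomic versioned update; that capability, and all names below, are assumptions for illustration rather than a description of the actual AMS or CKV interfaces.

```python
import threading
from typing import Tuple


class GiftStock:
    """Remaining gift count guarded by a version number."""

    def __init__(self, remaining: int) -> None:
        self._remaining = remaining
        self._version = 0
        # The mutex only makes each single read or write atomic, standing in
        # for the store's own atomicity. It is never held across a request.
        self._mutex = threading.Lock()

    def read(self) -> Tuple[int, int]:
        with self._mutex:
            return self._remaining, self._version

    def compare_and_set(self, expected_version: int, new_remaining: int) -> bool:
        with self._mutex:
            if self._version != expected_version:
                return False  # another request updated the data first
            self._remaining = new_remaining
            self._version += 1
            return True


def flash_sale(stock: GiftStock) -> str:
    remaining, version = stock.read()
    if remaining <= 0:
        return "sold_out"
    if stock.compare_and_set(version, remaining - 1):
        return "success"
    # No waiting and no retry loop: the user gets an immediate failure.
    return "failed"
```

A request that loses the race returns at once, which is what keeps response times bounded even when each delivery takes a while.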

4. Construction of business security system

As the business scale grows, the number of shipping operations sent out by the AMS platform every day increases. On non-holiday days, more than 50 million items are shipped every day, and at peak times, more than 200 million items are shipped. At the same time, the activities here contain many high-value items, such as iPads, iPhones, high-value virtual props, and even some activities promote the use of cash gift packages (paid via Tenpay).

Therefore, our business security requirements are higher and more stringent than those of ordinary Internet products.

1. Traditional security attack dimensions and malicious users

Mature Internet companies usually have their own security teams. These teams typically build a blacklist database of malicious users through data modeling, and then continuously maintain and update the malicious accounts, IP addresses and related information in it. We integrate this service into our platform. Malicious studios control large numbers of accounts and IPs, and we intercept them using this database.

However, no matter how sophisticated the data modeling algorithms are, there will always be a hit rate problem in order to prevent the accidental killing of real users. They usually cannot intercept all malicious requests, and there will always be a few that slip through the net.

What we are thinking about is to combine the business with this foundation and add new security protection strategies. Many people may wonder whether adding participation thresholds can achieve further protection? For example, on the basis of traditional security crackdown strategies, we add business restrictions, such as setting the activity participation conditions to super membership (paid membership of 20 yuan per month). In this way, we use a higher threshold to intercept malicious requests.

For a long time, I thought this approach should be reliable because it raised the threshold for participation. Until one time, we captured a batch of tens of thousands of malicious QQ numbers (all of which were junk numbers with very long numbers), and they were all super members. The malicious studio actually spent a lot of money to open super membership for them at 20 yuan a month. From that time on, I began to understand that the restrictions on paid membership were also unreliable.

The status of super membership brings more convenience to these malicious numbers, and can give them the opportunity to obtain more high-value gift packages, which can be converted into money and then cover the "investment cost" of the malicious studio.

2. Construction of business security support system

AMS builds multi-dimensional, all-round security support capabilities. We divide these security constructions into four dimensions:

  • Portal security: CGI portal that connects to users, filters malicious and attack requests, protects business logic security, etc.
  • Human errors: Development and operations colleagues may make erroneous operations or configuration errors on the management side. Human errors cannot be guaranteed by humans. Instead, platform-level permission management and intelligent detection should be built to prevent people from making mistakes.
  • Operation monitoring: Multi-dimensional monitoring of business status to ensure rapid discovery of problems.
  • Audit security: fully collect, manage and monitor sensitive permissions to ensure controllability and traceability.

Author: Xiaoshiguang Tea House, authorized to be published by Qinggua Media.

Source: Tech Teahouse
