vivo Hawking Experiment Platform Design and Practice-Platform Product Series 02

1. Introduction

After a period of unchecked growth fueled by early-mover advantages, Internet companies have increasingly turned to scientific, fine-grained product development, shifting from extensive to intensive growth. In the United States, data-driven methodologies such as growth hacking help technology giants such as Google, Microsoft, and Facebook sustain business growth; in China, fine-grained data operations and A/B experiment analysis have become a widely accepted, core means of driving business growth. The A/B testing platform is a typical example: it has become an indispensable tool for mainstream domestic companies, improving both traffic conversion efficiency and the iteration efficiency of product and R&D teams.

Over the past few years, vivo Internet has consistently emphasized scientific, experiment-based decision-making: every user-facing change must be backed by experimental conclusions. For example, changing the background color of the top ad and testing a new ad click-through rate (CTR) prediction algorithm must both go through experiments, so a powerful A/B experiment platform is essential. The vivo Hawking Experiment Platform (hereinafter Hawking) has grown from a single system into a company-level, one-stop platform for A/B experimentation, helping vivo Internet's core businesses run fast, accurate experiments and drive business growth efficiently.

2. Project Introduction

2.1 A/B Experiment

In the Internet field, an A/B experiment is an iterative method for guiding improvements to existing products or services. Take improving the order conversion rate of a product as an example. We design a new order page whose layout and copy differ from the original page, then randomly split user traffic into two groups, A and B (the new and old pages respectively): 50% of users see version A and 50% see version B. After a period of observation, we find that the order conversion rate of version A users is 70%, higher than the 50% of version B. We therefore conclude that version A is effective and roll the new page out to all users.



The above is a concrete application of A/B testing to product iteration. We divide the complete life cycle of an A/B experiment into three stages:

  1. Before the experiment: clarify the improvement goals, define the experimental indicators, and complete the development and launch of the relevant functions;
  2. During the experiment: determine the traffic ratio of each experimental group and open online traffic for testing according to that ratio;
  3. After the experiment: evaluate the experimental results and make decisions.

2.2 Layered experiment model

Hawking's layered experiment model is designed with reference to the overlapping experiment framework described in Google's paper "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation".

2.3 Platform development and its application and value in vivo business

Hawking was launched in 2019. After more than three years of development, it runs more than 900 experiments per day on average, with a peak of over 1,000.

  • It supports vivo's domestic and overseas businesses and serves more than 20 departments of the company.
  • The standardized experiment process lowers the barrier to running experiments and improves experiment efficiency.
  • Automated data analysis tools help businesses make quick decisions, speed up product iteration, and promote business development.
  • Platform capabilities are reused across the company, avoiding duplicated construction by different organizations and improving production efficiency.

3. Hawking System Architecture


3.1 Experimenters

Experimenters cover multiple roles. Business parties use the Hawking management console to run experiments, manage indicators, and analyze experiment results.

3.2 [Experiment Portal]

The portal provides two functions: experiment management and experiment effect analysis.

3.2.1 Experiment management

The platform provides a visual page where business parties configure experiments, select diversion strategies, allocate traffic, and manage whitelists.

3.2.2 Experiment effect analysis

It includes the following four core capabilities:

1. Indicator Management

Different experiments focus on different indicators. To automate effect evaluation, the platform provides indicator configuration and integration capabilities.

[Must-see indicators]: Usually core business indicators. Every experiment must ensure that none of these indicators shows an obvious negative effect. The platform integrates with the big data indicator management system, and the results for these indicators directly reuse that system's data services.

[Personalized indicators]: Usually indicators for ad hoc analysis within an experiment. For example, a banner style experiment observes the exposure and click-through rate of a specified banner. The platform lets users configure custom indicators and automatically generates computing tasks on the big data computing platform, so that data for custom indicators is produced automatically.

2. Comparative analysis and significance conclusions

The platform provides visualization components for comparative analysis and significance conclusions to display effect evaluation results. They show the experimenter the overall improvement of each experimental scheme over the control scheme and the day-by-day increase or decrease, together with the confidence interval and significance conclusion of each indicator.
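As an illustration of what a significance conclusion and confidence interval can look like for a conversion-rate indicator, here is a minimal sketch. The article does not say which statistical test Hawking uses; the pooled two-proportion z-test below is one common choice, and the function name and the sample sizes in the usage note are assumptions made for this example.

```python
import math


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Compare two conversion rates with a pooled two-proportion z-test.

    Hypothetical helper, not Hawking's API. z_crit=1.96 corresponds to a
    95% confidence interval, matching the 0.05 significance threshold below.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_a - p_b
    # Pooled standard error under the null hypothesis of equal rates.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Unpooled standard error for the confidence interval of the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "significant": p_value < 0.05,
    }
```

For instance, `two_proportion_ztest(700, 1000, 500, 1000)` models a 70% versus 50% conversion rate with an assumed 1,000 users per group; the interval for the difference lies entirely above zero, so the difference would be reported as significant.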

3. AA analysis

The AA analysis provided by the platform helps experimenters verify whether the populations that actually enter each experimental scheme already differed significantly in core business indicators before the experiment started, and thus helps them judge whether the experimental conclusions are reliable.

4. Real-time monitoring of diversion

Diversion results can be viewed in real time, so abnormal traffic can be handled manually in a timely manner.

3.3 [Experimental diversion service]

1. Multi-terminal access service

The platform provides access options for different business needs: an Android SDK for Android clients, a Java SDK for servers, an NGINX-based H5 experiment service for traffic diversion, Dubbo/HTTP services, and a C++ SDK that is yet to be built.

2. Experimental diversion method

The platform provides stable and efficient online real-time diversion services.

  • [Random diversion]: Users are randomly grouped and diverted by applying a hash algorithm to the user identifier.
  • [Specified crowd diversion]: Before the experiment, a group of people is identified and labeled for diversion.
  • [Even distribution of covariates]: When the population is randomly grouped by a hash algorithm, the groups are equal in size, but the indicator distributions of the grouped users may be uneven, so the experimental effect does not meet expectations. To solve this pain point, the platform launched a covariate balancing algorithm.

This algorithm ensures that both the sample sizes of the groups and the distribution of indicators within the groups are uniform.

For details, see Section 4.1, Covariate Balancing Algorithm, under 4. Hawking Practice.

  • [Web service diversion based on OpenResty]: For businesses whose server side is not written in Java and that have strict performance requirements, we implemented the experiment diversion functions with Lua scripts on OpenResty in NGINX and exposed them through HTTP interfaces, with an average response time below 1 ms and a 99.99th-percentile latency (p9999) below 20 ms.
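The random diversion method listed above can be sketched as follows. This is a minimal illustration rather than Hawking's implementation: the choice of MD5, the `salt` argument (for example an experiment or layer identifier, so that different experiments split users independently), and the function names are all assumptions.

```python
import hashlib


def bucket_of(uid: str, salt: str, buckets: int = 100) -> int:
    """Map a user identifier to a stable bucket in [0, buckets)."""
    digest = hashlib.md5(f"{salt}:{uid}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def assign_group(uid: str, salt: str, allocation):
    """Pick an experiment group from a list of (group_name, percent) pairs.

    Percentages are consumed in order; a user whose bucket falls beyond
    the allocated percentages is not part of the experiment (None).
    """
    bucket = bucket_of(uid, salt)
    upper = 0
    for name, percent in allocation:
        upper += percent
        if bucket < upper:
            return name
    return None
```

Because the bucket depends only on the salt and the uid, the same user always lands in the same group for a given experiment, for example `assign_group("user-123", "exp42", [("A", 50), ("B", 50)])`.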

3.4 [Diversion data processing service]

The company's unified data collection components are used to collect and process diverted data, and finally store it in HDFS.

3.5 [Indicator Calculation Service]

Independent services are used to efficiently calculate indicator results. They are also equipped with indicator calculation failure retry and monitoring alarm mechanisms, effectively ensuring the success rate of indicator calculation.

3.6 Data Storage

MySQL is mainly used to store business data. Ehcache is used as the main cache for the experiment configuration, and Redis is used as the auxiliary cache. Finally, the experiment diversion data is processed and saved in HDFS for subsequent experimental data analysis.

4. Hawking Practice

The above introduces the development of Hawking and the overall system architecture. Next, we will introduce the problems encountered during the development of the platform and the corresponding solutions.

4.1 Covariate Balancing Algorithm

4.1.1 Problems encountered

When the business side groups experimental subjects, the most common practice is to hash a certain attribute of the subject, take the result modulo 100, and assign the subject to a group according to the result. Although the hash algorithm distributes subjects evenly across the number segments, the resulting groups may still differ in the distribution of certain indicator characteristics, which makes the evaluation of the experimental effect inaccurate. As shown in the following figure (the four colors in the figure represent different groups of people and the corresponding indicator types):


Is there a way to group people evenly while also keeping their corresponding indicators evenly distributed, as shown in the following figure:

4.1.2 Solution

1. Covariate Balancing Algorithm

This algorithm can ensure that the population is evenly grouped and the corresponding indicators of the population are evenly distributed. The whole algorithm consists of three parts, as shown in the following diagram:


(1) Offline stratified sampling

The following three steps are required:

  • Determine core indicators with the business side
  • Use proportional stratification plus a K-means clustering model to complete stratified sampling of the users corresponding to the indicators
  • Write the stratified sampling data into the relevant tables of the Hive database

Here we introduce proportional stratified sampling.

Proportional stratified sampling:

$$n_i = n \cdot \frac{N_i}{N}$$

where $n_i$ is the number of samples to be drawn from the i-th layer, $N_i$ is the total number of samples in the i-th layer, $N$ is the total available traffic, and $n$ is the traffic sample size set for this experiment.

Assume N is 30 million (3kw) and n is 500,000 (50w). Classify users according to the following dimensions:


There are 9 combinations in total. Determine the proportion of each combination in the total (N = 30 million, obtained by screening specific groups from all available traffic):

The number of samples for each layer is then calculated by the formula, giving the number of samples in each classification (total sample size 600,000, i.e. 60w):
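The proportional allocation n_i = n · N_i / N can be sketched in a few lines. The layer names and sizes in the usage note are hypothetical placeholders, not the article's nine combinations, and the largest-remainder rounding step is an assumption added so that the integer allocations sum exactly to n.

```python
def proportional_allocation(layer_sizes, n):
    """Allocate n samples across layers in proportion to layer size.

    layer_sizes maps a layer name to N_i; the result maps the same
    names to n_i, with n_i roughly equal to n * N_i / N.
    """
    total = sum(layer_sizes.values())  # N
    # Integer part of n * N_i / N for every layer.
    alloc = {name: n * size // total for name, size in layer_sizes.items()}
    # Hand the leftover samples to the layers with the largest
    # fractional remainders (largest-remainder rounding).
    leftover = n - sum(alloc.values())
    by_remainder = sorted(
        layer_sizes,
        key=lambda name: (n * layer_sizes[name]) % total,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc
```

With N = 30 million split into three assumed layers of 6, 9, and 15 million users and n = 500,000, `proportional_allocation({"high": 6_000_000, "mid": 9_000_000, "low": 15_000_000}, 500_000)` yields 100,000, 150,000, and 250,000 samples respectively.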


So far, the entire offline stratified sampling work has been completed. Next, we will introduce real-time uniform grouping.

(2) Real-time uniform grouping

There are 4 steps to follow:

  • Data synchronization: The prepared stratified sampling data is synchronized from the relevant Hive tables to Redis by configured scheduled tasks. The data includes the daily mapping from user identifier (uid, the same below) to layer, as well as the daily proportion of users in each layer.
  • Experiment creation: An experiment is created from an experiment ID, the experiment group IDs, and the sample size of each experiment group. The created experiment is associated with the user data of the latest day. The sample size of each experiment group within a layer is the group's sample size multiplied by that layer's user proportion.
  • Experiment diversion: Using the experiment ID and the user identifier (uid), first find the layer the user belongs to, then distribute the user evenly among the experiment groups under that layer, so that the users diverted to each experiment group are evenly distributed across layers. As shown in the following figure:


  • User data deletion: Because this solution synchronizes a large amount of data every day, stale user data must be deleted promptly to improve resource utilization.

The overall flow chart of real-time uniform grouping is as follows:



When performing real-time uniform grouping, we face performance and storage pressure. For this reason, we have designed a high-performance diversion solution and a high-memory utilization user information storage solution.

High-performance diversion solution

We evaluated different Redis data structures combined with Lua scripts to distribute users evenly across the buckets under a layer.

Solution 1

Pre-allocate the sample size for each bucket, and select the bucket with the largest current sample size each time.
Redis structure: HASH; the field is the bucket number and the value is the current sample size of that bucket.

Solution 2

Pre-allocate the sample size for each bucket, and select the bucket with the largest current sample size each time.
Redis structure: SORTED SET; the member is the bucket number and the score is the current sample size of that bucket.

Solution 3

Select the bucket by taking the current sample count of the layer modulo the number of buckets.
Redis structure: HASH

Performance comparison:

Solution 1: highest performance when there are only two buckets (1.05 times that of Solution 3), but performance decreases linearly as the number of buckets increases.
Solution 2: stable performance.
Solution 3: stable performance, 1.12 times that of Solution 2 and 58% of that of a single GET request.

After comprehensive consideration, we chose Solution 3.
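The idea behind Solution 3 can be sketched with an in-memory counter standing in for the Redis HASH. This is only an illustration under assumptions: the class and method names are hypothetical, and the modulo is read as "current layer count modulo the number of buckets". In the real service the read and the increment would have to run atomically, which is what the Lua script provides.

```python
from collections import defaultdict


class LayerBucketAllocator:
    """In-memory stand-in for a per-layer counter kept in a Redis HASH."""

    def __init__(self, num_buckets: int):
        self.num_buckets = num_buckets
        # layer id -> number of users assigned in that layer so far
        self.layer_counts = defaultdict(int)

    def assign(self, layer_id: str) -> int:
        """Return the bucket for the next user arriving in this layer."""
        count = self.layer_counts[layer_id]
        bucket = count % self.num_buckets  # round-robin within the layer
        self.layer_counts[layer_id] = count + 1
        return bucket
```

Since each layer cycles through the buckets in turn, bucket sizes within a layer never differ by more than one user, and the cost per assignment does not grow with the number of buckets.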

User information storage solution with high memory utilization

Comparison of memory consumption for the uid-to-layer mapping:


Solution 1: Store each mapping as a Redis string.
Solution 2: Split the mappings across 10,000 hashes.
Solution 3: Split the mappings into 10,000 first-level buckets, each with 125 second-level buckets.

The comparison assumes a 15-digit uid and a 2-digit layer id, takes the expiration time into account, and ignores both the additional overhead of cluster mode and malloc memory fragmentation and occupancy.

After comprehensive consideration, we chose Solution 3.

(3) Offline analysis and verification

Because the experimental process of the covariate balancing algorithm is relatively complicated, we still collect the data manually to analyze the experimental results.

4.2 Java SDK

4.2.1 Problems encountered

The early Java SDK had limited capabilities and only provided diversion. The access party had to report the diversion result data itself, which made integration costly. As a result, diversion was mainly offered externally through the Dubbo interface, with the platform server responsible for reporting the diversion result data inside the service.

As more and more parties connected, the Dubbo thread pool was frequently exhausted and diversion sometimes failed for network reasons, which degraded the diversion experience and affected the analysis of experimental results. To cope with this, besides continuously optimizing performance, the platform had to keep expanding application servers and other resources, which wasted resources.

4.2.2 Solution

In response, the Hawking development team researched technical solutions thoroughly and upgraded the Java SDK several times to solve these problems. The SDK now provides core functions such as experiment diversion, reporting of diversion results, real-time and incremental updates of the experiment configuration, and SDK self-monitoring, which greatly improves the stability and success rate of diversion.

4.2.3 SDK's six core capabilities

1. Reporting of diversion results

The diversion results are reported within the SDK based on the company's data collection components.

2. A fallback plan for failure to report diversion results

When diversion results are reported, the data link cannot guarantee 100% data integrity. Machine crashes, business service exceptions, network exceptions, and similar failures cause the reporting of experiment diversion data to fail, which directly affects the experiment effect analysis.

How can we ensure that experiment diversion data is never lost under any circumstances? The Hawking Experiment Platform writes the diversion data to disk as a backup measure for abnormal scenarios, thereby ensuring the integrity of the data. The design diagram is as follows:

3. Experimental configuration real-time & incremental update

In addition to pulling the experiment configuration into the business party's local cache through scheduled tasks, the SDK also provides real-time and incremental updates, which suit businesses with high requirements on the timeliness of experiment configuration changes. These modes are controlled by switches and take effect dynamically. The default is real-time incremental updates plus regular full updates, which gives the business party flexibility. The following flowchart shows the real-time and incremental update of the configuration:


For real-time update failures, we designed an exponential backoff strategy. The default long polling interval is 1 s. The interval doubles after each failure, up to a maximum of 60 s, so the sequence is 1, 2, 4, 8, 16, 32, 60; after each success the interval is reset to 1 s.
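The backoff rule can be expressed as a small state machine. This is a sketch with hypothetical class and method names, not the SDK's actual code; only the numbers (start at 1 s, double on failure, cap at 60 s, reset on success) come from the text.

```python
class PollBackoff:
    """Tracks the long polling interval, in seconds."""

    def __init__(self, base: int = 1, factor: int = 2, cap: int = 60):
        self.base = base
        self.factor = factor
        self.cap = cap
        self.interval = base

    def on_failure(self) -> int:
        """Double the interval after a failed poll, capped at `cap`."""
        self.interval = min(self.interval * self.factor, self.cap)
        return self.interval

    def on_success(self) -> int:
        """Reset the interval to the base value after a successful poll."""
        self.interval = self.base
        return self.interval
```

Starting from 1 s, six consecutive failures give intervals of 2, 4, 8, 16, 32, and 60 s; further failures stay at 60 s until a success resets the interval to 1 s.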

In addition, we guarantee eventual consistency of the data: the SDK eventually pulls the latest configuration, and the configuration never rolls back:

  • The caches for experiment information and module information are refreshed serially.
  • For the same change, the experiment information cache is refreshed before the module information cache (when cache refresh messages are sent, the experiment cache refresh message always precedes the module cache refresh message).
  • Version numbers never skip when the module information cache is refreshed: the refresh method takes the version number as an argument and compares it with the version number in the database; if they differ, it logs the mismatch and uses the passed-in version number for this cache refresh.
  • When the SDK pulls the configuration and updates its local copy, it applies the pulled configuration only if its version number is greater than or equal to the local version number.
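The last rule in the list above, which prevents configuration rollback on the SDK side, can be sketched as follows. The data class and field names are hypothetical; only the comparison rule (apply when the pulled version is greater than or equal to the local one) is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigSnapshot:
    version: int
    experiments: dict = field(default_factory=dict)


class LocalConfigCache:
    """Holds the SDK's local copy of the experiment configuration."""

    def __init__(self, snapshot: ConfigSnapshot):
        self.snapshot = snapshot

    def apply(self, pulled: ConfigSnapshot) -> bool:
        """Apply a pulled snapshot unless it is older than the local one."""
        if pulled.version >= self.snapshot.version:
            self.snapshot = pulled
            return True
        return False  # stale response: keep the newer local configuration
```

A delayed response carrying an older version is simply ignored, so an out-of-order pull cannot overwrite a newer local configuration.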

4. Multi-level configuration management

The SDK supports multi-level configuration management with the following priority: method parameter configuration (original) > business configuration center grayscale configuration > business configuration center configuration > remote default configuration > local default configuration. The grayscale configuration applies a setting only to the machines whose IPs are specified in the business configuration center.
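A priority chain like this is commonly resolved by checking each level in order and returning the first value found. The sketch below assumes each level is a plain dictionary; the level names and the `resolve` function are illustrative, not the SDK's API.

```python
_MISSING = object()

# Highest priority first, following the order given in the text.
PRIORITY = [
    "method_param",
    "config_center_gray",
    "config_center",
    "remote_default",
    "local_default",
]


def resolve(key, sources):
    """Return (value, level) from the highest-priority level defining key.

    sources maps a level name to a dict of settings; absent levels are
    skipped. Values such as None or False count as defined.
    """
    for level in PRIORITY:
        value = sources.get(level, {}).get(key, _MISSING)
        if value is not _MISSING:
            return value, level
    raise KeyError(key)
```

For example, if `report_enabled` is set to `False` in the grayscale configuration and `True` in the local defaults, `resolve("report_enabled", sources)` returns the grayscale value.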

5. Traffic diversion strategy

One pain point of using an SDK is that whenever new functions are added, the business side has to upgrade; otherwise the new functions are unavailable, which in turn affects the business. Hawking therefore provides a fallback: when the SDK detects that a newly added diversion strategy does not exist locally, it calls the Hawking server through a Dubbo generic invocation so that diversion still works. This gives the business side enough time to upgrade to the latest version and improves the user experience.
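The fallback can be pictured as a lookup in the strategies bundled with the SDK, with a remote call as the last resort. This is a sketch under assumptions: `DiversionClient` and its arguments are hypothetical, and the `remote_divert` callable merely stands in for the Dubbo generic invocation mentioned above.

```python
class DiversionClient:
    """Diverts locally when the strategy is known, remotely otherwise."""

    def __init__(self, local_strategies, remote_divert):
        # strategy name -> callable(uid, experiment) bundled with this SDK version
        self.local_strategies = local_strategies
        # callable(strategy, uid, experiment); placeholder for the remote call
        self.remote_divert = remote_divert

    def divert(self, strategy, uid, experiment):
        handler = self.local_strategies.get(strategy)
        if handler is not None:
            return handler(uid, experiment)
        # Strategy added after this SDK version was released:
        # delegate to the server instead of failing.
        return self.remote_divert(strategy, uid, experiment)
```

An older SDK therefore keeps serving experiments that use strategies it does not yet implement, at the cost of a network call for those requests.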

6. SDK monitoring alarm

Another pain point is that the SDK runs inside the business party's server process. The platform team cannot see the error messages it prints and has to rely on feedback from the business party, which is passive and makes it impossible to follow up and solve problems promptly. To address this, the Hawking Experiment Platform designed an SDK self-monitoring solution.

Self-monitoring data is pre-aggregated by time granularity and then reported to the general monitoring system through the tracking domain name. Self-monitoring supports the domestic (China), Singapore, and India environments. The monitoring views show whether errors occur per business, experiment, SDK version, and other dimensions, and alarms can be configured on those dimensions so that developers can follow up and solve problems promptly.

4.3 H5 Experiment

4.3.1 Problems encountered

For H5 experiments, the usual industry practice is to develop an H5 SDK and have the business side integrate it on the front end.

This approach has several problems:

  • The business side needs to change front-end code to adapt.
  • The experimental pages or elements need to be masked, because page jumps affect the user experience.
  • When experiment functions change, the business side needs to upgrade the H5 SDK.
  • The entire H5 experiment access cycle is relatively long, and there is a certain access threshold.

4.3.2 Solution

Is there a simple and quick way to onboard an H5 experiment just by configuring the experiment in the management console? The Hawking development team designed a solution that does exactly this. The entire H5 experiment architecture is built on the open source Apache APISIX. When the business party creates an experiment in the Hawking management console, all NGINX-based routing configurations are issued automatically through an interface (in cooperation with work order review). The access party does not need to change any code, so the solution is non-invasive and greatly improves the efficiency of running H5 experiments.

Some terms used below:

  • [Public VUA]: vivo unified access, vivo's unified access layer. It can be understood as the successor to NGINX and is built on the open source APISIX.
  • [Hawking VUA]: A VUA platform built exclusively for Hawking. For H5 experiments, the public VUA proxies the pages under experiment to Hawking VUA, which completes the experiment diversion through the Hawking diversion APISIX plug-in.
  • [VUA Change Platform]: NGINX-based configuration changes are made visually on this platform and then sent to the VUA platforms (Public VUA and Hawking VUA).

(1) Overall timing diagram of H5 experiment


(2) Switching from NGINX to VUA

The public VUA proxies the pages that need to be experimented with to the Hawking VUA, which completes the experiment diversion through the Hawking Experiment Platform Diversion APISIX plug-in developed by the company.

Multi-version experiment diversion

1) Introduction to H5 multi-version experiment

In a multi-version experiment on a single URL, different users access the same URL but see different page content. The page versions are published to different machines, and the Hawking experiment platform diverts each user to the resources of the corresponding version.

2) H5 multi-version experiment diversion principle

  • The public VUA proxies the static resource requests corresponding to the multi-version experiment to the Hawking VUA.
  • Hawking VUA selects upstream according to the experimental configuration through the APISIX plug-in and proxies to the corresponding static resource server.

3) Flowchart



  1. Multi-version experiment diversion
  2. Multi-page experiment diversion
  3. Hawking Experiment Platform Diversion APISIX Plugin
  4. H5 Experiment Diversion Data Collection

Multi-page experiment diversion

1) Introduction to H5 multi-page experiment

Experiment with multiple different URLs. Through Hawking diversion, different users access different URL pages.

2) H5 multi-page experiment principle

  • The public VUA proxies the static resource request of the entry business path corresponding to the multi-page experiment to the Hawking VUA.
  • Hawking VUA rewrites the path to different pages according to the experimental configuration through the APISIX plug-in.

3) Flowchart


Hawking Experiment Platform Diversion APISIX Plugin

The process diagram is as follows:



Plugin development specification reference:

https://apisix.apache.org/zh/docs/apisix/plugin-develop

H5 Experiment Diversion Data Collection

The diversion data of the H5 experiment is saved in the access_log of the Hawking VUA platform, and is finally stored in the DW table of the Hive database through the following steps for subsequent data analysis.


5. Experimental Effect Analysis

This module includes indicator services, data analysis and effect display, quasi-real-time indicator calculation, AA analysis and other functions, which will not be expanded here due to limited space.

6. Summary and Outlook

This article introduced how vivo built A/B experimentation into a platform and a product, and how it is used in practice. The platform delivers the following value and capabilities:

  1. Users can complete the closed loop of creating experiments, analyzing data, making decisions, and adjusting experiments on the platform, which is simple to operate and highly flexible;
  2. It provides scientific and reliable multi-layer traffic diversion algorithms, so traffic can be reused and product solutions, operation strategies, and optimization algorithms can be verified quickly without releasing a version;
  3. It provides real-time experiment diversion monitoring, hourly indicator monitoring, and offline data analysis;
  4. It supports custom indicators, so results can be checked immediately without waiting for analysts to develop reports.

However, problems remain, for example in user experience. In the future, we will focus on improving the experiment process and data service functions such as indicator configuration (consolidating common indicators, simplifying indicator configuration) and data display (interaction optimization, multi-dimensional analysis, attribution analysis), so as to keep improving the user experience.

References:

  • Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
  • Application of AB Experiment in Didi Data Driven
  • Java™ Servlet Specification (2.3.3.3 Asynchronous processing)
  • GitHub: apolloconfig/apollo
  • APISIX Documentation
  • APISIX plugin development documentation
  • OpenResty Documentation
