1. Introduction
After a period of unchecked growth and early-mover dividends, Internet companies have gradually placed more emphasis on scientific, fine-grained product development, shifting from extensive to intensive growth. In the United States, data-driven growth methodologies such as growth hacking help technology giants such as Google, Microsoft, and Facebook sustain business growth; in China, fine-grained data operations and A/B experiment analysis have gradually become the consensus and a core means of driving effective business growth. The A/B testing platform, as a typical representative, has become an indispensable core tool for mainstream domestic companies, improving both traffic conversion efficiency and the iteration efficiency of product and R&D teams.
In the past few years, vivo Internet has continued to attach importance to scientific, experiment-based decision-making: every user-facing change must be based on corresponding experimental conclusions. For example, changing the background color of the top ad or testing a new ad click-through rate (CTR) prediction algorithm must both go through experiments, so a powerful A/B experiment platform is essential. The vivo Hawking Experiment Platform (hereinafter referred to as Hawking) has grown from a single system into a company-level, one-stop platform for A/B experiment-related problems, helping the company's core Internet businesses run fast, accurate experiments and efficiently drive business growth.

2. Project Introduction
2.1 A/B Experiment
In the Internet field, an A/B experiment usually refers to an iterative method for guiding improvements to existing products or services. Take improving the order conversion rate of a product as an example: for the A/B experiment, we designed a new order page whose layout and copywriting were adjusted relative to the original page.
We randomly divided user traffic into two groups, A and B (corresponding to the new and old pages respectively): 50% of users saw the version A page and 50% saw the version B page. After a period of observation and statistics, the order conversion rate of version A users was 70%, higher than the 50% of version B. We therefore concluded that version A was effective and rolled the new page out to all users. This is a concrete application of A/B testing to iterate on product features. We divide the complete life cycle of an A/B experiment into three stages:
2.2 Layered experiment model
Hawking's layered experiment model was designed with reference to Google's overlapping experiment framework paper, "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation".
2.3 Platform development and its application and value in vivo business
Hawking was launched in 2019. After more than three years of development, the platform runs more than 900 experiments per day on average, with a peak of over 1,000.
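To illustrate the layered model in 2.2, here is a minimal sketch in Python. It assumes that each layer hashes the user ID together with a layer-specific salt, so that a user's bucket in one layer is approximately independent of their bucket in another, in the spirit of the overlapping experiment framework. The hash function (SHA-256), the bucket count of 100, and the names `bucket_for` and `assign` are illustrative assumptions, not Hawking's actual implementation.

```python
import hashlib

BUCKETS = 100  # assumed number of buckets per layer


def bucket_for(user_id: str, layer_id: str) -> int:
    """Map a user to a bucket in [0, BUCKETS) for one layer.

    Salting the hash with the layer ID means the same user usually lands
    in unrelated buckets in different layers.
    """
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign(user_id: str, layer_id: str, experiments):
    """Return the experiment group that owns the user's bucket, or None.

    `experiments` is a list of (name, start_bucket, end_bucket_exclusive)
    tuples that share the traffic of a single layer.
    """
    bucket = bucket_for(user_id, layer_id)
    for name, start, end in experiments:
        if start <= bucket < end:
            return name
    return None


# Hypothetical layers: one for page layout, one for a ranking algorithm.
ui_layer = [("old_page", 0, 50), ("new_page", 50, 100)]
algo_layer = [("ctr_model_v1", 0, 90), ("ctr_model_v2", 90, 100)]

ui_group = assign("user-42", "layer-ui", ui_layer)
algo_group = assign("user-42", "layer-algo", algo_layer)
```

Because each layer uses its own salt, the same user can take part in one experiment per layer at the same time, which is what allows many experiments to share the same traffic.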

3. Hawking System Architecture
3.1 Experimenters
Experimenters span multiple roles; business parties run experiments, manage indicators, and analyze experiment results in the Hawking management backend.
3.2 [Experiment Portal]
The portal includes two functions: experiment management and experiment effect analysis.
3.2.1 Experiment management
The platform provides a visual page where business parties configure experiments, select diversion strategies, allocate traffic, and manage whitelists.
3.2.2 Experiment effect analysis
It includes the following four core capabilities:
1. Indicator management
Different experiments focus on different indicators. To automate effect evaluation, the platform provides indicator configuration and integration capabilities.
[Must-see indicators]: Usually core business indicators. Every experiment must ensure that none of these indicators shows an obvious negative effect. The platform integrates the big data indicator management system, and these indicator results directly reuse its data services.
[Personalized indicators]: Usually indicators for ad hoc analysis within an experiment; for example, a banner style experiment observes the exposure and click-through rate of a specified banner. The platform lets users configure custom indicators and automatically generates computing tasks through the big data computing platform, so that custom indicator data is output automatically.
2. Comparative analysis and significance conclusions
The platform provides visualization components such as comparative analysis and significance conclusions for displaying effect evaluation. They show the experimenter the overall improvement of each experimental scheme over the control scheme and the daily increase or decrease, along with the confidence interval and significance conclusion for each indicator.
3. AA analysis
AA analysis helps experimenters verify, before the experiment, whether core business indicators differ significantly among the populations that actually enter the different schemes, and so helps determine whether the experimental conclusions are reliable.
4. Real-time monitoring of diversion
The real-time diversion effect is shown directly, so abnormal traffic can be handled manually in a timely manner.
3.3 [Experimental diversion service]
1. Multi-terminal access service
The platform provides access options for different business needs: an Android SDK for Android clients, a Java SDK for servers, an NGINX-based H5 experiment service for traffic diversion, Dubbo/HTTP services, and a C++ SDK that is yet to be built.
2. Experimental diversion method
The platform provides stable and efficient online real-time diversion services.
The covariate balancing algorithm ensures that group sample sizes are uniform and that indicator distributions are also uniform across groups. For details, see Section 4.1 of this article (Covariate Balancing Algorithm).
3.4 [Diversion data processing service]
The company's unified data collection components collect and process diversion data, which is finally stored in HDFS.
3.5 [Indicator calculation service]
Independent services calculate indicator results efficiently. They include retry on calculation failure and monitoring alarms, which safeguard the success rate of indicator calculation.
3.6 Data storage
MySQL is mainly used to store business data. Ehcache is the main cache for experiment configuration, with Redis as the auxiliary cache. The data diverted by experiments is processed and saved in HDFS for subsequent experimental data analysis.

4. Hawking Practice
The sections above introduced the development of Hawking and the overall system architecture. Next, we introduce the problems encountered during platform development and the corresponding solutions.
4.1 Covariate Balancing Algorithm
4.1.1 Problems encountered
When a business side groups experimental subjects, the most common practice is to hash some attribute of the subject, take the result modulo 100, and assign groups according to that value. Although the hash spreads subjects uniformly over the resulting number segments, the subjects in different groups may still be unevenly distributed in certain indicator characteristics after grouping, which makes the evaluation of the experimental effect inaccurate. As shown in the following figure (the four colors represent different groups of people and the corresponding indicator types):
Is there a way to group people uniformly while also ensuring that their corresponding indicators are evenly distributed, as shown in the following figure:
4.1.2 Solution
1. Covariate Balancing Algorithm
This algorithm ensures that the population is evenly grouped and that the corresponding indicators are evenly distributed across groups. The whole algorithm consists of three parts, as shown in the following diagram:
(1) Offline stratified sampling
The following three steps are required:
Here we introduce equal-proportion stratified sampling.
Equal-proportion stratified sampling: n_i = n × N_i / N, where n_i is the number of samples to be drawn from the i-th layer; N_i is the total number of samples in the i-th layer; N is the total available traffic; and n is the number of traffic samples set for this experiment.
Assume N is 30 million and n is 500,000. Classify according to the following dimensions:
There are 9 combinations in total. Determine the proportion of each combination in the total (total N = 30 million, obtained by screening specific groups from all available traffic):
The number of samples in each layer is calculated by the formula; the number of samples in the corresponding classification (total sample size 600,000):
This completes the offline stratified sampling work. Next, we introduce real-time uniform grouping.
(2) Real-time uniform grouping
There are 4 steps to follow:
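For the offline stratified sampling in (1), equal-proportion allocation draws from each layer in proportion to its share of the available traffic, n_i = n × N_i / N. The sketch below works through that arithmetic in Python. The three strata and their sizes are hypothetical and chosen only so that they sum to N = 30 million; the article's nine combinations and their proportions are not reproduced here. The largest-remainder rounding step is an added assumption so that the per-layer counts always sum exactly to n.

```python
def proportional_allocation(stratum_sizes, n):
    """Split a sample of size n across strata in proportion to their sizes.

    stratum_sizes maps a stratum name to N_i; the return value maps the
    same names to n_i. Integer arithmetic avoids floating-point drift.
    """
    total = sum(stratum_sizes.values())
    if total <= 0 or n < 0:
        raise ValueError("need a positive population and a non-negative n")

    alloc, remainders = {}, {}
    for name, size in stratum_sizes.items():
        # floor(n * N_i / N) plus the remainder used for rounding below
        alloc[name], remainders[name] = divmod(n * size, total)

    # Hand the few leftover samples to the strata with the largest remainders.
    leftover = n - sum(alloc.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical strata summing to N = 30,000,000, with n = 500,000.
strata = {"stratum_a": 15_000_000, "stratum_b": 9_000_000, "stratum_c": 6_000_000}
samples = proportional_allocation(strata, 500_000)
```

With these assumed sizes the layers hold 50%, 30%, and 20% of the traffic, so the allocation is 250,000, 150,000, and 100,000 samples respectively.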
The overall flow chart of real-time uniform grouping is as follows:
When performing real-time uniform grouping, we face performance and storage pressure. For this reason, we designed a high-performance diversion solution and a user information storage solution with high memory utilization.
High-performance diversion solution
We use different Redis data structures and Lua scripts to distribute traffic uniformly across the buckets under a layer.
Solution 1: Pre-allocate the sample size for each bucket, and select the bucket with the largest current sample size each time.
Solution 2: Pre-allocate the sample size for each bucket, and select the bucket with the largest current sample size each time.
Solution 3: Select the hit bucket by taking the current layer sample size modulo the bucket count.
Solution 1 has the highest performance when there are only two buckets, 1.05 times that of Solution 3, but its performance decreases linearly as the number of buckets increases. After comprehensive consideration, we chose Solution 3.
Design of the user information storage solution with high memory utilization
Comparison of memory consumption for uid-to-layer storage
Solution 1: Use Redis string storage. Assume the uid is 15 digits and the layer id is 2 digits; the expiration time is taken into account, while the additional consumption in cluster mode and malloc memory fragmentation and occupancy are not considered.
After comprehensive consideration, we chose Solution 3.
(3) Offline analysis and verification
Because the experimental process of the covariate algorithm is relatively complicated, we still analyze the experimental results through manual data collection.
4.2 Java SDK
4.2.1 Problems encountered
The early Java SDK had weak capabilities and provided only diversion. The access party had to report the diversion result data itself, which imposed a high cost on the access party.
Therefore, external diversion was mainly served through the Dubbo interface, and the platform server reported the diversion result data within the service. As more and more parties connected, the Dubbo thread pool was frequently exhausted or diversion failed for network reasons, resulting in a poor diversion experience and affecting the analysis of experimental results. Besides continuously optimizing performance, the platform also had to keep expanding application servers and other resources, which wasted resources.
4.2.2 Solution
In response, the Hawking development team researched technical solutions thoroughly and upgraded the Java SDK several times, which resolved the above problems. The SDK now has core functions such as experimental diversion, reporting of diversion results, real-time and incremental updates of experimental configuration, and SDK self-monitoring, which greatly improve the stability and success rate of diversion.
4.2.3 The SDK's six core capabilities
1. Reporting of diversion results
The diversion results are reported within the SDK based on the company's data collection components.
2. A fallback plan for failure to report diversion results
When diversion data is reported, the data link cannot guarantee 100% data integrity. Machine crashes, business service anomalies, network anomalies, and similar failures cause diversion data reporting to fail, which directly affects the experimental effect analysis. How can we ensure that experimental diversion data is never lost? Hawking persists diversion data to disk as a backup measure for abnormal scenarios, to guarantee the integrity of the data. The design diagram is as follows:
3. Real-time and incremental updates of experimental configuration
In addition to pulling the experimental configuration into the business party's local cache through scheduled tasks, the SDK provides real-time and incremental updates, which suit businesses with strict requirements on how quickly configuration changes take effect. They are controlled by switches and take effect dynamically. The default is real-time incremental update plus regular full update, which the business party can adjust flexibly. The following is a flowchart of real-time and incremental configuration updates:
If a real-time update fails, an exponential backoff strategy applies: the default long-polling interval is 1s, and the interval doubles after each failure up to a maximum of 60s, so the sequence is 1, 2, 4, 8, 16, 32, 60; after each success the interval is reset to 1s.
In addition, we ensure eventual consistency of the data: the SDK eventually pulls the latest configuration, and no configuration rollback occurs:
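The failure backoff rule for real-time updates described above (start at 1, double on each failure, cap at 60, reset on success) can be sketched in a few lines of Python. The class name `PollBackoff` and its method names are hypothetical, and the sketch assumes the interval is measured in seconds; it models only the interval bookkeeping, not the long-polling request itself.

```python
class PollBackoff:
    """Track the long-polling interval with capped exponential backoff."""

    def __init__(self, base=1, factor=2, cap=60):
        self.base = base        # interval after a success (assumed seconds)
        self.factor = factor    # multiplier applied after each failure
        self.cap = cap          # upper bound on the interval
        self.interval = base

    def on_failure(self):
        """Grow the interval after a failed update, never exceeding the cap."""
        self.interval = min(self.interval * self.factor, self.cap)
        return self.interval

    def on_success(self):
        """Reset the interval after a successful update."""
        self.interval = self.base
        return self.interval


backoff = PollBackoff()
# The initial interval followed by six consecutive failures.
schedule = [backoff.interval] + [backoff.on_failure() for _ in range(6)]
```

Here `schedule` is `[1, 2, 4, 8, 16, 32, 60]`, matching the sequence given in the text; further failures keep the interval at 60 until a success resets it to 1.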
4. Multi-level configuration management
The SDK supports multi-level configuration management, with the priority: method parameter configuration (original) > business configuration center grayscale configuration > business configuration center configuration > remote default configuration > local default configuration. Grayscale configuration in the business configuration center means enabling a grayscale rollout by configuring specified machine IPs in the configuration center.
5. Traffic diversion strategy
One pain point of using an SDK is that after new functions are added, the business side must upgrade; otherwise the new functions cannot be used, which in turn affects the business. For this reason, Hawking designed a fallback: when the SDK detects that a newly added strategy does not exist locally, it calls the Hawking server through Dubbo generic invocation so that diversion still works. This gives the business side sufficient time to upgrade to the latest version and improves the user experience.
6. SDK monitoring alarm
Another pain point is that the SDK runs inside the business party's server process. The error information it prints is not directly visible to the Hawking team, which has to rely on feedback from the business party; this is passive and prevents problems from being followed up promptly. To address this, Hawking designed an SDK self-monitoring solution. After pre-aggregation at a given time granularity, self-monitoring data is reported to the general monitoring system through the tracking domain name. Self-monitoring supports the domestic, Singapore, and India environments.
Through monitoring, we can see directly whether errors occur along dimensions such as business, experiment, and SDK version, and configure alarms on those dimensions so that developers can follow up and resolve problems promptly.
4.3 H5 Experiment
4.3.1 Problems encountered
When running H5 experiments, the usual industry practice is to develop an H5 SDK and have the business side integrate it on the front end. This has several problems:
4.3.2 Solution
Is there a simple, quick way to complete access for an entire H5 experiment just by configuring the experiment in the backend? The Hawking development team designed a solution to this problem. The entire H5 experiment architecture is built on the open-source APISIX. When a business party creates an experiment in the Hawking management backend, all NGINX-based routing configurations are issued automatically through the interface (in conjunction with work order review). The access party does not need to change any code, so the approach is non-invasive and greatly improves the efficiency of running H5 experiments. Some terms are explained below:
(1) Overall sequence diagram of the H5 experiment
(2) Switching from NGINX to VUA
The public VUA proxies the pages under experiment to the Hawking VUA, which completes experiment diversion through the Hawking Experiment Platform diversion APISIX plugin developed by the company.
Multi-version experiment diversion
1) Introduction to the H5 multi-version experiment
In an experiment on a single URL, different users access the same URL but receive different page content, because the multi-version experiment publishes the resources of each page version to different machines; Hawking diversion decides which resources each user accesses.
2) H5 multi-version experiment diversion principle
3) Flowchart
Multi-page experiment diversion
1) Introduction to the H5 multi-page experiment
The experiment covers multiple different URLs. Through Hawking diversion, different users access different URL pages.
2) H5 multi-page experiment principle
3) Flowchart
Hawking Experiment Platform diversion APISIX plugin
The process diagram is as follows:
Plugin development specification reference: https://apisix.apache.org/zh/docs/apisix/plugin-develop
H5 experiment diversion data collection
The diversion data of the H5 experiment is saved in the access_log of the Hawking VUA platform and is finally stored in the DW table of the Hive library through the following steps for subsequent data analysis.

5. Experimental Effect Analysis
This module includes indicator services, data analysis and effect display, quasi-real-time indicator calculation, AA analysis, and other functions, which are not covered here due to limited space.

6. Summary and Outlook
This article mainly introduced the platform-oriented and product-oriented construction and practice of A/B experiments at vivo, achieving the following values and capabilities:
However, problems remain, such as user experience. In the future, we will focus on optimizing the experimental process and data service functions such as indicator configuration (solidifying common indicators, simplifying indicator configuration) and data display (interaction optimization, multi-dimensional analysis, attribution analysis), to continuously improve the user experience.