The method used by BAT, detailing the pitfalls of A/B testing!

An important idea in the growth hacking playbook is the "AB experiment".

In a sense, nature has already given us plenty of inspiration. To adapt to a changing environment, biological populations undergo genetic mutation every day; the fittest survive, and the best genes are passed on. This ingenious biological algorithm is probably the most successful AB experiment ever arranged by the Creator.

Turning to the Internet world, AB experiments have never been more important.

01 Let’s look at two famous cases

Case 1: Obama's publicity team used AB experiments to help him gain higher support

In 2008, Obama won the election and became the 44th President of the United States. That owed much to his personal charisma, but the role of his campaign publicity team cannot be ignored. On the campaign page, his team used an AB experiment to find the best of 16 candidate designs, raising the page's sign-up conversion rate by 40.6%.

(Figure 1)

(Figure 2)

The experiment was designed as follows: each of the four pictures or videos in Figure 1 is paired with each of the four button texts in Figure 2, giving 4*4 = 16 different solutions. Each solution receives a share of the traffic; after a period of observation, the solution with the highest conversion rate is selected and rolled out to all users.

In the end, the following solution won:

The explanation the team gave afterwards was: video playback puts pressure on users, and the network conditions of the time could not guarantee smooth playback, so video performed worse than pictures. In addition, Americans value family, and a warm family photo brings the candidate closer to voters. As for the button copy, American voters pride themselves on independent thinking; copy like "Join us" and "Sign up" feels blunt and smacks of incitement, so the milder "Learn more" was more acceptable.
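To make the mechanics of this kind of multi-variant experiment concrete, here is a minimal Python sketch: it enumerates the 4*4 combinations, deterministically assigns each visitor to one of them, and picks the current leader by observed conversion rate. The media names, button texts, and conversion numbers are illustrative assumptions, not the campaign's actual data.

```python
import hashlib
import random
from collections import defaultdict

# Illustrative assumptions: 4 media options x 4 button texts = 16 variants.
# These names and numbers are made up for the sketch, not the campaign's real data.
MEDIA = ["family_photo", "speech_photo", "rally_video", "interview_video"]
BUTTONS = ["Sign Up", "Join Us Now", "Learn More", "Sign Up Now"]
VARIANTS = [(m, b) for m in MEDIA for b in BUTTONS]  # 4 * 4 = 16 combinations

impressions = defaultdict(int)   # visitors who saw each variant
conversions = defaultdict(int)   # visitors who converted under each variant

def assign_variant(user_id: str) -> tuple:
    """Deterministically map a visitor to one of the 16 variants."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return VARIANTS[h % len(VARIANTS)]

def record(user_id: str, converted: bool) -> None:
    """Log one visit and whether it converted."""
    v = assign_variant(user_id)
    impressions[v] += 1
    conversions[v] += int(converted)

def current_winner() -> tuple:
    """Variant with the highest observed conversion rate so far."""
    return max(impressions, key=lambda v: conversions[v] / impressions[v])

# Simulated traffic with a made-up 10% baseline conversion rate.
for i in range(10_000):
    record(f"visitor_{i}", converted=random.random() < 0.10)
print("Leading variant so far:", current_winner())
```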

Case 2: Facebook used AB experiments to save 20% of losses

In 2012, with Zuckerberg's strong support, Facebook's product VP Sam Lessin led a team of 30 people and spent more than half a year developing a new version. According to evaluations from invited external users and internal employees before launch, the new version was cool, fashionable, and much better looking than the old one. As shown in the figures:

(The picture above is the old version)

(The above picture is the new version)

Facebook is indeed a world-class Internet company: major iterations always go through AB experiments. They first allocated 1% of traffic to the new version, then gradually increased it to 2%, 5%... The results were beyond everyone's expectations: the new version lagged seriously behind the old one on four core metrics, including user engagement, time online, number of ad impressions, and revenue. At first everyone assumed users simply weren't used to it yet, but as the new version's traffic grew to 12% and the observation window stretched to 3 months, the picture stayed grim; the new version had directly caused a 20% drop in revenue. In the end Facebook took drastic action, rolled every user back to the old version, and the data recovered.

In China, a well-known social networking site for college students saw Facebook's Plan B, which was still being tested on a small slice of traffic, copied it outright, and rushed it to full traffic. You all know the result: by now that site has become a third-rate Internet product.

This shows that a failed product plan is not scary; what is scary is a company system and culture that ships to full traffic without ever running an AB experiment.

Let’s take a look at an AB experiment case in a domestic first-tier company!

The figure above shows an AB experiment on two guide-card styles. The final result: style 2 lifts CTR by 24.8% relative to style 1.

02 Does your team have such problems?

1. Plans go straight to full traffic without an AB experiment. After launch, team members desperately hunt for data that proves they were right; even if the evidence is far-fetched, as long as the external message is "the metric has improved again", everyone gives a thumbs up. You should know that Google, Facebook, and Microsoft's experience with AB experiments is that about 90% of new designs are no better than the versions already online. Even if your team is awesome, it can't beat Google, Facebook, and Microsoft, right?

2. Your team has a lot of ideas, but everyone sticks to their own opinions and no one can convince anyone else, which makes team decision-making very difficult.


Team change can start with the first AB experiment: whoever's plan performs better is the plan that goes to full traffic. Instead of arguing, set up an AB experiment and let the data compete.

The following article explains in detail the basic concepts and common pitfalls of AB experiments.

03 What is an AB experiment?

For example, suppose you propose a product improvement plan (call it B), but you are not sure whether it performs better than the version already online (call it A). You allocate 1% of online user traffic to B and 99% to A, and observe for a period of time. If B beats A, you push B to 100% of traffic; if A beats B, you revise your design and run the experiment again. If you launch a new plan without an AB experiment, then, as in the Facebook case, the new plan may even ruin your product. The "plan" here may be an algorithm, a piece of copy, an operational campaign, or a UI style. Likewise, an experiment is not limited to two plans A and B; it can be an ABCDE... experiment.
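As a concrete illustration, here is a minimal Python sketch of the traffic split described above: each user is deterministically hashed into one of 100 buckets, one bucket goes to plan B and the remaining 99 stay on plan A. The bucket count, salt, and function names are assumptions made for this sketch, not any specific company's implementation.

```python
import hashlib

def bucket(user_id: str, salt: str = "exp_homepage_v1", buckets: int = 100) -> int:
    """Deterministically map a user to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def assign_plan(user_id: str) -> str:
    """1% of users see plan B; the other 99% keep plan A."""
    return "B" if bucket(user_id) == 0 else "A"

# The same user always lands in the same plan across requests,
# so their experience stays consistent for the whole experiment.
print(assign_plan("user_42"))
```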

04 Problems encountered in AB experiments

Implementing an AB experiment is certainly not as simple as the example above. For instance, you may run into the following problems:

1. How do you ensure that the 1% of traffic has the same user-characteristic distribution as the other 99%?

2. If a new idea C comes up during the experiment, can it be put online and tested at the same time?

3. How do you run multiple experiments in parallel when their combined traffic demand exceeds 100%?

4. Which metrics should be used to judge plans A and B? If different metrics point in different directions, how do you decide?

5. How do you determine whether the metric difference between Plan B and Plan A is just random error or is statistically reliable? (A sketch of a simple significance test follows this list.)
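For question 5, one common approach (shown here as a hedged sketch, not the only valid method) is a two-proportion z-test on the conversion rates of the two plans. The function name, sample sizes, and conversion counts below are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0: no real difference
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Made-up example: A converts 500/10000, B converts 570/10000.
p_value = two_proportion_z_test(500, 10_000, 570, 10_000)
print(f"p-value = {p_value:.4f}")   # a small p-value suggests the gap is not just random error
```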

The basic principle of AB experiment is the "control variable method".

Assume that metric value = F({hidden variables}, {observed variables, including the plan variable}). A metric's behavior is jointly determined by the function F and the values of many variables, so measured differences in a metric cannot simply be attributed to differences between plans, especially since there are many hidden variables, unknown to us, that also have an effect.

So do we need to know F and every variable before we can draw a conclusion? There is an easier way: keep all the other variables consistent across the two plans, so that any metric difference between plans A and B can only be attributed to the difference in plans. The AB experiment applies this control-variable idea by testing each plan on homogeneous populations (with the same characteristic distribution) at the same time, ensuring that every variable except the plan variable is held constant. The metric difference can then be attributed to the plans themselves, and the winning version can be rolled out to full traffic to drive data growth.

AB experiments are very useful, but implementing them is not simple, and there are many pitfalls along the way.

05 What are the pitfalls of AB experiments?

1. Different people

AB experiments require dividing traffic among different plans. If the traffic cannot be split so that the user groups assigned to the different plans have the same characteristics, the experiment is meaningless. To make this easier to understand, let's look at an example:

Suppose we want to run an AB experiment on group G to find out which gift best increases the registration conversion rate. Plans A and B give out different prizes: BB cream and a razor, respectively. G consists of subgroups G1 and G2 (girls and boys, each 50%). To keep the users homogeneous, the female-to-male ratio in the traffic allocated to each plan must match the overall ratio, that is, female:male = 1:1.

At this time, something unexpected happened...

In the experiment, unfortunately, the users assigned to Plan A were all from G1 (girls) and the users assigned to Plan B were all from G2 (boys). In the end one prize had a higher registration conversion rate than the other, say A higher than B. Can we then conclude that "prize A is more popular with users than prize B, so prize A should be given to all users"?

Definitely not. That decision amounts to assuming that whatever girls like, boys like too. Following the "experimental conclusion", you would give prize A, with its higher registration conversion rate, to all of G. Just imagine how the boys feel when they receive BB cream.

The problem here is that the people allocated to the different plans are different in nature. The example above is deliberately extreme for ease of understanding. In practice we more often see cases where both Plan A and Plan B get a mix of men and women, but in a ratio different from the overall 1:1 distribution, which also leads to wrong experimental conclusions.

Therefore, designing a sound traffic-splitting algorithm that gives every plan the same user-characteristic distribution is a prerequisite for trusting an AB experiment's conclusion. After more than a year of exploration, the Darwin AB experiment system has developed a fairly reliable traffic-splitting algorithm.
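The Darwin system's algorithm is not described here; as a hedged illustration of one common approach, the sketch below hashes users with an experiment-specific salt and then checks whether a key attribute (gender, as in the gift example) stays balanced across the two arms. All names, salts, and data are assumptions made for the sketch.

```python
import hashlib
from collections import Counter

def assign(user_id: str, salt: str = "gift_exp_2024") -> str:
    """Deterministic 50/50 split using an experiment-specific salt."""
    h = int(hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest(), 16)
    return "A" if h % 2 == 0 else "B"

def gender_mix(users: list) -> dict:
    """Count the gender composition of each arm and of the whole population."""
    mix = {"A": Counter(), "B": Counter(), "all": Counter()}
    for u in users:   # each user: {"id": ..., "gender": "F" or "M"}
        arm = assign(u["id"])
        mix[arm][u["gender"]] += 1
        mix["all"][u["gender"]] += 1
    return mix

# Made-up population matching the example: 50% girls (F), 50% boys (M).
population = [{"id": f"u{i}", "gender": "F" if i % 2 else "M"} for i in range(10_000)]
print(gender_mix(population))   # each arm should stay close to the overall 1:1 ratio
```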

2. Experiments at different times

In the example above, even if Plan A and Plan B are each assigned users from G with the same characteristic distribution, is the data necessarily comparable? Not necessarily. Let's use an extreme example. Suppose that on day one Plan A gets 1 million users of traffic and Plan B gets none, while on day two Plan A gets none and Plan B gets 1 million. Over the two days, each plan has accumulated 1 million users of experimental traffic, and the populations are homogeneous, so the results should be credible. Reality says otherwise. If this is a social networking site and the experiment measures how many users actively add friends under product versions A and B, then Plan A has a large built-in advantage: its users had an extra day to add friends. In that case B trails in every time slice, and the gap has nothing to do with the plans themselves. A blog site comparing the rates at which users open and write posts under different plans could make the same mistake.

Another situation: on certain special days, user activity temporarily spikes. If Plan A happens to run on a holiday and Plan B does not, the comparison is obviously unfair to Plan B.

In the formula mentioned above, "metric value = F({hidden variables}, {observed variables, including the plan variable})", many of the hidden and observed variables are time-dependent. Their values differ at different times, which breaks the premise of the control-variable method and makes it impossible to draw a correct conclusion.

Finally, here is a case we were involved in, to give you a feel for it:

Copywriting for Style 1: "Sunflower Manual" helps you use XXX easily
Copywriting for Style 2: I’ll tell you which features are the most popular

Because experiment-management practices had not yet been standardized, the two styles did not start at the same time:
1. Style 1, start the experiment at 10:00 on April 7
2. Style 2, start the experiment at 0:00 on April 7

The final statistics showed mixed results:
If we look only at users who entered the experiment after 10:00 on April 7, the CTR of style 2 is about 0.3% higher than that of style 1; this satisfies the same-time premise of the experiment, so the conclusion is credible.
However, if we look at the data for the whole day of April 7, the CTR of style 2 is about 1% higher than that of style 1. This does not satisfy the conditions of the experiment, and the conclusion is not credible.

This also tells us:
1. All experimental versions to be compared (style 1 and style 2 above) must be started at the same time.
2. During the experiment, each version's traffic share must not be changed at will; doing so indirectly causes the same problem.

3. No awareness of AA experiment

The AA experiment is the twin brother of the AB experiment; some Internet companies also call it an idle experiment. "AA" means that all arms of the experiment are identical. What is the point? It tests the accuracy of event tracking, traffic splitting, and experiment statistics, and thereby increases the credibility of AB experiment conclusions.

Suppose Proposition 1 is: "If there is nothing wrong with the experiment's tracking, traffic splitting, and statistics, then the data of every arm in an AA experiment must be consistent." If Proposition 1 is true, then its contrapositive, Proposition 2, must also be true: "If the arms of an AA experiment show significant differences in their data, then at least one of tracking, traffic splitting, and statistics has a problem."

Strictly speaking, passing an AA experiment cannot prove that the three items above (tracking, traffic splitting, and statistics) are completely problem-free, but failing an AA experiment definitely proves that at least one of them has a problem.
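A hedged sketch of what such an AA sanity check might look like in practice: split users into two arms that serve exactly the same experience, then test whether the observed conversion gap is statistically significant. The salt, event format, and threshold are illustrative assumptions.

```python
import hashlib
from math import sqrt
from statistics import NormalDist

def assign_aa(user_id: str, salt: str = "aa_check_2024") -> str:
    """Split users 50/50 into two arms that serve the exact same experience."""
    h = int(hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest(), 16)
    return "A1" if h % 2 == 0 else "A2"

def aa_looks_healthy(logs: list, alpha: float = 0.05) -> bool:
    """True if the two identical arms show no statistically significant gap."""
    conv = {"A1": 0, "A2": 0}
    n = {"A1": 0, "A2": 0}
    for event in logs:                      # each event: {"user_id": ..., "converted": bool}
        arm = assign_aa(event["user_id"])
        n[arm] += 1
        conv[arm] += int(event["converted"])
    p1, p2 = conv["A1"] / n["A1"], conv["A2"] / n["A2"]
    pooled = (conv["A1"] + conv["A2"]) / (n["A1"] + n["A2"])
    se = sqrt(pooled * (1 - pooled) * (1 / n["A1"] + 1 / n["A2"]))
    p_value = 2 * (1 - NormalDist().cdf(abs((p1 - p2) / se)))
    return p_value >= alpha   # a tiny p-value hints at tracking, splitting, or statistics bugs
```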

Therefore, a team with AB experimental literacy will definitely arrange AA experiments before AB experiments.

4. Experimental reversal

If an experiment goes online and on the first day Plan A beats Plan B, does that mean the data will look the same on the second and third days?

When users first enter a new plan, they tend to be more active out of curiosity, but as time passes they calm down and the data returns to its true level. If the observation window is cut off too early, it is easy to draw the wrong conclusion. The reverse also happens: some users are not used to a redesign at first, but once they grow familiar with it they find it more convenient than the old version, and the data gradually recovers.

On the other hand, reversals can also occur when the experiment's sample size is too small. The frequency of heads in 100 coin tosses and in 1 million coin tosses can look quite different. By the law of large numbers, as the number of random trials increases, a random variable's observed frequency converges to its probability. Suppose only 100 users enter the experiment on the first day; with such a small sample, the results are dominated by randomness. As the days pass and the sample grows, the result may well reverse.
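A quick simulation of this effect (illustrative only, not from the original article): with 100 tosses the observed head rate can easily stray several percentage points from 50%, while with a million tosses it stays very close.

```python
import random

def head_rate(n_tosses: int, seed: int = 7) -> float:
    """Observed frequency of heads in n fair coin tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} tosses -> head rate {head_rate(n):.4f}")
# Small samples wobble around 0.5; large samples settle near it,
# which is why tiny experiments can "reverse" as more users arrive.
```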

Generally speaking, we do not recommend AB experiments on products with fewer than 1,000 users in the sample; the reliability of the results is hard to guarantee.

5. Carry-over effect

Remember chemistry class, when the teacher insisted you clean the test tubes first? That isn't just about hygiene. If the reagent being tested mixes with residue left in the tube, what you actually test is the mixture, and the result is certainly unreliable. This is the carry-over effect, also called the lag effect.

The same problem exists in Internet product experiments. For example, suppose users numbered 00001-10000 and 10001-20000 were previously split into plans A and B for one experiment. After that experiment ends, the team starts a new one. Without special handling, users 00001-10000 and 10001-20000 may again be split into the two new plans (A1 and B1). Are the results credible this time? Users 00001-10000 have all been exposed to Plan A and now all land on Plan A1; users 10001-20000 have all been exposed to Plan B and now all land on Plan B1. The two groups may have been homogeneous before the first experiment, but after it they no longer are. To run the second experiment, you must either reshuffle the users with some algorithm, producing a new assignment from which two homogeneous groups are drawn, or use a fresh number segment, such as 20001-30000 and 30001-40000.
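One common way to do this reshuffling (a hedged sketch, not necessarily how any particular team implements it) is to include an experiment-specific salt in the hash, so the bucketing of experiment 2 is statistically independent of the bucketing of experiment 1 and the earlier assignment cannot carry over. The salts and user IDs below are made up.

```python
import hashlib

def assign_arm(user_id: str, experiment_salt: str) -> str:
    """Bucket a user for one experiment; a new salt reshuffles everyone."""
    h = int(hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest(), 16)
    return "A" if h % 2 == 0 else "B"

user = "00042"
print(assign_arm(user, "exp1_gift_2024"))   # assignment in the first experiment
print(assign_arm(user, "exp2_banner_2024")) # independent assignment in the second
# Because the salts differ, users who shared an arm in experiment 1
# are scattered roughly 50/50 across the arms of experiment 2.
```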

These are just the common pitfalls; you will meet more in practice...

Among China's top Internet companies, BAT (Baidu, Alibaba, Tencent), AB experiments have become very common. Baidu has thousands of AB experiments running in parallel, and Alibaba and Tencent also have their own AB experiment systems supporting large-scale parallel experiments across multiple businesses.

"How to root AB experimental culture into the company's genes?" This is the question that the times are asking all Internet companies.

This article was written by @范磊 and compiled and published by Qinggua Media. Please credit the author and source when reprinting!
