The sixth episode of the Aiti Tribe live class: Lean Data Analysis - How to give your company the same analytical capabilities as BAT

Every enterprise hopes to build a large, all-encompassing big data platform, but practice has shown that sustainable big data platforms are built gradually, following the theory of lean data analysis: establish the smallest closed business loop, verify and expand the data analysis platform step by step, and ultimately reach data analysis capabilities on par with BAT (Baidu, Alibaba, Tencent). Along the way, both the core technology and the business analysis goals run into new challenges as they grow. In this session, Analysys CTO Guo Wei shared lean thinking for enterprises building big data platforms, and the growth story of a big data analysis platform serving 520 million monthly active users.

The main topics shared are as follows:

1. Lean data analysis
2. Common lean data analysis scenarios
3. Iteration and expansion of the big data technology framework
4. From lean user analysis to a big data platform

Hello everyone, I am Guo Wei, CTO of Analysys. I am very happy to share with you today, and I hope you take something away from my talk. The topic of my talk today is Lean Data Analysis - how to give your company the same analytical capabilities as BAT.

Let me briefly introduce myself:

Mr. Guo Wei joined Analysys in 2016 as CTO. He built the Analysys technology team and delivered the technical architecture and systems for Analysys's big data collection, platform, and data mining. He built the Analysys hybrid cloud from scratch, upgraded the Analysys SDK, and launched the Analysys Miaosuan real-time computing platform. Today the Analysys big data platform processes 30 TB of data and 25.2 billion records per day, covering 520 million monthly active users.
Mr. Guo Wei graduated from Peking University. Before joining Analysys, he served as Big Data Director at Lenovo Research Institute and General Manager of Wanda E-commerce's Data Department, and held senior big data roles at CICC, IBM, and Teradata. He has distinctive insights into cutting-edge big data research, including software/hardware data integration technologies such as video and smart Wi-Fi.

1. Lean Data Analysis

First, let's talk about the origin of lean data analysis thinking: the Lean Startup. The concept was first proposed by Silicon Valley entrepreneur Eric Ries in his 2011 book "The Lean Startup".
It has three key points: the minimum viable product (MVP), customer feedback, and rapid iteration.

What is Lean Data Analysis?
The core of lean analysis is to start from the smallest closed business loop: close the loop on business results each time, achieve the business goal, and only then expand to the next piece of big data analysis, the next related system, or the next related platform.
• Optimize a minimum viable product instead of setting rigid targets, vs. decision-makers declaring "we want to build a big data project"
• Keep pace with end customers and the business, vs. "platform first, business later"
• Close the business loop so big data analysis feeds back into decisions, vs. "management looks at a dashboard"
• Accelerate / transform / innovate: the biggest challenge lies in changing the corporate culture, and in each pairing above the former must take priority.

Based on my ten-plus years in the data industry, I believe we should not blindly pursue big data for its own sake. Even if such a platform gets built, it will not last long. We must build a lean big data platform strategically.

Important things deserve to be said three times: do not pursue big data aimlessly for its own sake. Even if a big data platform is built, it will not last long. You must build a lean big data platform strategically.
Do not blindly pursue big data. Even if a big data platform is built, it will not last long. You must build a lean big data platform strategically.

Do not aimlessly build big data for the sake of big data. Even if a big data platform is built, it will not last long. You must build a lean big data platform strategically. So how do you build one? I personally suggest starting with Internet/mobile Internet user operations, because in recent years the pain points there are obvious and closed business loops are easier to find.
As you all know, now that the Internet has entered its second half, the days when simply shipping an app, with no campaigns at all, would pull in floods of new users are gone for good. Today even precisely targeted user acquisition may not pay off. So how to further operate the existing user base has become the main business demand.

As you can see, China's population is no longer growing by a few percent a year but by a few tenths of a percent, and mobile Internet user growth is slowing in step. So the question is no longer how to acquire new users, but how to retain existing users and grow the revenue they generate.

Difficulty acquiring customers, difficulty retaining users, and difficulty extracting value are the three major challenges facing Internet operators today.

User life cycle management is an important entry point for lean data analysis: market precisely during acquisition to improve channel ROI, raise ARPU among mature users, and apply every available lever to retain users who are about to leave. This requires analyzing user behavior, user attributes, channel characteristics, and loyalty from many angles.

Among them, customer acquisition, retention, and conversion are the main demands of lean data operations. The figure lists the relevant data analysis indicators for reference.

How do you control the pace of big-data-driven business growth? I suggest four steps. First, unify users and members internally (enterprises should sort out this part themselves, since only they have the clearest view of their own data). Second, build or buy an Internet user life cycle management platform; this is the fastest way to see results and fits lean thinking. Third, build an enterprise big data platform that connects Internet data with internal systems. Fourth, use the accumulated digital assets to offer data services, or go further and upgrade to an enterprise artificial intelligence platform.

2. Common Lean Data Analysis Scenarios

Below we share some commonly used lean data analysis scenarios.


In user-facing lean data analysis, the core methodology of user life cycle management is the AARRR model (Acquisition, Activation, Retention, Revenue, Referral). There is plenty of analysis to do at each step; let me give a few examples of common scenarios:

Finding high-quality channels, improving conversion along key paths, winning back churned users, and improving user retention and activity are some of the most common lean analysis models.
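
To make the model concrete, here is a small illustrative sketch (toy numbers, not Analysys data) of the stage-to-stage conversion computation behind an AARRR-style funnel:

```python
# Toy numbers: count users reaching each AARRR stage and the
# conversion rate from the previous stage.
funnel = [
    ("acquisition", 100_000),  # users arriving from all channels
    ("activation",   42_000),  # completed onboarding / key first action
    ("retention",    18_000),  # came back within 7 days
    ("revenue",       3_600),  # made at least one payment
    ("referral",        900),  # invited at least one new user
]

for (stage, n), (_, prev) in zip(funnel[1:], funnel[:-1]):
    print(f"{stage:<12}{n:>8,}  conversion from previous stage: {n / prev:.1%}")
```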

For every company's operations and marketing departments, finding the right channels and growing the user base are everyday problems. Measuring each channel's quality, conversion, and retention is a typical lean data analysis scenario.

When measuring channels, the analysis can cover new users, retention, and fake-traffic detection. Most channels carry some inflated traffic, whether self-built or purchased. Helping the company cut channel costs and find better-suited channels lets management feel the value of big data directly. My personal experience is that the closer a data analysis loop sits to money, the easier it is to win company recognition. Channel development alone is not enough; user conversion also needs to improve. Below are some commonly used indicators and methods for reference.
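
As a reference point, here is a minimal sketch, assuming a hypothetical per-user acquisition table (the column names and numbers are made up for illustration), of how channel-quality metrics like these might be computed with pandas:

```python
import pandas as pd

# Hypothetical acquisition records: one row per new user,
# with the channel that brought them in.
df = pd.DataFrame({
    "user_id":      [1, 2, 3, 4, 5, 6],
    "channel":      ["A", "A", "B", "B", "B", "C"],
    "retained_d7":  [1, 0, 1, 1, 0, 0],              # active again on day 7
    "flagged_fake": [0, 0, 0, 0, 1, 1],              # hit an anti-fraud rule
    "cost":         [2.0, 2.0, 3.5, 3.5, 3.5, 1.0],  # acquisition cost
})

report = df.groupby("channel").agg(
    new_users=("user_id", "count"),
    d7_retention=("retained_d7", "mean"),
    fake_rate=("flagged_fake", "mean"),
    cost_per_user=("cost", "mean"),
)
print(report)  # compare channels on volume, quality, and cost
```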

This is a problem that every product manager will encounter.

Conversion must be analyzed along each key path to see which users stay and which leave. More importantly, you need to know whether the users who leave go to competitors, and whether the users who stay are your target customers.

This requires each company to build its own user-profile system and gain panoramic behavioral insight into churned customers. Speaking of churn, one very typical function every company builds into a lean big data analysis platform is winning back churned users. Generally, the flow is: define churned users --> analyze the reasons for churn --> run win-back marketing campaigns --> evaluate campaign effectiveness. A minimal sketch of the first step follows.
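
Here is a minimal sketch of defining churned users. The 30-day inactivity threshold and the toy data are assumptions for illustration; each business should choose its own definition:

```python
from datetime import datetime, timedelta

# Toy data: user_id -> time of the user's last recorded event.
last_active = {
    "u1": datetime(2018, 6, 1),
    "u2": datetime(2018, 4, 2),
    "u3": datetime(2018, 5, 20),
}

now = datetime(2018, 6, 15)
churn_window = timedelta(days=30)  # assumed threshold; tune per business

# A user is "churned" if their last activity is older than the window.
churned = [uid for uid, ts in last_active.items() if now - ts > churn_window]
print(churned)  # candidates for the win-back campaign -> ['u2']
```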

For each campaign, you must carefully evaluate whether it effectively reached the group you defined and whether it effectively retained customers. I have only sketched a few scenarios above; there are many more in practice, and every practitioner should design scenarios suited to their own company.

3. Iteration and Expansion of the Big Data Technology Framework

Next, I will talk about the technical pitfalls that lean big data analysis has to fill. Every data analysis effort actually runs through the chain collection --> reception --> computation --> query --> mining --> service, as the toy sketch below illustrates.
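
As a toy illustration of that chain (the function names are hypothetical, not Analysys's actual components), the first three stages might look like:

```python
import json
import time

def collect(event_name, props):
    # SDK side: package the event with a client timestamp.
    return {"event": event_name, "ts": int(time.time() * 1000), "props": props}

def receive(raw):
    # Gateway side: parse and validate before the event enters the queue.
    event = json.loads(raw)
    if "event" not in event or "ts" not in event:
        raise ValueError("malformed event")
    return event

def compute(events):
    # Compute side: a trivial aggregation (count per event name) standing
    # in for the real batch / real-time computation layer.
    counts = {}
    for e in events:
        counts[e["event"]] = counts.get(e["event"], 0) + 1
    return counts  # what the query / service layers would expose

raw = json.dumps(collect("app_open", {"os": "android"}))
print(compute([receive(raw)]))  # {'app_open': 1}
```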

Let me share my experience at Analysys. Public and private clouds are both very popular right now, but I chose a vendor-provided hybrid cloud, which combines the scalability of a public cloud with the performance guarantees of a private cloud. The Analysys SDK currently covers 520 million monthly active users and 78 million daily active users, and this hybrid cloud architecture has supported that data scale, running every day and serving both internal Analysys analysts and external products, for two years now. I strongly recommend that anyone working on the underlying architecture try the hybrid cloud model.

Here are some advantages of hybrid cloud. An underlying architecture alone is not enough: at this data volume the receiving layer needs special optimization, and the cloud-plus-terminal control strategy is particularly important. Done poorly, hundreds of millions of devices will effectively mount a DDoS against you every day and crash your server cluster. A sketch of one common terminal-side control follows.
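
Here is a sketch of one such terminal-side control: exponential backoff with random jitter on upload retries, so that millions of devices that failed together do not all retry at the same instant. The `upload` callable is a hypothetical stand-in, not the Analysys SDK API:

```python
import random
import time

def send_with_backoff(upload, payload, max_retries=5, base=1.0, cap=300.0):
    """Retry `upload(payload)` with capped exponential backoff + full jitter."""
    for attempt in range(max_retries):
        try:
            return upload(payload)
        except OSError:
            # Sleep a random time in [0, min(cap, base * 2**attempt)] so that
            # retries from many devices spread out instead of synchronizing.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise OSError("giving up for now; buffer locally and retry later")
```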

Here are some strategic choices for data collection and data reception, along with the technical framework and modules a general-purpose collection system should have, for your reference. These frameworks can support hundreds of millions of monthly active users, so you can use them with confidence. Time is short, so I will pick out the two biggest pitfalls in big data processing and querying.

The first comes from our internal needs. We need to select users by certain tag features and examine their behavioral characteristics: for example, the top 5 apps most frequently opened between 22:00 and 23:00 by women born after 1995 who love watching videos. The logical storage structure is very simple: a user-tag table (user ID, tag ID) and a behavior table (user ID, timestamp, app name). The naive approach is a join plus where/order by. But bear in mind that Analysys holds 2.19 billion user profiles and records 25.2 billion user behaviors per day, hundreds of billions per month; how could a simple join handle that? Every company runs into similar situations, and my suggestion is: do not join! In a big data environment, avoid solving problems with joins. Instead, first use ES (Elasticsearch) to filter the users, then pivot the user behavior filter into a bitmap, and compute the final result through AND/OR operations. Interested friends can discuss this with me separately; I cannot go deeper today, but a toy sketch of the bitmap idea follows.
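
A toy sketch of the bitmap idea, using plain Python integers as bitmaps. A production system would use compressed bitmaps such as RoaringBitmap, and the user IDs and tag sets below are made up for illustration:

```python
# Plain Python ints as bitmaps: "user i is in the set" <=> bit i is 1.
def to_bitmap(user_ids):
    bm = 0
    for uid in user_ids:
        bm |= 1 << uid
    return bm

# Step 1: tag filters (in practice, the output of the ES query).
female_95s   = to_bitmap([1, 3, 5, 8])  # women born after 1995
video_lovers = to_bitmap([3, 4, 5, 9])  # love watching videos

# Step 2: behavior filter -- users who opened the app 22:00-23:00.
night_active = to_bitmap([2, 3, 5, 8])

# Combine with AND instead of a join; popcount gives the segment size.
segment = female_95s & video_lovers & night_active
members = [i for i in range(10) if segment >> i & 1]
print(members, bin(segment).count("1"))  # [3, 5] 2
```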

The other is the ordered conversion funnel, the concrete example I gave earlier. Everyone wants to know how many users browse a product --> place an order --> pay, and the steps must occur in order: a user cannot pay first and browse afterwards. This is hard to solve on big data because the volume of user behavior is enormous; finding ordered conversion combinations and returning results in seconds is a very challenging problem. A while ago I organized an OLAP contest in which many talented people and companies competed on exactly this problem, and first place in the open source group carried a prize of 100,000 yuan. Below is a simple idea for your reference and study. I will keep holding such contests from July 2018 onward, and everyone is welcome to play.
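
As a starting point, here is a simple per-user sketch of the ordered-funnel check: walk the user's time-sorted events and advance a pointer through the required step sequence. The event names are illustrative; doing this per-user scan over billions of events in seconds is exactly what the OLAP engines compete to optimize:

```python
def furthest_step(events, funnel=("browse", "order", "pay")):
    """Return how many funnel steps the user completed *in order*."""
    step = 0
    for _, name in sorted(events):  # events are (timestamp, event_name) pairs
        if step < len(funnel) and name == funnel[step]:
            step += 1
    return step  # len(funnel) means the user converted fully

print(furthest_step([(1, "browse"), (3, "order"), (7, "pay")]))  # 3
print(furthest_step([(1, "pay"), (2, "browse")]))  # 1 -- paying first doesn't count
```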

Of course, technology never ends; there is another important class of technologies that we will keep iterating on.

4. From Lean User Analysis to a Big Data Platform

Time is short, so let me briefly share the big data platform inside Analysys, hoping it offers you some inspiration.

For data storage, Analysys uses HDFS, Spark, and Hive, as well as Presto and Greenplum. A comparison of these open source big data stores is shown below.

One thing to emphasize: do not focus only on the big data storage platform; the resource scheduling platform and data governance services matter just as much. Time is short, so please explore these offline or search my past articles.

You are also welcome to visit ark.analysys.cn and experience Analysys's big data services. Let me stress again: big data analysis is only a process, not a result. Only lean analysis that closes the business loop is a path to sustainable development. The picture shows my WeChat and Weibo; you are welcome to follow me.

The following questions come from friends in the 51CTO developer community.

Q: Dongying Daily-Zhidao: Teacher Guo, many organizations now want big data, but the concept remains rather vague. Do you have good ways, from either the technical or the product side, to explain it clearly to leaders or colleagues?
A: Mr. Guo Wei, CTO of Analysys: Big data is genuinely useful, but you must find a closed business loop and be clear about which business problem you want big data to solve; the lean ideas in the first two parts of my talk are a good reference. I also recommend two books: "The Lean Startup" and "Lean Analytics". Many of the ideas in today's slides owe a debt to Eric Ries.


Q: Dongying Daily - Zhidao: Thank you very much. We are a newspaper group. Our leaders are very interested in big data and asked us to draw up a plan, and we feel at a loss. This is really an industry-wide demand: every industry has its own data, and mining and using it well would make for good data analysis. But producing such a plan is hard for us. Does Analysys have one?

A: Mr. Guo Wei, CTO of Analysys: Let's exchange contacts and chat privately about your specific needs.


Q: Data-unicorn-Beijing: For private deployment, will secondary development be authorized?

A: Mr. Guo Wei, CTO of Analysys: Of course.


Q: Wang Jun-Beijing-Hadoop: I use HBase + Phoenix for OLTP queries and Hive on Spark for OLAP, writing the processed OLAP results back into HBase for querying. The problem is that querying HBase through Phoenix is very slow: joining a table with tens of millions of rows against a table of about 100,000 rows takes 30-40 seconds, which is clearly unacceptable. The query dimensions are not fixed, and the row key is basically a combination of several fields. How can I optimize HBase + Phoenix?

A: Mr. Guo Wei, CTO of Analysys International: Do you use Hadoop? I suggest you try Greenplum.

A: Data-unicorn-Beijing: I recommend analyzing the application scenario before choosing a database. If the dimensions are not fixed and queries must be fast, MongoDB is a good choice. For data processing such as joins, Hive has clear advantages; or store in Hive and query through Presto (that combination is not fully mature yet and has hidden issues, for example around data types).

A: Half Development - Little Star - Guangzhou: You cannot blame the database entirely. First rule out indexing and SQL optimization. From what I recall, MySQL hits a bottleneck around 30 million rows, and PostgreSQL somewhat later. It also depends on how the where conditions are written: for example, OR, <>, or computations on the left-hand side of an expression will invalidate the index.

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
