Build a complete product data operation system in 11 steps!

Many people's understanding of data operations is limited to digital statistics, cause analysis, etc. In fact, these are only a small part of data operations work. Data ultimately serves the product. The focus of data operations is on operations, and data is just a tool.

What does data operations do? My personal understanding is:

Set product goals; create data reporting channels and rule processes; observe product data and set up data alerts; analyze the reasons for data changes; optimize the product and its operations based on the analysis; predict future data trends to inform product decisions; and integrate data applications into product planning and operations.

In simple terms, data operations should clarify the following five issues:

What do we want to achieve? ——Setting data targets;

What is the current situation? ——Industry analysis and product data report output;

What are the reasons for the data changes? ——Data early warning, analysis of reasons for data changes;

What will happen in the future? ——Data prediction;

What should we do next? ——Decision-making and applying data to the product.

How can we build a complete product data operation system? BLUES has sorted out and summarized his experience working at YY. The whole process can be divided into the following 11 steps, offered here for your reference.

11 steps to build a complete product operation data system

Step 1: Establish product goals

This is the starting point of data operations and also the standard for evaluating the product after launch, which closes the loop. Goals must not be set arbitrarily; make the final decision based on comprehensive inputs such as business development, industry trends, competitive product analysis, the product's data trends in previous years, and its conversion patterns. The SMART principle is often used to measure goals.

(1) S stands for Specific

It means that work indicators should be specific and measurable, not general. For example, when we set the product goal for the YY Voice basic experience, a goal like "improve the product experience" was not specific enough, and everyone understood it differently. Our actual basic product goal, increasing the next-day retention of new users, was very specific.

(2) M stands for Measurable

It means that performance indicators are quantitative or behavioral, and that the data or information needed to verify them is available. "Improve the next-day retention rate of new users" needs a specific target value.

(3) A stands for Attainable

It means that performance indicators should be achievable with effort; avoid setting goals that are too high or too low. The next-day retention target for newly registered users was not chosen on a whim. Based on YY's historical next-day retention data for new users and industry reference values for the retention of newly registered game users, we set a relatively challenging goal: raise the next-day retention rate of newly registered users from 25% to 35%.

(4) R stands for Relevant

It means the indicator is related to other work goals and relevant to the job. The next-day retention rate of new users is closely tied to user behavior, such as the user's recognition of the voice tool and the user's preference for the content on the YY platform, so it is strongly correlated with product performance and content popularity.

(5) T stands for Time-bound

It means the goal should have a specific deadline for completion.

The product goal can be formulated as follows: before December 31, 2013, increase the next-day retention rate of newly registered YY Voice users from 25% to 35%.

The increase in the next-day retention rate of new users means more active conversions of users, driving the growth of the overall number of active users.

Step 2: Define product data metrics

Product data indicators are specific numerical values that reflect the healthy development of a product. We need to give data indicators clear definitions, covering data reporting methods, calculation formulas, and so on.

For example, the next-day retention rate above can be defined as: the next-day retention rate is a ratio, the denominator is the number of YY accounts that are newly registered and log in to the YY client on the same day, and the numerator is the number of YY accounts in the denominator that log in to the YY client again on the next day.

Pay attention to the details here. The first and second days need clear time boundaries, for example 0:00 to 24:00 counts as one day. Now consider a new user who registers and logs in to the YY client at 23:00 on the first day and goes offline at 1:00 in the morning of the next day. Under the definition above, this user might not be recorded as a next-day retained user, because the reporting details are not specified precisely enough.

The definition requires logging in to the YY client again on the second day. The user in this case did not log in again on the second day, yet he was genuinely online on two consecutive days.

Therefore the definition needs an additional detail about login status: if a heartbeat packet is reported every 5 minutes, a user who is still online after midnight is reported as logged in on the next day. If the user goes offline before 00:05 and does not log in again before 24:00 of the next day, he is not recorded as a retained user.
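To make this definition concrete, here is a minimal sketch in Python (the field names and sample data are invented for illustration; a real pipeline would read heartbeat reports from the logging channel) that counts a user as retained if any activity report falls on the calendar day after registration:

```python
from datetime import date, timedelta

def next_day_retention(registrations, heartbeats):
    """registrations: {account_id: registration date}
    heartbeats: iterable of (account_id, date) login/heartbeat reports.
    Returns the next-day retention rate as a float."""
    # Index the calendar days on which each account reported activity.
    active_days = {}
    for account, day in heartbeats:
        active_days.setdefault(account, set()).add(day)

    cohort = len(registrations)
    if cohort == 0:
        return 0.0
    retained = sum(
        1 for account, reg_day in registrations.items()
        if reg_day + timedelta(days=1) in active_days.get(account, set())
    )
    return retained / cohort

# A user who stays online past midnight reports a heartbeat on day 2,
# so they are counted as retained under this definition.
regs = {"u1": date(2013, 12, 1), "u2": date(2013, 12, 1)}
beats = [("u1", date(2013, 12, 1)), ("u1", date(2013, 12, 2)),
         ("u2", date(2013, 12, 1))]
print(next_day_retention(regs, beats))  # 0.5
```

Because retention is defined over reported activity days rather than explicit logins, the midnight edge case above resolves itself: any heartbeat after 0:00 of day two counts.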

We select data indicators based on product goals. For web products, for example, we often measure with PV, UV, crash rate, average PV per user, and visit duration. Defining a product indicator system requires consensus across teams such as product and development. Each indicator's definition should be clear and well documented, so that people do not interpret the data differently.

Step 3: Build a product data indicator system

Based on the proposed data indicators, we summarize and organize the indicators according to product logic to make them more organized.

The next-day retention rate of new users is a core goal we set, but in fact, it is not enough to only look at the next-day retention rate. We also need to comprehensively examine the various factors that affect user retention rate in order to more accurately understand the healthy development of the product. As shown in Figure 1, this is a commonly used indicator system, including: new users, active users, payments, and other data.

Figure 1 Common data indicator system for Internet products

When building the YY Voice client, we used an indicator system covering four aspects: the account system, relationship-chain data, status-perception data, and communication capability. Specific indicators include the number of friends, time spent watching channel programs, time spent in IM chats, switches of and time spent in each personal status, and so on, as shown in Figure 2:

Step 4: Propose product data requirements

The establishment of a product indicator system is not achieved overnight. Product managers will raise data requirements with different focuses at different stages of product development. Companies generally have product requirement document templates to ease communication between product colleagues and the data reporting, data platform and other teams, and to carry out data construction. In a startup or small company, the whole process from proposing data requirements to reporting them may involve only one or two people, but it is still recommended to maintain data documentation, such as the definitions of data indicators and the data calculation logic.

Figure 3 is the basic product data requirement implementation process established by BLUES in the YY voice client team.

Figure 3: YY basic product data requirements implementation flowchart

Step 5: Report data

This step is development work: following the product manager's data requirements and the data reporting specifications, complete the reporting code and send the data to the data server. The key is the construction of the data reporting channel. When I worked at Tencent, I never felt the difficulty of this link, because the data platform department had already built a complete data channel; development only needed to follow certain rules and use a unified data SDK to report data.

Later, when I worked at YY, a development-oriented company, we started building the reporting channel, which gave me more opportunities to practice and improve myself. One of the key links is the data reporting test, which once caused unnecessary trouble due to the lack of testing resources for this link.

Many startups do not have their own data platforms, so they can use third-party data platforms: for web products, they can use Baidu Statistics (tongji.baidu.com); for mobile products, they can use platforms such as Umeng (www.umeng.com) and TalkingData (www.talkingdata.com).

Steps 6-8: Data collection and access, storage, scheduling and calculation

Each step is a science. For example, data collection involves interface creation, which requires consideration of the extensibility of data fields, the ETL data cleaning process during data collection, and the correctness verification of client data reporting. Data storage, scheduling, and computing are even more challenging technical tasks in the era of big data.

1. Data collection and access

ETL is the abbreviation of Extract-Transform-Load, which is used to describe the process of extracting, transforming and loading data from the source to the destination. The term ETL is more commonly used in data warehouses, but its application is not limited to data warehouses. ETL is an important part of building a data warehouse. Users extract the required data from the data source, clean the data, and finally load the data into the data warehouse according to the pre-defined data warehouse model.
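As a toy illustration of the three stages (the log format and field names here are invented, not YY's or Tencent's actual ones), an ETL pass over raw log lines might look like:

```python
import csv
import io

# Raw log lines as reported by clients (hypothetical pipe-delimited format).
RAW = """\
2023-12-01 10:01:02|u1|login|ok
2023-12-01 10:02:00|u2|login|ERROR
2023-12-01 10:03:30|u1|watch_channel|ok"""

def extract(text):
    # Extract: parse each raw line into a record.
    for line in text.splitlines():
        ts, user, event, status = line.split("|")
        yield {"ts": ts, "user": user, "event": event, "status": status}

def transform(records):
    # Transform: drop error rows, split the timestamp into date/time dimensions.
    for r in records:
        if r["status"] != "ok":
            continue
        day, time = r["ts"].split(" ")
        yield {"day": day, "time": time, "user": r["user"], "event": r["event"]}

def load(records):
    # Load: stand-in for the warehouse, writing a CSV the analysis layer can query.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["day", "time", "user", "event"])
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

warehouse = load(transform(extract(RAW)))
print(warehouse)
```

In production the extract step reads from the logServer or HDFS and the load step writes to the warehouse, but the shape of the pipeline is the same.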

The figure below is a common flow chart of a product data system. Data collection, storage, and calculation are usually completed in the data center in the figure.

After confirming the data report, the next few things are more technical. First of all, we need to figure out how the reported data should be collected and stored in our data center.

Data collection is divided into two steps. The first step is reporting from the business system to the server, mainly through CGI or backend servers. After a unified logAPI call, the raw flow data is aggregated and stored in the logServer. When the data volume becomes large, you need to consider distributed file storage; the most commonly used is HDFS. I won't go into details here.

Figure 5: Architecture diagram of raw data reporting and storage in files

After the data lands in files, the second step is the ETL stage: extract the logs from the text files, clean and transform them according to the analysis requirements and data dimensions, and load them into the data warehouse.

Take Tencent as an example: Tencent's big data platform now mainly supports massive data access and processing from two directions: offline and real-time. The core systems include TDW, TRC and TDbank.

Figure 6 Tencent data platform system

Within Tencent, data collection, distribution, pre-processing and management are all carried out through the TDBank platform, which is mainly built to handle the collection and processing of large-scale, real-time, and diverse data. Access and storage are handled uniformly by a three-layer architecture: a data access layer, a processing layer, and a storage layer.

(1) Access layer

The access layer can support business data and data sources in various formats, including different DBs, file formats, message data, etc. The data access layer will unify the various collected data into an internal data protocol to facilitate the use of subsequent data processing systems.

(2) Processing layer

Next, the processing layer uses plug-ins to support various forms of data preprocessing. For offline systems, an important function is to classify and store the data collected in real time. The data needs to be classified and stored according to certain dimensions (such as a key value + time). At the same time, the granularity (size/time) of the storage file also needs to be customized so that the offline system can perform offline calculations at the specified granularity. For online systems, common preprocessing processes include data filtering, data sampling, and data conversion.

(3) Data storage layer

Processed data is stored in HDFS as offline files, which keeps storage reliable as a whole, and is finally loaded into Tencent's internal distributed data warehouse, TDW.

Figure 7 TDW architecture diagram

TDBank collects data in real time from the business data source, performs pre-processing and distributed message caching, and then distributes it to the back-end offline and online processing systems in accordance with message subscription.

Figure 8 TDBank data collection and access system

TDBank builds a bridge between the data sources and the data processing systems, decoupling them from each other, and provides data support for the offline computing platform TDW and the online computing platform TRC. Through continuous improvement, the earlier Linux+HDFS model has been transformed into a cluster + distributed message queue model, cutting the time to process a day's worth of messages from a full day down to about 2 seconds.

From a practical standpoint, when considering data collection and access, product teams should pay attention to several dimensions:

- Unification of multiple data sources. In practice, data arrives from sources in different formats, so the collection and access layer must convert these sources into a unified format.

- Real-time, efficient collection. Since most systems are online systems, the timeliness requirements for data collection are high.

- Dirty data handling. Dirty data that would distort analysis and statistics needs to be logically shielded at the access layer, to avoid the many unpredictable problems it would otherwise cause in later statistical analysis and application.
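A minimal sketch of the first and third points, unifying records from hypothetical heterogeneous sources into one internal protocol and shielding dirty records at the access layer (all adapter names and formats here are invented for illustration):

```python
import json

# Hypothetical adapters that unify heterogeneous sources (DB rows, queue
# messages) into one internal record protocol: {source, user, event, ts}.

def from_db_row(row):
    # e.g. a (user, event, timestamp) tuple from a DB cursor
    user, event, ts = row
    return {"source": "db", "user": user, "event": event, "ts": ts}

def from_message(payload):
    # e.g. a JSON message taken off a queue
    m = json.loads(payload)
    return {"source": "mq", "user": m["uid"], "event": m["ev"], "ts": m["t"]}

def is_dirty(rec):
    # Shield obviously bad records at the access layer.
    return not rec["user"] or rec["ts"] <= 0

def access_layer(db_rows, messages):
    records = [from_db_row(r) for r in db_rows]
    records += [from_message(p) for p in messages]
    return [r for r in records if not is_dirty(r)]

clean = access_layer(
    db_rows=[("u1", "login", 1700000000), ("", "login", 1700000001)],
    messages=['{"uid": "u2", "ev": "pay", "t": 1700000002}'],
)
print(len(clean))  # 2: the row with an empty user id is shielded
```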

2. Data storage and calculation

After completing data reporting, collection and access, the data enters the storage stage. Let’s continue with Tencent as an example.

Within Tencent, there is a distributed data warehouse for storing data, internally codenamed TDW. It supports offline storage and computing of hundreds of PB-level data, providing a massive, efficient and stable big data platform and decision-making support for the business. It is built based on the open source software Hadoop and Hive, and has undergone a lot of optimization and transformation based on the company's specific circumstances such as large data volumes and complex calculations.

According to publicly released information, TDW has undergone extensive optimization and adaptation on top of the open-source Hadoop and Hive, and has become Tencent's largest offline data processing platform: a cluster of 5,000 machines in total, more than 20PB of total storage, and an average daily computing volume of over 500TB. It covers more than 90% of Tencent's business products; Guangdiantong recommendations, user profiling, data mining, and various business reports all draw their basic capabilities from this platform.

Figure 9: Tencent TDW distributed data warehouse

Figure 10 TDW business diagram

From the perspective of practical applications, the data storage part mainly considers several issues:

- Data security. Much of the data is unrecoverable, so the safety and reliability of data storage is always paramount; it deserves the greatest share of energy and attention.

- Efficiency of computation and extraction. As the storage source, the warehouse will serve a great deal of future query, extraction and analysis work, so this part must stay efficient.

- Data consistency. The stored data must be consistent between the primary and backup servers.

Step 9: Get Data

This is the process by which product managers and data analysts obtain data from the data system. The common methods are data reports and data extraction.

The report format is generally settled at the data requirements stage, especially in companies with accumulated experience: there is usually a report template, and you simply fill in the indicators. More capable data platforms support self-service reports, letting you select fields (table headers) according to the analysis need and have the report configured and computed automatically.

Here are some principles for designing data reports:

1. Provide continuous cycle query function

(1) A report must let you specify the start time of the query, so that data within a chosen time range can be viewed. Showing only a single point in time, with no way to see the data's trend, is taboo.

(2) Data within a period of time can be segmented or summarized, and different stages can be compared.

2. The query conditions match the dimensions

(1) Provide a query condition for every dimension, so that every dimension can be analyzed.

(2) Query conditions should support expanding, collapsing, and filtering to specific values, so you can look at the whole picture, at subsets, and at individual items.

(3) The order of the query conditions should match the order of the dimensions as much as possible, preferably from broad to narrow.

3. The chart should be consistent with the data

(1) The trend shown in the chart should be consistent with the corresponding data to avoid data disputes;

(2) If there is a graph, there must be data, but data can be available without a graph;

(3) There should not be too many indicators in the chart, and the gaps between the indicators should not be too large.

4. Keep each report single-purpose

(1) One report should serve one analysis purpose; separate multiple purposes into different reports as much as possible;

(2) Avoid requiring jumps between reports as much as possible;

(3) A report should only provide query functionality.

Let's look at some commonly used reports, starting with Baidu's traffic report for web products, which focuses on PV, UV, new-visitor ratio, bounce rate, average visit duration, and so on.

Let's talk specifically about the bounce rate. This data reflects the value of the landing page (not necessarily the homepage) when users enter the website: can it attract at least one click? If a user reaches the landing page and leaves without clicking, the bounce rate goes up.
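Under this definition, the bounce rate can be computed directly from per-visit click counts. A small illustrative sketch (the input format is invented):

```python
def bounce_rate(sessions):
    """sessions: list of click counts after landing, one per visit.
    A visit with zero clicks after the landing page is a bounce."""
    if not sessions:
        return 0.0
    bounces = sum(1 for clicks in sessions if clicks == 0)
    return bounces / len(sessions)

# 4 visits: two leave without clicking anything on the landing page.
print(bounce_rate([0, 3, 0, 1]))  # 0.5
```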

Figure 11 Baidu statistics web data report

Looking at the product retention rate data report provided by the Umeng data platform, the retention rates that are usually concerned about are: retention after 1 day, retention after 7 days, and retention after 30 days.

Figure 12 Umeng’s retention data report

Data extraction is a very common requirement in product operations, such as extracting a batch of products with good sales and their related fields, extracting a batch of users with specified conditions, etc. Similarly, a data platform with more complete functions will have a self-service data extraction system. If it cannot meet the self-service needs, data developers will be required to write scripts to extract data.

As shown in Figure 13, Tencent's internal data portal handles the data reporting, data extraction, and data report functions for many products.

Figure 13 Tencent Data Portal Home Page

Step 10: Observe and analyze data

The main task here is monitoring and statistical analysis of data changes. Usually, we will automatically output daily reports for the data and mark abnormal data. Visual output of data is very important.

The commonly used tools are Excel and SPSS, which can be considered basic skills for data analysis; I will share my personal methods and techniques for using them in real work later. Note that before analyzing, you should first verify data accuracy to confirm the data is what you want: whether the data definitions and reporting logic strictly follow the requirements document, and whether data could be lost in the reporting channel. It is recommended to sample the raw data to judge its accuracy.

Data interpretation is crucial in this link. The same data will have very different interpretation results due to differences in product familiarity and analysis experience. Therefore, product analysts must have a good understanding of the product and users.

Absolute values are usually hard to interpret; the meaning of the data usually comes through better in comparison.

For example, in the first week after a product goes online, an average of 100,000 new registrations per day seems like good data. But if the product is a new YY Voice product reached through YY pop-up messages, with tens of millions of user exposures every day, then only 100,000 new users cannot be considered good product data.

Figure 13: Clearer data meaning through comparison

Vertical comparison: for example, when analyzing changes in YY Voice's new-registration data, compare against the same period last week, last month, and last year to see whether similar patterns of change appear.

Horizontal comparison: changes in YY Voice's new-registration data can be analyzed with a funnel model, looking at the different channels users come from and whether each channel's conversion rate has changed; for example, whether any user-access channel at the top of the funnel shows a significant change, and which link in the channel shows a change in conversion rate. You can also compare across businesses, for example YY Voice's new registrations, Duowan.com's traffic, and YY Game's new registrations, to locate the reasons for data changes.
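A funnel like this can be checked step by step; a small sketch with invented channel numbers:

```python
# Hypothetical funnel counts for one acquisition channel.
funnel = [
    ("exposure",    10_000_000),
    ("click",          500_000),
    ("register",       150_000),
    ("first_login",    100_000),
]

# Step-to-step conversion rates reveal which link in the channel changed:
# a drop in one ratio localizes the problem to that step of the funnel.
for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
    print(f"{prev_name} -> {name}: {n / prev_n:.1%}")
```

Tracking these ratios over time, per channel, is what makes the horizontal comparison actionable.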

Combining vertical and horizontal comparison means plotting several data series over the same time period, for example six months of YY's new registrations, Duowan.com's traffic, and YY Game's new registrations. Comparing the three curves together helps locate the key node of a data anomaly; then check the operations log for operational activities, external events, or special days that could be influencing factors.
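The vertical comparison described above can be sketched minimally as follows, flagging any day whose week-over-week change exceeds a chosen threshold (the numbers and the 20% threshold are invented for illustration):

```python
# Hypothetical daily new-registration counts for two consecutive weeks.
last_week = [100, 110, 105, 120, 130, 150, 140]
this_week = [ 95, 100,  60, 115, 125, 145, 135]

# Vertical comparison: each weekday against the same weekday last week.
# A large swing on one day points at where to start the cause analysis.
for day, (prev, cur) in enumerate(zip(last_week, this_week), start=1):
    change = (cur - prev) / prev
    flag = "  <-- check operations log" if abs(change) > 0.2 else ""
    print(f"day {day}: {change:+.1%}{flag}")
```

Here day 3 would be flagged, and the next step is exactly what the text describes: look for an operational activity, external event, or special day around that node.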

Step 11: Product Evaluation and Data Application

This is the end point of the data operation closed loop, and also a new starting point. Data reports are by no means just for display, nor are they used to answer questions from leaders. Instead, they serve to optimize products and carry out operations. Just like the performance of product personnel, it is not just about whether product projects are completed and released on time, but also about continuously observing and analyzing product data, evaluating product health, and applying the accumulated data to product design and operations.

For example, Amazon’s personalized recommendation products, QQ Music’s “Guess What You Like”, Taobao’s “Time Machine”, Toutiao’s “Recommended Reading” and so on. Data product applications can be roughly divided into the following categories:

(1) Precision marketing represented by performance advertising

The recommendation cycle is short and the real-time requirements are high; the user's short-term interests and immediate behaviors carry great weight; the delivery context and the characteristics of the visiting audience also matter.

Product examples: Google, Facebook, and WeChat Moments.

(2) Content recommendation represented by video recommendation

The cumulative influence of long-term interests is large; time of day and hot events matter; relevance across multiple content dimensions is very important.

Product example: YouTube.

(3) Shopping recommendations represented by e-commerce recommendations

A combination of long-term interests, short-term interests, and immediate behaviors; closest to real life, where season and the user's life information are critical; oriented toward orders, transactions, and payment.

Product examples: Amazon, Taobao, and JD.com.

One picture summarizes the 11-step rules of data operation

Finally, a picture summarizes the 11 steps of data operation:

Figure 14 11 steps of data operation

From setting product goals to finally evaluating the product and optimizing operations against those goals, a closed loop of data operations is formed. This process and its norms require all departments to share a unified awareness, and every product terminal to report data uniformly according to the standardized process, so that a company-level unified data center and data warehouse can be established. Only then can the value of data be maximized and data become a productive force.

How to build a product data operation system? There are five major factors to consider:

(1) People: Full-time data operations colleagues

Full-time professional product colleagues are responsible for establishing the process and standardization of the product data system, accumulating experience, and promoting the continuous optimization and development of the system; full-time professional development colleagues are responsible for data reporting, report development, database development and maintenance, etc., to ensure the development and realization of the product data system;

(2) Data backend: comprehensive and systematic data warehouse

A dedicated, unified data warehouse records the product's own specific data; common data is obtained through the data platform's public interfaces, sharing data sources and keeping costs down.

(3) Data front-end: solidified data system display platform

We need professional report development colleagues to think about the report system in a systematic way and to execute it flexibly and iteratively, rather than simply accepting report demands and causing report proliferation.

(4) Work norms: Demand realization process

This is the 11-step process and method for building a product data system described above. Two points deserve attention in handling data requirements: solidify the development process for recurring requirements, and build tools for ad-hoc requests.

(5) Work output: data application

Routine data work covers various data analyses and the output of daily, weekly and monthly reports, providing a basis for decisions. Beyond that, carry out data product development, such as planning products for precise recommendation and user life-cycle management.

Author: Mantou Business School, authorized to be published by Qinggua Media.

Source: Mantou Business School
