What skills does data operations require? How do you build your own data operation system? Drawing on more than ten years of work experience, the author walks through building a data operation system step by step.
In my 18 years in the Internet industry, a large part of my work has been devoted to data operations. From QQ Show to YY Voice and then to Thunder, I have been through the full process of product data operations, including process optimization, platform construction, and analytical application, and I have seen first-hand the important role data operations play in product growth.

Many people's understanding of data operations is limited to data statistics, cause analysis, and the like. In fact, these are only a small part of the work. Data ultimately serves the product: the focus of data operations is operations, and data is the carrier.

What does data operations do? My personal understanding: drive the team to clarify product goals; define product data indicators; create data reporting channels, rules, and processes; efficiently push data requirements through to implementation; observe product data and raise data alerts; analyze the reasons behind data changes; iterate and operate the product based on the analysis results; provide a basis for product decisions; and use data to drive product and organizational growth toward the organization's goals. In simple terms, data operations comes down to clarifying five issues.
Here is a rough summary of the skills data operations needs to master. Of the many skill concepts mentioned above, the most fundamental path is to first learn statistics, then go deep into business practice, master an analysis tool well (Excel being the most common), and then learn a data mining tool. I personally use SPSS. Its functions include data management, statistical analysis, chart analysis, and output management, and its statistical procedures cover descriptive statistics, mean comparison, general linear models, correlation analysis, regression analysis, log-linear models, cluster analysis, data reduction, survival analysis, time series analysis, multiple responses, and more. The tools themselves are not hard to learn; what matters is learning the statistics: knowing which analysis method to use in which scenario, and how to interpret and apply the results.

Later, I distilled my data operations experience into a hierarchical diagram of an enterprise data operation system. How, then, can we build a complete product data operation system? Summarizing my own work experience, the whole process can be divided into the following 11 steps, for your reference.

Step 1: Set product goals

This is the starting point of data operations and also the standard against which the product is evaluated after launch, thus forming a closed loop. Goals must not be set on a whim. Instead, the final decision should be based on comprehensive inputs such as business development, industry development, competitive analysis, the product's development trend in previous years, and the product's conversion patterns. The SMART principle is often used to test a goal:
S (Specific): Work indicators must be specific and measurable, not general. For example, when we set the product goal for the YY Voice basic experience, "improve the product experience" was not specific enough, and everyone understood it differently. Our actual basic product goal at the time was to increase the next-day retention of new users, which was very specific.
M (Measurable): Performance indicators must be quantitative or behavioral, and the data or information needed to verify them must be obtainable. To improve the next-day retention rate of new users, a specific target value needs to be given.
A (Attainable): Performance indicators must be achievable with effort; avoid setting goals that are too high or too low. The target for next-day retention of newly registered users was not plucked out of thin air. We set a relatively challenging goal based on the historical next-day retention data of YY's new users and industry reference values for the retention of newly registered game users: increase the next-day retention rate of newly registered users from 25% to 35%.
R (Relevant): The indicator must relate to other work goals and to the job itself. The next-day retention rate of new users is closely tied to user behavior, such as the user's recognition of the voice tool and the user's preference for the content on the YY platform, so it is strongly correlated with product performance and content popularity.
T (Time-bound): Focus on a specific deadline for completing the goal. The product goal can then be formulated as: before December 31, 2013, increase the next-day retention rate of newly registered YY Voice users from 25% to 35%. An increase in next-day retention of new users means more users converting to active users, driving growth in the overall number of active users.

It is important to note here that we need insight into the essence behind the goal, not just the number. For example, the project I worked on to increase the retention rate of newly registered YY Voice users would have been very easy to "achieve" if we had only watched the retention figure move. One method I used at the time was user classification: I classified users by channel and by behavior, and found that a batch of junk new users was dragging down the overall retention figure. Many of these were machine-registered accounts, not real users. After eliminating them, the retention data looked much better. But that does not mean the task was complete, because what the goal really demands is growth in active users; the new-user retention rate is just one data reflection of it. So we cannot look at new-user retention as a single indicator. We must measure the value of the work across multiple indicators, such as new user registrations, effective user retention, user activity, and paid conversions.

Step 2: Define product data indicators

Following on from goal setting, we need to define the data indicators. Continuing the example above, the goal we set is the new-user retention rate; once it is achieved, we still need to judge whether reaching this indicator has really driven growth in the product's active users. Product data indicators are specific numerical values that reflect the healthy development of a product. We need to give each indicator a clear definition, including the data reporting method, the calculation formula, and so on. For example, the next-day retention rate above can be defined as a ratio: the denominator is the number of YY accounts that are newly registered and log in to the YY client on the same day; the numerator is the number of those accounts that log in to the YY client again on the next day.

Pay attention to the details. The first and second days need clear time boundaries, such as 0:00 to 24:00 counting as one day. Consider this problem: a new user registers and logs in to the YY client at 23:00 on day one and logs off at 1:00 in the morning of day two. Under the definition above, this user might not be recorded as a next-day retained user, because the reporting details are not fully specified. The definition requires logging in to the YY client again on the second day; this user did not log in on the second day, yet he was genuinely online on two consecutive days. The definition therefore needs an extra detail about login status: if a heartbeat packet is reported every 5 minutes, a new user who is still online after midnight can be reported as logged in on the next day; a user who goes offline before 00:05 on day two and does not log in again before 24:00 that day is not recorded as retained.
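To make the heartbeat rule concrete, here is a minimal sketch in Python. The data shapes (a set of day-one new accounts, a list of 5-minute heartbeat events) are hypothetical, chosen to illustrate the definition above; this is not YY's actual reporting code.

```python
# A minimal sketch of the next-day retention rule described above.
# Assumed inputs: accounts that registered and logged in on day one,
# and heartbeat packets as (account_id, timestamp) pairs.
from datetime import date, datetime, timedelta

def next_day_retention(new_accounts: set[str],
                       heartbeats: list[tuple[str, datetime]],
                       day_one: date) -> float:
    """Ratio of day-one new accounts seen logged in on day two.

    A user counts as logged in on day two if at least one heartbeat
    falls within day two (00:00-24:00); a user who goes offline before
    the first day-two heartbeat is not retained.
    """
    day_two = day_one + timedelta(days=1)
    seen_day_two = {acct for acct, ts in heartbeats if ts.date() == day_two}
    retained = new_accounts & seen_day_two
    return len(retained) / len(new_accounts) if new_accounts else 0.0

# Toy example: "a" registers at 23:00 on day one and the client keeps
# heartbeating past midnight, so "a" is retained; "b" logs off before
# 00:05 on day two and is not.
beats = [
    ("a", datetime(2013, 6, 1, 23, 0)), ("a", datetime(2013, 6, 2, 0, 5)),
    ("b", datetime(2013, 6, 1, 22, 0)),
]
print(next_day_retention({"a", "b"}, beats, date(2013, 6, 1)))  # 0.5
```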
We select data indicators based on product goals. For example, for web products we often use data such as PV, UV, crash rate, average PV per user, and length of stay to measure the product. Defining a product indicator system requires consensus among the teams involved, such as product and development; each indicator must be clearly defined and well documented so that there are no differences in how the data is interpreted. The focus of the indicators differs at different stages of the product life cycle; the table below roughly lists some indicators to watch at each stage. Beyond the common user and revenue indicators, we must also pay attention to technical performance indicators.

There are five marks of a good data indicator:
(1) It reflects the satisfaction of user needs, the product's core value, and its development trend; improvement in the indicator shows the company is moving in a good direction.
(2) It is comparable. Comparing performance across time periods, across user groups, and against competing products gives better insight into where the product is actually heading.
(3) It is easy to understand and controllable: simple to understand, remember, and count.
(4) It is often a ratio.
(5) It evolves with the business: the key indicators at different stages should change as the business changes.

Step 3: Build a product data indicator system

Based on the proposed data indicators, we summarize and organize them according to the product's logic to make them more systematic. The next-day retention rate of new users is a core goal we set, but it is not enough on its own; we also need to examine all the factors that affect user retention in order to understand the product's health more accurately. Figure 1 shows a commonly used indicator system, covering new users, active users, payments, and other data.

Figure 1: Common data indicator system for Internet products

When we built the YY Voice client product, we used an indicator system covering four aspects: the account system, relationship-chain data, status-awareness data, and communication capability. Specific indicators include the number of friends, time spent watching channel programs, IM chat duration, and the switching and duration of personal status, as shown in Figure 2 (a minimal sketch of such an indicator system follows below):

Figure 2: IM product data indicator system
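As an illustration only, an indicator system like Figure 2's can be written down as structured definitions, so that every indicator carries an agreed definition, a formula, and a reporting method. The four categories follow the text above; the specific fields and formulas shown are hypothetical.

```python
# A hedged sketch of recording the IM indicator system of Figure 2 as
# structured definitions. Categories come from the text; formulas and
# reporting methods are invented for illustration.
from dataclasses import dataclass

@dataclass
class Indicator:
    name: str
    definition: str   # plain-language definition everyone agrees on
    formula: str      # calculation logic, to avoid interpretation drift
    reporting: str    # how the raw data is reported

INDICATOR_SYSTEM = {
    "account system": [
        Indicator("new registrations", "accounts registered today",
                  "count(distinct account_id)", "registration event"),
    ],
    "relationship chain": [
        Indicator("friend count", "average friends per active account",
                  "sum(friends) / active accounts", "daily snapshot"),
    ],
    "status awareness": [
        Indicator("status duration", "time spent in each personal status",
                  "sum(status interval)", "status-change event"),
    ],
    "communication capability": [
        Indicator("IM chat duration", "time spent in IM chats",
                  "sum(session end - session start)", "session events"),
    ],
}
```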
Step 4: Propose product data requirements

An indicator system is not built overnight. Product managers put forward data requirements with different focuses depending on the product's stage of development. Companies generally have templates for product data requirement documents, to ease communication among product colleagues, data reporting developers, the data platform team, and other departments as the data system is built. For a small startup, the path from proposing a data requirement to reporting the data may involve only one or two people, but it is still advisable to maintain data documentation, such as indicator definitions and calculation logic.

Figure 3 shows the basic product data requirement implementation process I established in the YY Voice client team. In truth, most of the time such a formal process is not needed; we were just beginning to standardize data requirements at the time, and the requirement review process doubled as training to make more colleagues data-aware. Later, data requirements were folded into the product requirement process.

Figure 3: YY division basic product data requirement implementation flowchart

There are two common types of data reporting requirements:
(1) Data reporting using a standard protocol. Table 1 shows an example template for standard-protocol reporting requirements.
(2) Data reporting using a custom protocol. Table 2 shows an example template for custom-protocol reporting requirements; the registration name in the example is: YY Business Unit – Basic Product Group – Game Live Operation Daily Report.

Step 5: Report data

In this step, development implements the product manager's data requirements, follows the data reporting specification, and reports the data to the data server. The key is the construction of the data reporting channel. When I worked at Tencent I never felt the difficulty of this link, because the data platform department had already built a complete data channel; developers only had to follow certain rules and use a unified data SDK. Later at YY, a development-oriented company, we built the reporting channel ourselves, which gave me more opportunities to practice and improve. One critical link is the data reporting test; a shortage of testing resources at this step once caused us unnecessary trouble.

Many startups do not have their own data platforms and can use third-party ones instead: for web products, Baidu Statistics (tongji.baidu.com); for mobile products, platforms such as Umeng (www.umeng.com) and TalkingData (www.talkingdata.com).

As an example, Table 3 shows the sending function send_web_pv for reporting page traffic data, taken from the Thunder Hubble data platform specification; a hedged sketch of what such a function might look like follows below. Table 4 shows an example of data reporting points for a live-broadcast app (data embedding means adding statistical logic alongside the functional logic). There is now also a way to report data without manual embedding; see the article "Unveiling the mystery of GrowingIO without embedding points".

Table 3: Sending function send_web_pv for reporting page traffic data
Table 4: Data reporting example of a live-broadcast app
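The Hubble table itself is not reproduced here, but a page-view reporting call in the spirit of send_web_pv might look like the following minimal sketch. The endpoint URL, field names, and parameters are assumptions for illustration, not the actual Hubble specification.

```python
# A minimal sketch of a page-view reporting function in the spirit of
# send_web_pv. The endpoint and field names are hypothetical; a real
# platform's spec would fix them precisely.
import json
import time
import urllib.request

REPORT_URL = "https://stat.example.com/report"  # hypothetical endpoint

def send_web_pv(page_url: str, user_id: str, referrer: str = "") -> None:
    """Report one page view to the data server."""
    event = {
        "event": "web_pv",
        "page_url": page_url,
        "user_id": user_id,
        "referrer": referrer,
        "ts": int(time.time()),  # client timestamp, seconds
    }
    req = urllib.request.Request(
        REPORT_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=2)  # fire-and-forget in practice

# send_web_pv("https://example.com/home", user_id="u123")
```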
Steps 6-8: Data collection and access, storage, scheduling, and computation

Each of these steps is a discipline in itself. For example, data collection involves interface design, which must consider the extensibility of data fields, the ETL cleaning process during collection, and verification that client-side reporting is correct. Data storage, scheduling, and computation are even more demanding technical tasks in the era of big data.

ETL is short for Extract-Transform-Load and describes the process of extracting data from a source, transforming it, and loading it into a destination. The term is most often used around data warehouses, though its application is not limited to them. ETL is an important part of building a data warehouse: the required data is extracted from the data source, cleaned, and finally loaded into the warehouse according to a predefined warehouse model.

Figure 4 shows a common product data system flow. Data collection, storage, and computation are usually completed in the data center shown in the figure.

Figure 4: Data system flow

Once data reporting is confirmed, the remaining work is more technical. First we need to work out how reported data is collected into and stored in the data center. Collection happens in two steps. In the first step, the business system reports to the server, mainly through CGI or a backend server; after a unified logAPI call, the raw flow data is aggregated and stored on the logServer. When data volume grows, distributed file storage must be considered; the most common open distributed file store is HDFS, which I will not expand on here.

Figure 5: Architecture of raw data reporting and file storage

After the data lands in files, the second step is the ETL stage: extracting the logs from text, cleaning them according to the analysis requirements and data dimensions, and loading them into the data warehouse.

Take Tencent as an example. Tencent's big data platform supports massive data access and processing along two directions, offline and real-time; its core systems are TDW, TRC, and TDBank.

Figure 6: Tencent data platform systems

Data collection, distribution, preprocessing, and management on the Tencent data platform are all handled by one platform, TDBank. It exists to solve the collection and processing of large-scale, real-time, and diverse data, and it addresses access and storage uniformly with a three-layer architecture: an access layer, a processing layer, and a storage layer.

(1) Access layer. The access layer supports business data and data sources in many formats, including different DBs, file formats, and message data. It converts the various collected data into a unified internal data protocol, for the convenience of the downstream processing systems.

(2) Processing layer. The processing layer uses plug-ins to support many forms of preprocessing. For offline systems, an important function is to classify and store the data collected in real time, according to chosen dimensions (such as a key value plus time); the granularity of the stored files (size/time) must also be configurable so that the offline system can compute at the specified granularity. For online systems, common preprocessing includes data filtering, data sampling, and data conversion.

(3) Storage layer. The processed data uses HDFS as the storage medium for offline files, ensuring that storage is reliable overall, and is finally loaded into Tencent's internal distributed data warehouse, TDW.
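To make the ETL stage concrete, here is a minimal sketch, assuming hypothetical tab-separated raw logs and an in-memory stand-in for the warehouse; a real pipeline would read from HDFS and load into a warehouse such as Hive/TDW.

```python
# A minimal ETL sketch: extract raw log lines, transform (clean and
# parse), load into a destination table. The log format and field
# names are hypothetical.
import csv
from datetime import datetime
from typing import Iterator

def extract(path: str) -> Iterator[list[str]]:
    """Read raw tab-separated log lines."""
    with open(path, newline="") as f:
        yield from csv.reader(f, delimiter="\t")

def transform(rows: Iterator[list[str]]) -> Iterator[dict]:
    """Clean rows: drop malformed lines, parse types, normalize fields."""
    for row in rows:
        if len(row) != 3:        # malformed line -> discard
            continue
        user_id, event, ts = row
        try:
            when = datetime.fromtimestamp(int(ts))
        except ValueError:       # bad timestamp -> discard
            continue
        yield {"user_id": user_id.strip(), "event": event, "date": when.date()}

def load(records: Iterator[dict], table: list[dict]) -> None:
    """Append cleaned records to the destination table."""
    table.extend(records)

warehouse_table: list[dict] = []  # a list stands in for a warehouse table
# load(transform(extract("/data/logs/events.tsv")), warehouse_table)
```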
Figure 7: TDW architecture diagram

TDBank collects data in real time from the business data sources, performs preprocessing and distributed message caching, and then distributes it, via message subscription, to the offline and online processing systems behind it.

Figure 8: TDBank data collection and access system

TDBank builds a bridge between the data sources and the data processing systems, decoupling the two, and supplies data to the offline computing platform TDW and the online computing platform TRC. Through continuous improvement, the earlier Linux+HDFS model has been transformed into a cluster plus distributed-message-queue model, cutting data processing latency from as long as a day down to two seconds. From a practical standpoint, when considering data collection and access, a product team should focus on several aspects.
After data reporting, collection, and access are complete, the data enters the storage stage. Continuing with the Tencent example: inside Tencent there is a distributed data warehouse for storing data, internally code-named TDW. It supports offline storage and computation at the level of hundreds of petabytes, providing the business with a massive, efficient, and stable big data platform and decision support. It is built on the open-source software Hadoop and Hive, with extensive optimization and modification for the company's particular conditions, such as huge data volumes and complex computation. According to publicly released information, TDW has become Tencent's largest offline data processing platform: the cluster totals 5,000 machines, total storage exceeds 20 PB, and average daily computation exceeds 500 TB. It covers more than 90% of Tencent's business products; Guangdiantong recommendations, user profiling, data mining, and all kinds of business reports obtain their basic capabilities through this platform.

Figure 8: Tencent TDW distributed data warehouse
Figure 9: TDW business diagram

From a practical standpoint, the data storage part mainly involves weighing several issues.
The key to this step, for an enterprise building its own private data platform, is to find architects and engineers with data platform experience; that yields twice the result for half the effort. For a small or medium-sized company, it is usually more efficient to use cloud products directly.

Step 9: Get data

This is the process by which product managers and data analysts obtain data from the data systems. The common means are data reports and data extraction. The format of a report is generally settled at the data requirement stage; in companies with accumulated experience there is usually a report template, and you simply fill in the indicators. More capable data platforms support self-service reports, where you select the fields (table headers) you need and the report is configured and computed for you.

Some principles for designing data reports:

1. Provide continuous, periodic query functions.
(1) A report must let you set the start time of a query, so that data within a chosen time range can be viewed. A report that shows only a single point in time, with no way to see the trend, is taboo.
(2) Data within a period should be viewable in segments or in summary, so that different stages can be compared.

2. Query conditions should match the dimensions.
(1) Provide a query condition for every dimension; try to satisfy analysis along each dimension.
(2) Query conditions should support expanding, collapsing, and filtering to specific values, so you can look at the whole, at the details, or at a single item.
(3) The order of the query conditions should correspond to the order of the dimensions, preferably from large to small.

3. Charts should be consistent with the data.
(1) The trend shown in a chart must match the underlying data, to avoid disputes over the numbers.
(2) Wherever there is a chart there must be data, although data may exist without a chart.
(3) A chart should not carry too many indicators, and the gaps between indicator values should not be too large.

4. A report should do one thing.
(1) One report should serve one analysis function; separate multiple functions into different reports as far as possible.
(2) Avoid jumps between reports as far as possible.
(3) A report provides the query function only.

Consider some commonly used reports. Baidu's traffic report for web products focuses on PV, UV, new-visitor ratio, bounce rate, average visit duration, and so on. Take the bounce rate specifically: it reflects the value of the landing page (not necessarily the homepage) when users enter the site, i.e., whether it can attract even one click. If a user reaches the landing page and makes no click, the bounce rate rises.

Figure 10: Baidu Statistics web data report

In the product retention report provided by the Umeng data platform, the retention rates usually watched are 1-day, 7-day, and 30-day retention.

Figure 11: Umeng's retention data report

Data extraction is a very common need in product operations, such as pulling a batch of well-selling products and their related fields, or pulling a batch of users matching specified conditions. Here too, a more complete data platform will have a self-service data extraction system; when it cannot satisfy a need, data developers write extraction scripts by hand, as in the sketch below.
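A minimal sketch of such an extraction script, assuming a hypothetical `users` table with `channel` and `register_date` columns in a local SQLite file; in practice the query would run against the company's data warehouse.

```python
# A minimal data-extraction sketch: pull a batch of users matching
# specified conditions. Table and column names are hypothetical.
import sqlite3

def extract_users(db_path: str, channel: str, since: str) -> list[tuple]:
    """Return (user_id, register_date) for users from one channel
    registered on or after `since` (YYYY-MM-DD)."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT user_id, register_date FROM users "
            "WHERE channel = ? AND register_date >= ? "
            "ORDER BY register_date",
            (channel, since),
        )
        return cur.fetchall()
    finally:
        conn.close()

# rows = extract_users("warehouse.db", channel="appstore", since="2013-06-01")
```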
Step 10: Observe and analyze data

The main task here is to monitor the data and statistically analyze its changes. Usually we output automated daily reports and flag abnormal data; visual output of the data matters a great deal. Data analysis is commonly used to understand the product's status, understand development trends, discover problems, understand users, and support marketing. The common tools are Excel and SPSS, which can fairly be called the basic skills of data analysis; I will share my own techniques for using these two tools in real work separately.

Note that before analyzing, you should first verify data accuracy, to confirm the data is really what you want: whether the data definition and reporting logic strictly follow the requirements document, and whether data could be lost in the reporting channel. It is advisable to sample the raw data to determine its accuracy.

Data interpretation is crucial at this point. The same data will yield very different readings depending on familiarity with the product and analysis experience, so a product analyst must know the product and its users well. Absolute values are usually hard to interpret; meaning generally emerges through comparison. For example, in the first week after a product launches, an average of 100,000 new registrations a day looks like good data. But if the product is a new one launched by YY Voice, reached through YY pop-up messages with tens of millions of user exposures every day, 100,000 new users a day cannot be called a good result.

Figure 13: Comparison makes the meaning of data clearer

In a vertical comparison, when analyzing changes in newly registered YY Voice users, you compare against the same period last week, last month, and last year, to see whether similar patterns recur. In a horizontal comparison, changes in YY Voice's new registrations can be analyzed through the funnel model, looking at the different channels users come from to see whether each channel's conversion rate has changed: at the top of the funnel, has any channel's data changed sharply, and at which link in the channel has the conversion rate moved. You can also compare horizontally across businesses, for example YY Voice's new registrations, Duowan.com's traffic, and YY Game's new registrations, to locate the causes of a change. Combining the two means plotting several data curves over the same period, for example six months of YY new registrations, Duowan.com traffic, and YY Game new registrations on one chart, finding the key node of an anomaly in one curve, and then checking the operations log: was there an operations campaign, an external event, or a special day at work?
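A minimal sketch of such vertical (period-over-period) and horizontal (across channels) comparison, assuming a hypothetical pandas DataFrame of daily new registrations per channel; column names and figures are illustrative.

```python
# A minimal sketch of vertical and horizontal comparison of daily new
# registrations. The data layout is hypothetical.
import pandas as pd

# daily new registrations per channel (illustrative data shape)
df = pd.DataFrame({
    "date": pd.date_range("2013-06-01", periods=14),
    "popup": [98, 95, 97, 96, 99, 94, 95, 60, 58, 61, 59, 57, 62, 60],
    "web":   [20, 21, 19, 22, 20, 21, 20, 21, 19, 22, 20, 21, 20, 19],
}).set_index("date")

df["total"] = df.sum(axis=1)

# Vertical comparison: each day vs. the same day last week.
wow = df["total"].pct_change(periods=7)
print(wow.tail(7))  # a sharp week-over-week drop flags an anomaly

# Horizontal comparison: which channel moved? Compare weekly means.
print(df.resample("7D").mean())  # the popup channel, not web, fell
```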
The output of analysis results usually takes an intuitive, visual form; choosing a sensible chart makes the results more readable. Two practical visualization tools worth recommending: Baidu Tushuo (https://tushuo.baidu.com) and WordArt (https://wordart.com), which can generate personalized word-cloud charts from custom images. For more on word clouds, see "Word cloud map strategy (Part 2): Customize graphics to create personalized word cloud maps".

Step 11: Product evaluation and data application

This is the end point of the data operations closed loop, and also a new starting point. Data reports are by no means just for display, nor merely for answering leaders' questions; they exist to optimize the product and drive operations. Just as the performance of product staff is not only about shipping projects on time, it is also about continuously observing and analyzing product data and evaluating product health. At the same time, the accumulated data should be applied in product design and operations, as in Amazon's personalized recommendations, QQ Music's "Guess What You Like", Taobao's "Time Machine", Toutiao's recommended reading, and so on. Data product applications fall roughly into the following categories (a small scoring sketch follows the list):

(1) Precision marketing, represented by performance advertising. The recommendation cycle is short and the real-time requirement is high; the user's short-term interests and immediate behavior carry great weight, as do the delivery context and the characteristics of the visiting audience. Product examples: Google, Facebook, WeChat Moments. The figure below shows WeChat's user-data targeting capability, which can locate users precisely along many dimensions, such as region, gender, age, handset, marital status, and education. Although many people joke that they cannot afford WeChat Moments ads, in many cases whether to buy is up to you, and as data accumulates the ads will only get more accurate.

(2) Content recommendation, represented by audio and video recommendation. Long-term interests accumulate great influence; time of day and hot events matter; multi-dimensional content relevance is essential. Product examples: YouTube, NetEase Cloud Music, TikTok, QQ Music. The young ladies, Jack Ma clips, and scenery TikTok recommends to me broadly fit the preferences of a 40-year-old male Internet practitioner and travel enthusiast.

(3) Shopping recommendation, represented by e-commerce. A combination of long-term interests, short-term interests, and immediate behavior; closest to real life, where seasons and the user's life context are critical; oriented toward orders, transactions, and payment. Product examples: Amazon, Taobao, JD.com. The recommendations Taobao gives me roughly fit a male user with children at home who likes outdoor sports.
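As a toy illustration of mixing long-term interests with short-term behavior, as described above, here is a minimal scoring sketch. The weights, tags, and items are invented; this is not any product's actual algorithm.

```python
# A toy sketch of blending long-term and short-term interest to score
# candidate items. All values are invented for illustration; real
# recommendation systems are far more elaborate.
def score(item_tags: set[str],
          long_term: dict[str, float],
          short_term: dict[str, float],
          w_long: float = 0.4,
          w_short: float = 0.6) -> float:
    """Weighted sum of a user's long-term and short-term tag affinities."""
    long_part = sum(long_term.get(t, 0.0) for t in item_tags)
    short_part = sum(short_term.get(t, 0.0) for t in item_tags)
    return w_long * long_part + w_short * short_part

long_term = {"outdoor": 0.8, "children": 0.6}  # slow-moving profile
short_term = {"camping": 0.9}                  # recent browsing behavior
items = {"tent": {"outdoor", "camping"}, "novel": {"books"}}
ranked = sorted(items, key=lambda i: score(items[i], long_term, short_term),
                reverse=True)
print(ranked)  # ['tent', 'novel']
```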
Summary

Finally, one picture sums up the 11 steps of data operations:

Figure 14: The 11 steps of data operations

From setting product goals to finally evaluating the product and optimizing operations against those goals, a closed loop of data operations is formed. This process and these norms require that all departments share a unified awareness, that every product client reports data uniformly according to the standardized process, that a company-level unified data center is established, and that a data warehouse is built. Only then can the value of data be maximized and data become a productive force.

To summarize the construction of a product data operation system from the angle of organizational implementation, five factors are worth considering:

(1) People: full-time data operations colleagues. Full-time product colleagues are responsible for establishing the processes and standards of the product data system, accumulating experience, and driving its continuous optimization; full-time development colleagues are responsible for data reporting, report development, and database development and maintenance, guaranteeing that the product data system gets built.

(2) Data backend: a comprehensive, systematic data warehouse. A dedicated unified data warehouse records the product's own specific data, while common data is obtained through the data platform's public interfaces, sharing data sources and keeping costs down.

(3) Data frontend: a solidified data display platform. Professional report developers need to think about the report system systematically and iterate on it flexibly, rather than simply taking every report request and letting reports proliferate.

(4) Work norms: the requirement realization process. This is the 11-step process and method for building a product data system described above. Two points deserve attention in handling data requirements: solidify the process for recurring development needs, and provide tooling for ad-hoc ones.

(5) Work output: data application. Routine data work includes all kinds of data analysis and the output of daily, weekly, and monthly reports, providing a basis for decisions grounded in data analysis. Beyond that, develop data products, such as precise recommendation and user life-cycle management.

The above is a summary of my many years of work practice. I would also like to thank the many data colleagues who worked alongside me: Gong Wei, Chang Bo, Chun Ge, Xia Cong, Yu Wen, Zhihua, Jing Mi, Xiao Wei, Jian Yu, and others.

Source: BLUEMIDOU