Big data prescription for startups | WOT Technology Clinic Second Phase Diagnosis

On October 11, the second session of the WOT Technical Clinic concluded. This session's invited expert, Dong Sibei, senior architect in Dangdang's Advertising and Recommendation Department, prescribed solutions for typical big data problems that many startups encounter.

Dong Sibei: Senior architect in Dangdang's Advertising and Recommendation Department. He holds bachelor's and master's degrees from Jilin University and was previously development manager for Sogou Map.

Engaged in: architecture design, data analysis, website security, and related work. He also follows the application of newer technologies such as Internet security, anti-fraud, and machine learning.

Question 1: Many mobile app startups still store and analyze data with traditional methods only, so big data remains at the conceptual stage for us. After all, we are not BAT. Could the expert walk through one or two cases, from requirements to technical architecture, so that startups can see benefits faster?

Dong Sibei: I think what the questioner cares about most is how to "get benefits faster". Quick benefits generally mean lower time cost, less manpower, and low operation and maintenance cost. What is being asked for is essentially a data analysis platform. The main questions to analyze are: what data is there, what does it cost, what is the benefit, and what is the result?

APP application analysis, market data analysis;
Business data and behavioral data;
Cost: labor cost, time cost, operation and maintenance cost;
Changing needs;

We are mainly talking about apps, which need both data collection and data analysis. To reduce costs, the easiest approach is to collect data through a third-party platform such as Umeng. For storing the collected data you do not have to start from scratch either: you can rent a public cloud and pay on demand, for example Alibaba Cloud or Qiniu Cloud Storage.

However, third-party platform services naturally come with drawbacks. Umeng, for example, is simple, easy to use, and free; the disadvantages are that the raw data cannot be pulled out for in-depth or cross analysis, and there is no absolute guarantee of security. Which way to go depends mainly on the app startup's early decisions. Third-party statistics services are only a transitional solution. Dangdang.com currently uses Umeng and likewise cannot analyze that data in depth, but we also do our own data collection and analysis, so the two coexist and verify each other, which yields additional value.

Finally, to get real value from the data, it still has to be mined further. At that point different data should be treated differently, with business data and behavioral data kept separate. The results must then be presented as line charts, pie charts, and bar charts so that decision-makers can see whether the system is succeeding.

The following are the technical points that everyone needs to pay attention to:

Data reports and dashboards

Writing SQL against business data. Advantages: flexible customization, accuracy, real-time results, and the ability to run complex business analysis. Disadvantages: historical state is overwritten, scaling out yourself is complex, and computing power is limited.

Data model: Event = eventid + pageid + properties + userid; User = userid + properties
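The event and user model above can be sketched as two small Python record types. This is only an illustration: the field names follow the model as listed, while the `ts` timestamp field, the `to_log_line` helper, and the sample values are assumptions added for the example.

```python
from dataclasses import dataclass, field, asdict
import json
import time


@dataclass
class User:
    """User record: userid plus free-form properties."""
    userid: str
    properties: dict = field(default_factory=dict)


@dataclass
class Event:
    """Behavior event: eventid + pageid + properties + userid."""
    eventid: str
    pageid: str
    userid: str
    properties: dict = field(default_factory=dict)
    # Assumption: a timestamp is not part of the listed model, but logs usually need one.
    ts: float = field(default_factory=time.time)

    def to_log_line(self) -> str:
        """Serialize the event as one JSON line (hypothetical helper)."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical sample data.
user = User(userid="u1", properties={"city": "Beijing", "channel": "appstore"})
event = Event(eventid="click", pageid="product_detail", userid=user.userid,
              properties={"product_id": "p42"}, ts=1444550400.0)
line = event.to_log_line()
print(line)
```

Keeping `properties` as a free-form dictionary lets new attributes be added without changing the schema, which fits the point that requirements keep changing.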

Behavioral data is written as logs: HTTP transport > Kafka > HDFS; downstream consumers can subscribe to the messages from Kafka.

Data mining and analysis: Python, Hadoop, Spark; decoupled from the business database, with powerful computing power; dimension, indicator, and funnel analysis over steps such as registration, visit, and click.
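Analysis over a sequence of steps such as registration, visit, and click can be sketched as a simple funnel count. This is a minimal example on in-memory data, assuming events arrive ordered by time; the step names come from the text, while the sample events and the `funnel_counts` function are hypothetical.

```python
from collections import defaultdict

# Funnel steps in order, taken from the examples in the text.
FUNNEL = ["registration", "visit", "click"]

# Hypothetical sample events: (userid, eventid), already ordered by time.
events = [
    ("u1", "registration"), ("u1", "visit"), ("u1", "click"),
    ("u2", "registration"), ("u2", "visit"),
    ("u3", "registration"),
    ("u4", "visit"),  # never registered, so not counted in the funnel
]


def funnel_counts(events, steps):
    """Count users reaching each step, requiring the earlier steps first."""
    progress = defaultdict(int)  # userid -> number of steps completed so far
    for userid, eventid in events:
        done = progress[userid]
        if done < len(steps) and eventid == steps[done]:
            progress[userid] = done + 1
    return [sum(1 for d in progress.values() if d > i) for i in range(len(steps))]


counts = funnel_counts(events, FUNNEL)
for step, n in zip(FUNNEL, counts):
    print(f"{step}: {n} users ({n / counts[0]:.0%} of step 1)")
```

On real volumes the same logic would run as a Hadoop or Spark job rather than in a single process.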

Data presentation: line charts, pie charts, bar charts; tools such as HighCharts and OpenCharts

Data feedback: product analysis, iterative calculation

Finally, the solution to the first problem is:

Use of third-party services;
Choose a small team of 2-3 people;
Choose a good data model;
Choose commonly used tools such as Python, Hadoop, and Spark;
It is best to use MySQL for business data (such as orders, logistics, and payments), since this is all structured data;
Behavioral data (search, browse, click, favorite) is generated in large quantities every day and can be directly stored in HDFS.
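The split between structured business data and high-volume behavioral data can be sketched as a small router. To keep the example self-contained, `sqlite3` stands in for MySQL and a local directory of JSON-lines files stands in for HDFS; the `store` function, table layout, and sample records are all assumptions for illustration.

```python
import json
import os
import sqlite3
import tempfile

# Assumption: these event types mark behavioral data; anything else here is an order.
BEHAVIOR_EVENTS = {"search", "browse", "click", "favorite"}


def store(record, db, log_dir):
    """Route one record to the relational store or to an append-only log."""
    if record["type"] in BEHAVIOR_EVENTS:
        # Partition by day so each day's log can be processed as one batch.
        path = os.path.join(log_dir, f"behavior-{record['day']}.jsonl")
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return "log"
    db.execute("INSERT INTO orders (order_id, userid, amount) VALUES (?, ?, ?)",
               (record["order_id"], record["userid"], record["amount"]))
    return "db"


db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, userid TEXT, amount REAL)")
log_dir = tempfile.mkdtemp()      # stand-in for an HDFS directory

# Hypothetical sample records.
records = [
    {"type": "order", "order_id": "o1", "userid": "u1", "amount": 59.0},
    {"type": "click", "day": "2015-10-11", "userid": "u1", "pageid": "home"},
    {"type": "search", "day": "2015-10-11", "userid": "u2", "query": "books"},
]
targets = [store(r, db, log_dir) for r in records]
order_count = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(targets, order_count)
```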

Question 2: We have previously done two big data platform projects, one for IoT and one for CityNext. One had few data formats but a large data volume; the other had complex data formats. The work was based on Hadoop, and a team of dozens of people delivered storage and simple real-time analysis. What should a small company or startup team do?

Dong Sibei: A big data project demands a lot of manpower and time in the early stage, and later on the team often finds that the finished product is far from the initial estimate, which leads to the problem described. So what should be done? I personally think cost problems are very common in projects such as big data platforms, especially in startups. Cost breaks down into three parts: manpower cost, time cost, and operation and maintenance cost. A startup team should be even more cautious, because the market and demand change very quickly, and spending a long time in one place is a waste. Suggestions:

The team should be as small as possible: preferably a technical backbone team of 3-5 people, which facilitates communication and has high execution capabilities. Communication among dozens of people is too cumbersome, and poor communication can easily lead to reduced execution efficiency.
In the early stages of a business, demand and the market change rapidly, so strategies should be adjusted promptly and not too much time should be spent on any one area.
In the early stages of a startup team, it is best to use third-party infrastructure to reduce operation and maintenance costs, such as renting a public cloud.

Question 3: The problem I face at work is data that I consider a pile of junk. The business side hopes we can dig value out of it. We have tried various algorithms, but the results are not ideal. Now I want to give a theoretical limit but do not know where to start. Is there a way to derive the theoretical limit of what various algorithms can achieve from the statistical characteristics of the data?

Dong Sibei: I have run into this problem too and was confused by it. For example, I am now working on a recommendation system that recommends products. Can I keep optimizing my algorithm indefinitely to improve KPIs (click-through rate, share of orders...)? If I mine data for value, can I keep mining indefinitely? And if not, where is the peak? In theory, it is certainly not possible to optimize and mine without limit.

Let me tell you a little story. In the first year, the algorithm team started from scratch and raised the KPI by 30%. The company was very happy. In the second year, the company increased its investment and found that the KPI rose by only 10%. In the third year, the decision-makers invested even more manpower and found that the KPI rose by only 3%. What do you notice? Although investment keeps increasing, KPI growth keeps shrinking. The third year is, in fact, a bottleneck.

To the questioner, I would say: since you have tried every method and still have not found a result, you have hit a bottleneck. To continue the story, its protagonists faced a similar problem. In the fourth year, a product manager joined the company and said: "Change the product colors: call the red one 'literary red' and the blue one 'diaosi blue', and label the products with those literary and diaosi tags." Guess what happened next? The KPI instantly rose by 30%.

So sometimes a bottleneck in data analysis actually reflects a bottleneck in the product model. If you have verified your data honestly, you can be quite sure that the product form or product model has hit a bottleneck. Generally speaking, when the returns from optimization or mining drop significantly, you should reduce your investment, though that does not mean you should stop analyzing. You should then shift toward the product side to find the cause.
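The diminishing returns in the story can be made concrete with a little arithmetic. The sketch below uses only the three yearly gains from the story and assumes they compound on the previous year's KPI, which the story does not state explicitly.

```python
# Yearly KPI gains from the story: 30%, 10%, 3%.
gains = [0.30, 0.10, 0.03]

# Assumption: each gain is relative to the previous year's KPI level.
kpi = 1.0
levels = []
for g in gains:
    kpi *= 1 + g
    levels.append(round(kpi, 4))

# Ratio of each year's gain to the previous year's gain.
ratios = [round(b / a, 2) for a, b in zip(gains, gains[1:])]

print("KPI level after each year:", levels)
print("Gain relative to the previous year's gain:", ratios)
```

Each year's gain is roughly a third of the previous year's even though investment kept growing, which is the signal the story treats as a bottleneck.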

Question 4: What should a traditional enterprise pay attention to when deciding to build its own big data platform? What should it watch for when choosing technology? How should it go about building the team? Chenglian E-commerce is a traditional enterprise mainly dealing in refractory materials. If we start building our own big data platform, is there a more general model to follow?

Dong Sibei: Let's start with the background. The shopping mall has become a fitting room, and everyone buys online. Likewise, what should we do when the building materials market has become a building materials showroom? There are actually two questions here: first, what a real big data analysis platform for a traditional industry looks like; second, how a traditional industry can move into e-commerce. Let's take the first question. Under the impact of the Internet, traditional enterprises also need to use big data to change.

Data planning matters more than technology selection. Data analysis in a traditional industry cannot be separated from the industry background; otherwise the results cannot be put into practice accurately. Examples of data to plan for: customer order time, customer volume, customer characteristics, age range, and customer geographical distribution (these do require specific industry knowledge). Mining and analyzing this data answers questions such as: which refractory materials do users prefer? In which time period is a given refractory material most popular? Who exactly prefers a given refractory material? And what data is generated by the industry's internal operations? With these questions in mind, let's look at which technology platforms are suitable.

Personnel positioning: Not only big data technical talents are needed, but also professionals with a deep industry background and a passion for big data.

Data analysis: It can be roughly divided into business data and behavioral data. Business data (users, orders, payments, logistics...) are generally more accurate and regular. These structured data can be directly stored and analyzed using traditional databases.

Behavioral data (browsing, search history, click history, etc.): this type of data is generally large in volume, so it is best stored in a NoSQL database (such as MongoDB) or on HDFS.

Specify dimensions and indicators: cost, sales, decision, price.

Data mining: Based on the set goals or indicators, predict user demand in advance, analyze user groups (those who are interested, willing to buy, and those who are unaware), regional targeting, etc.

Summary (the key point is industry data planning, especially the formulation of goals and indicators; if the data is not well planned, the system will be mere decoration and hard to put into practice):

Staff composition: Big data technical talents (3-5 people), talents with deep industry background (1-2 people);

Data storage: MySQL for structured business data, NoSQL or HDFS for complex behavioral data;

Analysis and tools: python, hadoop, spark;

Industry data planning: customer order time, customer volume, customer characteristics, age range, and customer geographical distribution;

Analysis goals and indicators: sales volume, cost, region, product type;

Data presentation: line charts, bar charts; tool: HighCharts;
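Questions such as "in which time period is a given refractory material most popular?" reduce to grouping orders by a dimension and counting. The sketch below shows one way to do that with the standard library; the order records, product names, and regions are invented for illustration and are not from the clinic.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical orders: (order time, product, customer region).
orders = [
    ("2015-03-02 09:15", "fire brick", "Hebei"),
    ("2015-03-02 14:40", "fire brick", "Henan"),
    ("2015-03-09 10:05", "fire brick", "Hebei"),
    ("2015-07-21 16:30", "castable", "Shandong"),
    ("2015-07-22 11:20", "castable", "Shandong"),
    ("2015-03-15 13:00", "castable", "Hebei"),
]

by_month = defaultdict(Counter)   # product -> month -> order count
by_region = defaultdict(Counter)  # product -> region -> order count
for ts, product, region in orders:
    month = datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%Y-%m")
    by_month[product][month] += 1
    by_region[product][region] += 1

# For each product, the busiest month and the region with the most orders.
peak_month = {p: c.most_common(1)[0] for p, c in by_month.items()}
top_region = {p: c.most_common(1)[0] for p, c in by_region.items()}
print(peak_month)
print(top_region)
```

The dimensions (time, region, product type) and the indicator (order count) are the parts that need industry knowledge to choose; the counting itself is simple.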

Question 5: Data practitioners in different types of companies may deal with different problems and carry different responsibilities day to day. In a large company each employee may have clearly defined tasks, while data practitioners in small and medium-sized enterprises may need to know more. Could you give examples of the skills data workers need in different types of companies?

Dong Sibei: On this question, I would say that a data worker needs not only technical skills but also a bit of "wit" at work. If we talk only about technology, there are plenty of textbooks and papers, and mastering them alone will not set you apart from other data workers. You also need sensitivity to data and the habit of thinking deeply, including from other people's perspectives.

Let me tell you a real case. Around the end of 2014, someone doing PC data analysis suddenly decided to look at the mobile data as well. He found that mobile data was growing every month, while his team was doing business analysis on PC data. He did not stop once he had the result; he thought about it day and night. Soon he had a dream in which the PC had disappeared, everyone was using mobile devices, and there was no PC data left to analyze. After waking up, he told his leader about his concerns. The leader asked him to analyze all the data and make a forecast along the lines of his dream. After that, he began expanding the mobile data business on a large scale and naturally became a key person for the mobile business. Eventually, under his leadership, the team set up a mobile data analysis group.

In practice, many data engineers finish analyzing the data and stop there, without digging deeper. If you cannot get more out of your data, you may end up with no data to analyze. Will you be out of a job? Even if you know every data analysis method, then what? How does a data worker take the next step? Mastering data analysis methods is only the foundation; you also need your wit.
