Working on big data every day, where do you spend your time?

After working in big data for so many years, have you ever asked yourself: which parts of big data are the most labor-intensive, and which are the most technically difficult?

I think about this every day. Thinking matters: it is how we digest experience and keep deepening our understanding. As the following saying goes:

If, since birth, we have never thought about life itself and have merely followed the customs of society, then life is meaningless, because we have never even thought about living it.

So, have we ever thought about big data itself? What exactly is big data for? Why, after so many years of working on it, is the work never finished? The essence of big data is:

As science and technology advance, ever more data can be stored and analyzed. From this, the concept of big data was born.

The essence of machine learning is:

As the amount of data grows, quantitative change becomes qualitative change: when the data is large enough, the implicit patterns within it become increasingly accurate and complete. Machine learning is the technology that mines these implicit connections out of the data.

Where does big data consume the most workload?

Currently, roughly eighty percent of the workload goes into data collection, cleaning, and verification. The work itself is not difficult, but it is genuinely tedious and laborious.

We sigh every day:

  • Where is the data, and how do we collect it?
  • How do we clean the data?
  • There is too much invalid data; how do we remove it?

What frustrates us is that whenever a new requirement arrives, the existing data format often cannot meet it, and we must dig through the existing pile of data and repeat the whole collection, cleaning, and verification process all over again.
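The collect → clean → verify loop described above can be sketched in a few lines. This is a minimal illustration with made-up log records; the field names and formats are hypothetical, not from any real pipeline:

```python
# Hypothetical raw log lines: date,user,action,value
raw_logs = [
    "2024-01-01,alice,view,42",
    "2024-01-01,bob,click,",   # missing value: invalid
    "bad line",                # malformed: invalid
    "2024-01-02,carol,view,17",
]

def clean(lines):
    """Keep only well-formed records and cast fields to usable types."""
    for line in lines:
        parts = line.split(",")
        if len(parts) != 4 or not parts[3].isdigit():
            continue  # drop malformed or incomplete records
        date, user, action, value = parts
        yield {"date": date, "user": user, "action": action, "value": int(value)}

records = list(clean(raw_logs))

# Verification step: assert invariants before data enters downstream jobs.
assert all(r["value"] >= 0 for r in records)
```

The logic is trivial, which is exactly the point: none of this is hard, but every new requirement forces a new variant of the same filter-and-cast drudgery.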

It seems like a curse, like poor Sisyphus, condemned to push a boulder up a steep mountain: each time, just as he strains to bring it to the summit, it slips away and rolls back down, and he must start over, laboring without end.

What is the biggest technical difficulty currently encountered in big data?

It is ad-hoc querying over massive data. When Hadoop first emerged, we could harness increasingly cheap commodity PC servers, and a kind of brute force permeated the entire ecosystem:

Suddenly having powerful computing capacity is like a poor man suddenly coming into money: we began using raw compute to drive the least efficient programs over our data. This is the tragedy of the batch-processing era.

But as query-efficiency requirements keep rising, we are forced to change. Remember when our logs were all plain raw text? Now various storage formats are blossoming:

  1. Parquet, a columnar storage format popularized by Databricks and the Spark ecosystem
  2. ORC, a common storage format for Hive
  3. CarbonData, a data format from Huawei that supports PB-scale data

In short, no magical technology has appeared that solves the query problem outright, so we can only compromise:

To speed up queries, data storage has gradually moved from early raw text to columnar structures that are vectorized, indexed, and support specific encodings and compression. Of course, adjusting the storage structure this way inevitably costs time and resources at ingestion.

In other words, we traded write-time cost for read-time speed: a compromise between storage and query.
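The two ideas behind that compromise, columnar layout and column-friendly encoding, can be shown with plain lists. This is a conceptual sketch, not how Parquet or ORC are actually implemented:

```python
# Row layout: every query must touch whole records.
rows = [
    ("2024-01-01", "view", 1),
    ("2024-01-01", "view", 1),
    ("2024-01-02", "click", 3),
]

# Columnar layout: one array per field. A query over a single
# column reads only that column's data, not the other fields.
columns = {
    "date":   [r[0] for r in rows],
    "action": [r[1] for r in rows],
    "value":  [r[2] for r in rows],
}

def rle_encode(col):
    """Run-length encoding: repeated values, common in sorted
    columns, compress into (value, count) pairs."""
    runs = []
    for v in col:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

encoded = rle_encode(columns["date"])
# → [("2024-01-01", 2), ("2024-01-02", 1)]: 3 values stored as 2 runs
```

The ingestion cost mentioned above is visible even here: building the column arrays and encoding them is extra work done at write time, paid back at query time.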

How to reduce the hard labor

As mentioned earlier, perhaps 80% of our effort goes into data collection, cleaning, and verification. How do we compress this part of the work?

The answer is:

  • Stream computing
  • A superstructure built on top of stream computing

Once all computation flows, things become easy:

We can introduce a new tributary at any point in the already-flowing data. When I want data, what I essentially do is connect two or more nodes and transform the data between them. Just as with river water, we can easily open a tributary to divert water and irrigate new farmland.
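The tributary idea can be sketched with generators: a pass-through stage copies each event into a new branch while the main flow continues untouched. All the names here are illustrative:

```python
def source():
    """The main flow: a stream of events (finite here for demo)."""
    for i in range(5):
        yield {"id": i, "value": i * 10}

def tee_branch(stream, branch_sink):
    """Open a tributary: pass every event through unchanged,
    while diverting a copy into a new channel."""
    for event in stream:
        branch_sink.append(event)  # the new tributary
        yield event                # the main flow continues

main_results = []
branch = []
for event in tee_branch(source(), branch):
    main_results.append(event["value"])
# Both the original consumer and the new tributary see all the data.
```

Attaching a new consumer this way costs one pass-through stage, rather than a fresh round of collection and cleaning against a static data pile.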

And we hope that the streaming implementation combines stream and batch semantics. Why?

Looking at Huawei's StreamCQL on Storm, we can see that pure record-at-a-time streaming is quite limited in many cases, because in the future we will want streaming to do much more:

  1. Data processing
  2. Ad-Hoc Query
  3. Machine Learning
  4. Reports
  5. Storage Output

This requires a certain degree of flexibility, because only over a data set can we run ad-hoc queries, store efficiently, and fit many machine learning algorithms. In many cases a single record has little meaning on its own.
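One way to get data-set semantics out of a stream is micro-batching, roughly the model Spark Streaming uses: group the stream into small datasets and run set-level operations on each. This is a conceptual sketch, not Spark's actual API:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Chop a stream into small datasets of up to batch_size records,
    in the spirit of Spark Streaming's micro-batch model."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = range(7)
# Per-batch aggregation: an operation that makes no sense
# on a single record becomes natural on a small dataset.
sums = [sum(b) for b in micro_batches(events, 3)]
# → [3, 12, 6]
```

Each batch is a dataset, so aggregates, joins, or model updates become possible per batch, at the cost of a small, bounded latency.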

I have always been a supporter of Spark Streaming.

So why do we need a superstructure on top of stream computing? Let's revisit the problem: the data ETL process is hard labor that consumes a great deal of programmers' working time. To reduce this time, we have two ways:

  • Distribute some of the tasks so that everyone can do them; then, with the total amount unchanged, each individual's share shrinks.
  • Improve everyone's productivity.

Stream computing provides the foundation, and the framework built on top of it makes both of these possible.
