Wu Hong: Mango TV big data platform architecture and basic component optimization

On August 29, 2015, Qiniu held the two-day Qiniu Data Era Summit, themed "Data Reconstructing the Future", at the Shanghai International Fashion Center. The conference brought together well-known data experts from home and abroad, along with data practitioners from the Internet and traditional industries, for a feast of data.


Relying on the content and user resources of Hunan Radio and Television, Mango TV grew rapidly in 2014, with weekly clicks exceeding 170 million on July 5 of that year. This Qiniu Data Era Summit specially invited Peng Zhefu, data director of Mango TV, to share Mango TV's data processing practices.

The following is the transcript of the speech:

The Mango TV data team started planning last year; it now has ten people, more than 150 nodes, and 1.5 PB of data. The overall system is divided into three business systems. The first is the Data Cube, which is responsible for statistics on important indicators. The second is the recommendation system, which Mango TV uses to convert and guide traffic. The third is the video content analysis system: a great deal of Internet behaviour data can be converted into the kind of data traditional media needs, so Mango TV can feed viewing records back to directors to inform exciting content or plot development.

Today, Mango TV's data department supports 70%-80% of the company's business. This talk is divided into three parts: the first is the basics, the second is integration, and the third is data management.

Collection is the production side of data and determines whether the data is usable at all. When collecting data, we pay close attention to bandwidth cost: for a video company, bandwidth and copyright are the major parts of the cost structure. We therefore developed an SDK that sends the collected data to our custom collection system, which classifies it and forwards it to FDS, where it is eventually turned into the data that makes up our database.

In terms of real-time computing, it is mainly used for quality monitoring during playback; the results are written back to ES for real-time queries.
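As a hedged illustration of this kind of real-time query, the sketch below uses the elasticsearch-py client against an assumed playback-quality index; the index name, field names, and cluster address are illustrative, not Mango TV's actual setup.

```python
# Sketch: pull recent playback-error events from Elasticsearch for quality monitoring.
# Index name, field names, and the cluster URL are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="playback_quality",
    query={
        "bool": {
            "must": [
                {"term": {"event_type": "play_error"}},   # only error events
                {"range": {"ts": {"gte": "now-5m"}}},     # from the last five minutes
            ]
        }
    },
    size=100,
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```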

In the early collection scheme, each tracked element called its own method and sent all of its parameters to the server. The disadvantage is that as the number of collection points grows, the code becomes harder to maintain and lacks any systematic structure. So we introduced an abstraction and classified events into models during collection, such as page data, error data, and playback data.
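A minimal sketch of what such a model abstraction could look like on the client side; the class names, fields, and endpoint are illustrative assumptions, not Mango TV's actual SDK.

```python
# Sketch: classify collection points into a few log models instead of one ad-hoc call per point.
# The field names and the collection endpoint are illustrative assumptions.
from dataclasses import dataclass, asdict
import json, time, urllib.request

@dataclass
class BaseEvent:
    user_id: str
    ts: float

@dataclass
class PageEvent(BaseEvent):      # page views / navigation
    url: str

@dataclass
class PlayEvent(BaseEvent):      # playback behaviour
    video_id: str
    position_sec: int

@dataclass
class ErrorEvent(BaseEvent):     # client-side errors
    code: int
    message: str

def send(event: BaseEvent, endpoint: str = "https://collect.example.com/v1/log") -> None:
    """Serialise any event model and POST it to the collection service."""
    payload = json.dumps({"type": type(event).__name__, **asdict(event)}).encode()
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=2)

if __name__ == "__main__":
    event = PlayEvent(user_id="u1", ts=time.time(), video_id="v42", position_sec=30)
    print(json.dumps({"type": type(event).__name__, **asdict(event)}))
    # send(event)  # would POST to the (hypothetical) collection endpoint
```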

Another issue is events: how an event gets triggered and bound. We bind an element's name to an event through backend configuration. When the page is loaded, this configuration is loaded along with it, and based on the loaded configuration it is decided which data needs to be reported and which does not.
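A hedged sketch of that configuration-driven reporting: an element-to-event mapping is fetched when the page loads, and only configured elements are reported. The config endpoint and its JSON shape are assumptions.

```python
# Sketch: decide what to report based on a configuration fetched at page load.
# The config URL and its {element_name: event_name} shape are illustrative assumptions.
import json, urllib.request

def load_report_config(url: str = "https://config.example.com/report.json") -> dict:
    """Fetch element-to-event bindings that are maintained in the backend."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.loads(resp.read())

REPORT_CONFIG: dict = {}   # populated once at startup, e.g. REPORT_CONFIG = load_report_config()

def maybe_report(element: str, payload: dict) -> None:
    """Report only elements that the backend configuration has bound to an event."""
    event = REPORT_CONFIG.get(element)
    if event is None:
        return                                  # not configured -> nothing is sent
    record = {"event": event, **payload}
    print("would report:", record)              # hand off to the SDK's existing transport here

REPORT_CONFIG.update({"play_button": "vv_click"})
maybe_report("play_button", {"video_id": "v42"})   # reported
maybe_report("logo", {})                           # silently skipped
```

The point of the pattern is that adding or dropping a collection point becomes a configuration change rather than a client release.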

This mode shortens what used to be a long development cycle: we only need to change the backend configuration, and the data starts arriving immediately. For collection itself, the usual practice is to place a pixel image and append parameters to it, but this method carries a very high bandwidth cost; search reporting alone occupied about 600 megabytes of bandwidth. To push server resource usage down to the minimum, we switched to PT for data reporting, which saves nearly two-thirds of the bandwidth cost.
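For context, the pixel approach and a more compact batched report look roughly like the sketch below; the URL, parameter names, and the batching-plus-compression scheme are illustrative assumptions, included only to show where the bandwidth saving comes from.

```python
# Sketch: a tracking-pixel report versus a batched, compressed report.
# The pixel URL, parameters, and the compact encoding are illustrative assumptions;
# the comparison also ignores HTTP header overhead, so it is only a rough one.
import base64, json, zlib
from urllib.parse import urlencode

params = {"uid": "u1", "vid": "v42", "event": "vv", "ref": "https://www.mgtv.com/home"}

# 1) Pixel style: one GET per event, parameters spelled out in the query string.
pixel_url = "https://log.example.com/1x1.gif?" + urlencode(params)

# 2) Compact style: batch several events, then compress before sending.
batch = [params] * 10
compact_body = base64.b64encode(zlib.compress(json.dumps(batch).encode()))

print(len(pixel_url) * 10, "bytes of query strings vs", len(compact_body), "bytes batched")
```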

We also ran into problems in transmission, the most important being that it consumed too many resources. In fact, once we analysed each part specifically, the problem was not difficult to solve.

When the data volume gets too large we hit problems, mainly because a new folder is created at regular intervals, and during testing this took much longer than expected. So we made an adjustment and achieved a very good optimization with a single-threaded approach. With versions 1.5 and 1.6, system memory usage balloons severely, so we usually add a configuration parameter or simply change the location.

When choosing between ordinary files and folders, the main consideration is efficiency. It had previously been suggested that files be combined when the data volume is high. When writing FTX, a file can end up in a closed state, which causes our writes to fail with errors, so we need to monitor for it. In addition, many small files may be generated, which puts extra pressure on the system; with data at this scale, the volume speaks for itself. The compression method we use reduces the data volume by about 80%.

For queue transmission we use Kafka. In practice, it is not the case that more partitions are always better: more partitions means higher memory limits on both the client and the server, and each partition produces two files, which increases the number of open file handles. In addition, Kafka's own mechanism for partition leadership involves an election process, and the more partitions there are, the longer that takes, which affects usage.

Our approach is to pick one machine, create a single partition, and then benchmark production and consumption. What we care about most is throughput, so the maximum production throughput (Tp) and consumption throughput (Tc) we measure can be used to decide our partitioning.
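This is essentially the common single-partition benchmark rule: measure production throughput (Tp) and consumption throughput (Tc) on one partition, then size the topic against the target throughput T. A small sketch, with placeholder numbers rather than measured values:

```python
# Sketch: derive a Kafka partition count from single-partition benchmark results.
# The throughput figures below are placeholders, not measured values.
import math

def partition_count(target_mb_s: float, tp_mb_s: float, tc_mb_s: float) -> int:
    """partitions >= max(T/Tp, T/Tc): enough partitions to reach the target
    throughput on both the producer side (Tp) and the consumer side (Tc)."""
    return max(math.ceil(target_mb_s / tp_mb_s), math.ceil(target_mb_s / tc_mb_s))

print(partition_count(target_mb_s=500, tp_mb_s=60, tc_mb_s=80))  # -> 9
```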

For storage we use a multi-level scheme. The problem we ran into is that as the data volume grows, a lot of it is cold data, which adds to the workload. So we split storage into three tiers: the main characteristic of the hot tier is that CPU and memory are more abundant, and for cold data we can reduce the number of replicas or push it out to cloud storage.
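A hedged sketch of that tiering idea, assuming HDFS-style storage (the talk does not name the system): files are bucketed by age, the replica count is lowered for warm data, and cold data is shipped to cloud storage. The paths, age thresholds, and shell commands are assumptions.

```python
# Sketch: age-based tiering -- keep hot data at full replication,
# reduce replicas for warm data, and push cold data out to cloud storage.
# Paths, thresholds, and the use of hdfs/distcp are illustrative assumptions.
import subprocess, time

HOT_DAYS, WARM_DAYS = 7, 90

def tier_for(age_days: float) -> str:
    if age_days <= HOT_DAYS:
        return "hot"      # stays on the CPU/memory-rich tier at default replication
    if age_days <= WARM_DAYS:
        return "warm"     # fewer replicas to save space
    return "cold"         # moved to cheaper cloud storage

def apply_policy(path: str, modified_epoch: float) -> None:
    age_days = (time.time() - modified_epoch) / 86400
    tier = tier_for(age_days)
    if tier == "warm":
        subprocess.run(["hdfs", "dfs", "-setrep", "-w", "2", path], check=True)
    elif tier == "cold":
        subprocess.run(["hadoop", "distcp", path, "s3a://archive-bucket" + path], check=True)
```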

Another problem in storage is compression. Without good planning in the early stages, we found that storage space was not enough. We choose compression schemes based on our own business, and we use archiving to consolidate small files.
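One common way to archive small files on Hadoop is the Hadoop Archive (HAR) tool; whether that is exactly what is meant here is not stated, so treat the wrapper below as an assumed example.

```python
# Sketch: pack a directory of small files into a Hadoop Archive (HAR).
# The paths and archive name are illustrative assumptions.
import subprocess

def archive_small_files(parent: str, src: str, dest: str, name: str) -> None:
    """Wrap `hadoop archive` so daily log directories can be packed on a schedule."""
    subprocess.run(
        ["hadoop", "archive", "-archiveName", name, "-p", parent, src, dest],
        check=True,
    )

# e.g. pack /logs/2015-08-29 into /archive/2015-08-29.har
archive_small_files(parent="/logs", src="2015-08-29", dest="/archive", name="2015-08-29.har")
```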

In terms of configuration, we consolidate the configuration and push it out, mainly through an RPC-based control model that controls all components. Our data service platform has to support many of the company's businesses; with a single account they can use the transmission service of our collection servers and the real-time computing service, and we also provide resource and traffic monitoring services...

Let me focus on how we manage data on the platform. It can be divided into several parts. First, log types are abstracted, which is closely tied to the company's business: we classify logs into playback logs, advertising logs, and so on. There is a particularly interesting point here. Mango TV cares most about core indicators such as PV and VV, but if our calculation method differs from that of our peers, the numbers will not be comparable across the industry. So we define each indicator from several aspects: one is its concept, the commonly understood meaning; then how it is reported and what kind of results it leads to.

Another aspect is the reporting content and the calculation formula. After the data arrives at the platform, the most important thing is to manage it. Why do we need to manage it? Essentially to classify the data; the resulting subject-oriented management classifies the data again around a particular core point, and this method is very similar to our log abstraction.
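To go back to the PV/VV comparability point, here is a hedged illustration of why the exact definitions matter: the field names and counting rules below are assumptions, and changing a rule (for example, counting play attempts instead of successful starts) changes the reported numbers.

```python
# Sketch: compute PV (page views) and VV (video views) from classified log records.
# The record fields and counting rules are illustrative assumptions.
logs = [
    {"type": "page", "user": "u1", "url": "/home"},
    {"type": "play", "user": "u1", "video": "v42", "started": True},
    {"type": "play", "user": "u2", "video": "v42", "started": False},  # failed before the first frame
    {"type": "page", "user": "u2", "url": "/show/v42"},
]

pv = sum(1 for r in logs if r["type"] == "page")                   # every page view counts
vv = sum(1 for r in logs if r["type"] == "play" and r["started"])  # only plays that actually started
uv = len({r["user"] for r in logs})                                # distinct users

print(f"PV={pv}, VV={vv}, UV={uv}")   # PV=2, VV=1, UV=2
```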

After the data warehouse is established, anyone else who needs to use the warehouse data needs a detailed description, that is, metadata. This metadata is divided into two categories. One is technical metadata, which is mainly used by developers and covers warehouse structures and extraction rules; the other serves the people who consume the data. Without such descriptions, data quality cannot be guaranteed.

Why do we need data marts? Every company has many business departments, and each department faces different problems; from a statistical perspective, each cares about different data. If they all worked against the warehouse directly, the warehouse would not stay stable, so data marts are needed for isolation. In this process we can extract data, queue it up, and load it into relational storage. The results of these marts can be shared and exchanged with one another. More important still is the management and maintenance of fact tables and dimension tables.
