Wu Hong: Mango TV big data platform architecture and basic component optimization

On August 29, 2015, Qiniu held the two-day Qiniu Data Era Summit, themed "Data Reconstructing the Future", at the Shanghai International Fashion Center. The conference brought together well-known data experts from home and abroad, along with data practitioners from the Internet and traditional industries, for a feast of data.


Relying on the content and user resources of Hunan Radio and Television, Mango TV grew rapidly in 2014, with weekly clicks exceeding 170 million on July 5 of that year. This Qiniu Data Era Summit specially invited Peng Zhefu, data director of Mango TV, to share Mango TV's data processing practices.

The following is the transcript of the speech:

The Mango TV data team started planning last year; it now has ten people, more than 150 nodes, and 1.5 PB of data. The overall system is divided into three business systems. The first is the Data Cube, which is responsible for statistics on important indicators. The second is the recommendation system, which Mango TV uses to convert and guide traffic. The third is the video content analysis system: a great deal of Internet behaviour data can be converted into the kind of data traditional media needs, so Mango TV can feed viewing records back to directors to inform exciting content or plot development.

Today, Mango TV's data department supports 70%-80% of the company's business. This talk is divided into three parts: the first is the basics, the second is integration, and the third is data management.

Collection is the production side of data and determines whether the data is usable at all. When collecting data, we pay close attention to bandwidth cost: for a video company, bandwidth and copyright are the major parts of the cost structure. We therefore developed an SDK that sends the collected data to our custom collection system, which classifies it and forwards it to FDS, where it is eventually turned into the data that makes up our database.

In terms of real-time computing, it is mainly used for quality monitoring during playback; the results are written back to ES for real-time queries.
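As a hedged illustration of this kind of real-time query, the sketch below uses the elasticsearch-py client against an assumed playback-quality index; the index name, field names, and cluster address are illustrative, not Mango TV's actual setup.

```python
# Sketch: pull recent playback-error events from Elasticsearch for quality monitoring.
# Index name, field names, and the cluster URL are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="playback_quality",
    query={
        "bool": {
            "must": [
                {"term": {"event_type": "play_error"}},   # only error events
                {"range": {"ts": {"gte": "now-5m"}}},     # from the last five minutes
            ]
        }
    },
    size=100,
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```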

In the early collection scheme, each tracked element called its own method and sent all of its parameters to the server. The disadvantage is that as the number of collection points grows, the code becomes harder to maintain and lacks any systematic structure. So we introduced an abstraction and classified events into models during collection, such as page data, error data, and playback data.
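A minimal sketch of what such a model abstraction could look like on the client side; the class names, fields, and endpoint are illustrative assumptions, not Mango TV's actual SDK.

```python
# Sketch: classify collection points into a few log models instead of one ad-hoc call per point.
# The field names and the collection endpoint are illustrative assumptions.
from dataclasses import dataclass, asdict
import json, time, urllib.request

@dataclass
class BaseEvent:
    user_id: str
    ts: float

@dataclass
class PageEvent(BaseEvent):      # page views / navigation
    url: str

@dataclass
class PlayEvent(BaseEvent):      # playback behaviour
    video_id: str
    position_sec: int

@dataclass
class ErrorEvent(BaseEvent):     # client-side errors
    code: int
    message: str

def send(event: BaseEvent, endpoint: str = "https://collect.example.com/v1/log") -> None:
    """Serialise any event model and POST it to the collection service."""
    payload = json.dumps({"type": type(event).__name__, **asdict(event)}).encode()
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=2)

if __name__ == "__main__":
    event = PlayEvent(user_id="u1", ts=time.time(), video_id="v42", position_sec=30)
    print(json.dumps({"type": type(event).__name__, **asdict(event)}))
    # send(event)  # would POST to the (hypothetical) collection endpoint
```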

Another issue is events: how an event gets triggered and bound. We bind an element's name to an event through backend configuration. When the page is loaded, this configuration is loaded along with it, and based on the loaded configuration it is decided which data needs to be reported and which does not.
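A hedged sketch of that configuration-driven reporting: an element-to-event mapping is fetched when the page loads, and only configured elements are reported. The config endpoint and its JSON shape are assumptions.

```python
# Sketch: decide what to report based on a configuration fetched at page load.
# The config URL and its {element_name: event_name} shape are illustrative assumptions.
import json, urllib.request

def load_report_config(url: str = "https://config.example.com/report.json") -> dict:
    """Fetch element-to-event bindings that are maintained in the backend."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.loads(resp.read())

REPORT_CONFIG: dict = {}   # populated once at startup, e.g. REPORT_CONFIG = load_report_config()

def maybe_report(element: str, payload: dict) -> None:
    """Report only elements that the backend configuration has bound to an event."""
    event = REPORT_CONFIG.get(element)
    if event is None:
        return                                  # not configured -> nothing is sent
    record = {"event": event, **payload}
    print("would report:", record)              # hand off to the SDK's existing transport here

REPORT_CONFIG.update({"play_button": "vv_click"})
maybe_report("play_button", {"video_id": "v42"})   # reported
maybe_report("logo", {})                           # silently skipped
```

The point of the pattern is that adding or dropping a collection point becomes a configuration change rather than a client release.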

This mode shortens what used to be a long development cycle: we only need to change the backend configuration, and the data starts arriving immediately. For collection itself, the usual practice is to place a pixel image and append parameters to it, but this method carries a very high bandwidth cost; search reporting alone occupied about 600 megabytes of bandwidth. To push server resource usage down to the minimum, we switched to PT for data reporting, which saves nearly two-thirds of the bandwidth cost.
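For context, the pixel approach and a more compact batched report look roughly like the sketch below; the URL, parameter names, and the batching-plus-compression scheme are illustrative assumptions, included only to show where the bandwidth saving comes from.

```python
# Sketch: a tracking-pixel report versus a batched, compressed report.
# The pixel URL, parameters, and the compact encoding are illustrative assumptions;
# the comparison also ignores HTTP header overhead, so it is only a rough one.
import base64, json, zlib
from urllib.parse import urlencode

params = {"uid": "u1", "vid": "v42", "event": "vv", "ref": "https://www.mgtv.com/home"}

# 1) Pixel style: one GET per event, parameters spelled out in the query string.
pixel_url = "https://log.example.com/1x1.gif?" + urlencode(params)

# 2) Compact style: batch several events, then compress before sending.
batch = [params] * 10
compact_body = base64.b64encode(zlib.compress(json.dumps(batch).encode()))

print(len(pixel_url) * 10, "bytes of query strings vs", len(compact_body), "bytes batched")
```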

We also ran into problems in transmission, the most important being that it consumed too many resources. In fact, once we analysed each part specifically, the problem was not difficult to solve.

When the data volume gets too large we hit problems, mainly because a new folder is created at regular intervals, and during testing this took much longer than expected. So we made an adjustment and achieved a very good optimization with a single-threaded approach. With versions 1.5 and 1.6, system memory usage balloons severely, so we usually add a configuration parameter or simply change the location.

When choosing between ordinary files and folders, the main consideration is efficiency. It had previously been suggested that files be combined when the data volume is high. When writing FTX, a file can end up in a closed state, which causes our writes to fail with errors, so we need to monitor for it. In addition, many small files may be generated, which puts extra pressure on the system; with data at this scale, the volume speaks for itself. The compression method we use reduces the data volume by about 80%.

For queue transmission we use Kafka. In practice, it is not the case that more partitions are always better: more partitions means higher memory limits on both the client and the server, and each partition produces two files, which increases the number of open file handles. In addition, Kafka's own mechanism for partition leadership involves an election process, and the more partitions there are, the longer that takes, which affects usage.

Our approach is to pick one machine, create a single partition, and then benchmark production and consumption. What we care about most is throughput, so the maximum production throughput (Tp) and consumption throughput (Tc) we measure can be used to decide our partitioning.
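This is essentially the common single-partition benchmark rule: measure production throughput (Tp) and consumption throughput (Tc) on one partition, then size the topic against the target throughput T. A small sketch, with placeholder numbers rather than measured values:

```python
# Sketch: derive a Kafka partition count from single-partition benchmark results.
# The throughput figures below are placeholders, not measured values.
import math

def partition_count(target_mb_s: float, tp_mb_s: float, tc_mb_s: float) -> int:
    """partitions >= max(T/Tp, T/Tc): enough partitions to reach the target
    throughput on both the producer side (Tp) and the consumer side (Tc)."""
    return max(math.ceil(target_mb_s / tp_mb_s), math.ceil(target_mb_s / tc_mb_s))

print(partition_count(target_mb_s=500, tp_mb_s=60, tc_mb_s=80))  # -> 9
```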

For storage we use a multi-level scheme. The problem we ran into is that as the data volume grows, a lot of it is cold data, which adds to the workload. So we split storage into three tiers: the main characteristic of the hot tier is that CPU and memory are more abundant, and for cold data we can reduce the number of replicas or push it out to cloud storage.
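A hedged sketch of that tiering idea, assuming HDFS-style storage (the talk does not name the system): files are bucketed by age, the replica count is lowered for warm data, and cold data is shipped to cloud storage. The paths, age thresholds, and shell commands are assumptions.

```python
# Sketch: age-based tiering -- keep hot data at full replication,
# reduce replicas for warm data, and push cold data out to cloud storage.
# Paths, thresholds, and the use of hdfs/distcp are illustrative assumptions.
import subprocess, time

HOT_DAYS, WARM_DAYS = 7, 90

def tier_for(age_days: float) -> str:
    if age_days <= HOT_DAYS:
        return "hot"      # stays on the CPU/memory-rich tier at default replication
    if age_days <= WARM_DAYS:
        return "warm"     # fewer replicas to save space
    return "cold"         # moved to cheaper cloud storage

def apply_policy(path: str, modified_epoch: float) -> None:
    age_days = (time.time() - modified_epoch) / 86400
    tier = tier_for(age_days)
    if tier == "warm":
        subprocess.run(["hdfs", "dfs", "-setrep", "-w", "2", path], check=True)
    elif tier == "cold":
        subprocess.run(["hadoop", "distcp", path, "s3a://archive-bucket" + path], check=True)
```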

Another problem in storage is compression. Without good planning in the early stages, we found that storage space was not enough. We choose compression schemes based on our own business, and we use archiving to consolidate small files.
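One common way to archive small files on Hadoop is the Hadoop Archive (HAR) tool; whether that is exactly what is meant here is not stated, so treat the wrapper below as an assumed example.

```python
# Sketch: pack a directory of small files into a Hadoop Archive (HAR).
# The paths and archive name are illustrative assumptions.
import subprocess

def archive_small_files(parent: str, src: str, dest: str, name: str) -> None:
    """Wrap `hadoop archive` so daily log directories can be packed on a schedule."""
    subprocess.run(
        ["hadoop", "archive", "-archiveName", name, "-p", parent, src, dest],
        check=True,
    )

# e.g. pack /logs/2015-08-29 into /archive/2015-08-29.har
archive_small_files(parent="/logs", src="2015-08-29", dest="/archive", name="2015-08-29.har")
```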

In terms of configuration, we consolidate the configuration and push it out, mainly through an RPC-based control model that controls all components. Our data service platform has to support many of the company's businesses; with a single account they can use the transmission service of our collection servers and the real-time computing service, and we also provide resource and traffic monitoring services...

Let me focus on how we manage data on the platform. It can be divided into several parts. First, log types are abstracted, which is closely tied to the company's business: we classify logs into playback logs, advertising logs, and so on. There is a particularly interesting point here. Mango TV cares most about core indicators such as PV and VV, but if our calculation method differs from that of our peers, the numbers will not be comparable across the industry. So we define each indicator from several aspects: one is its concept, the commonly understood meaning; then how it is reported and what kind of results it leads to.

Another aspect is the reporting content and the calculation formula. After the data arrives at the platform, the most important thing is to manage it. Why do we need to manage it? Essentially to classify the data; the resulting subject-oriented management classifies the data again around a particular core point, and this method is very similar to our log abstraction.
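To go back to the PV/VV comparability point, here is a hedged illustration of why the exact definitions matter: the field names and counting rules below are assumptions, and changing a rule (for example, counting play attempts instead of successful starts) changes the reported numbers.

```python
# Sketch: compute PV (page views) and VV (video views) from classified log records.
# The record fields and counting rules are illustrative assumptions.
logs = [
    {"type": "page", "user": "u1", "url": "/home"},
    {"type": "play", "user": "u1", "video": "v42", "started": True},
    {"type": "play", "user": "u2", "video": "v42", "started": False},  # failed before the first frame
    {"type": "page", "user": "u2", "url": "/show/v42"},
]

pv = sum(1 for r in logs if r["type"] == "page")                   # every page view counts
vv = sum(1 for r in logs if r["type"] == "play" and r["started"])  # only plays that actually started
uv = len({r["user"] for r in logs})                                # distinct users

print(f"PV={pv}, VV={vv}, UV={uv}")   # PV=2, VV=1, UV=2
```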

After the data warehouse is established, anyone else who needs to use the warehouse data needs a detailed description, that is, metadata. This metadata is divided into two categories. One is technical metadata, which is mainly used by developers and covers warehouse structures and extraction rules; the other serves the people who consume the data. Without such descriptions, data quality cannot be guaranteed.

Why do we need data marts? Every company has many business departments, and each department faces different problems; from a statistical perspective, each cares about different data. If they all worked against the warehouse directly, the warehouse would not stay stable, so data marts are needed for isolation. In this process we can extract data, queue it up, and load it into relational storage. The results of these marts can be shared and exchanged with one another. More important still is the management and maintenance of fact tables and dimension tables.
