CTO Training Camp Guo Jiangliang: Big Data Platform in Baidu Private Cloud and Open Cloud

Guo Jiangliang, R&D manager of the Baidu Open Cloud Big Data Platform, gave a talk titled "Big Data Platform in Baidu Private Cloud and Open Cloud" at the "CTO Training Camp Fourth Class Baidu Technology Special" hosted by 51CTO Gaozhao. The talk covered the large-scale distributed computing technology in Baidu's private cloud, the big data products and technical architecture of Baidu Open Cloud, and Baidu's current thinking on open cloud + big data + industry.

Baidu has accumulated product ideas and experience in combining cloud computing and big data with industries such as finance and medical care. In earlier years Baidu worked on distributed storage, and more recently on distributed computing such as Hadoop. In 2014 Baidu launched a public cloud business aimed at external customers: an open cloud offering enterprise-level services, similar to Alibaba Cloud and AWS. The public cloud takes Baidu's internal risk control architecture and extends it to serve external enterprises.

Distributed computing in private cloud

Private cloud distributed computing technology stack

Distributed computing is built on top of Matrix's resource scheduling. The layer between distributed computing and the underlying resources acts as an adapter, and the two combined correspond to their counterpart in the open source community. Above that sit the various computing engines, covering both offline computing and real-time computing: the real-time computing platforms, DCE in the middle (a computing engine similar to Hadoop), the ELF platform next to it, and Spark on the far right.

All of Baidu's machines have been pooled as resources, and these resources are managed in a unified way, so far for offline workloads. Baidu is now gradually bringing online business onto the platform as well. Because online and offline workloads differ, Baidu's data centers are also split into online data centers and offline data centers.

Why did Baidu develop everything in-house in the past? Baidu had its own internal demands, along with some other considerations. Baidu is a big data company whose business is search, not just a data company, and the data challenges it faces are enormous, exceeding the problems the community was facing. Baidu did draw on ideas from the community at the beginning, but its demands grew so large and so fast that the community could not keep up, so it set out on the road of in-house development. The result amounts to Baidu's entire technology stack.

Apart from Google's, Baidu's is probably the world's largest offline computing cluster, or offline computing platform. It started from Hadoop and added many C++ extensions along the way to solve performance problems.

Baidu Offline Computing

Provides high-throughput offline computing services for Baidu
100,000+ servers and 20+ clusters; the largest single cluster has a size of 13,000
Average daily throughput is in the hundreds of petabytes, with over 500,000 jobs per day on average

Baidu Real-time Computing

Provides efficient computing services with millisecond-level latency for Baidu
Cluster size is nearly 10,000, serving over 80 product lines
Provides a general stream join solution

Another consideration is that building such platforms requires the right technical experts, cluster resources, networking, and budget, all of which are relatively costly. If you do not want to build them yourself, you can choose a public cloud such as Baidu Open Cloud.

Open cloud and big data platform

Baidu Open Cloud can support data applications over R+ user data. The main sources are apps such as Baidu Mobile and Baidu Maps. All data is collected and processed in a unified way, so the platform is backed by multiple products and professional technical experts.

Baidu Open Cloud Product Overview

Big Data Processing

A data requirement calls for planning the complete process: collection, transmission, storage, processing and transformation, and finally analysis and application form one complete flow. Today's data differs from what came before, though. In CRM, for example, the move from the Internet to the mobile Internet has brought more and more data types along with high demands on timeliness, so collecting and transmitting data quickly is itself a problem.

Collection: Baidu faces many kinds of raw data that differ in format, location, storage, timeliness, and so on. Data is collected from heterogeneous sources and converted into a corresponding format to make processing easier.

Storage: Because this is big data processing, the data must be stored first. Collected data needs to be placed in suitable storage according to cost, format, query, and business-logic requirements so that it can be analyzed further. Some industries have special needs, such as the genetic industry (genetic big data and sequencing): sequencing one person's genetic data means uploading many gigabytes, which is a large volume. There are also timeliness requirements. The radio and television industry, for example, needs network transfer, and it has some networks of its own, all of which are online. For existing bulk data, shipping hard disks by courier is a more convenient route, with data security, encryption, and even agreements covering the process. Once the existing data has been loaded, subsequent incremental data is sent gradually over the public network, and the transfer can keep growing steadily even after an interruption.

Transformation: Raw data needs to be transformed and enriched before it is suitable for analysis. Examples include replacing IP addresses in web logs with provinces and cities, correcting sensor data, and aggregating user behavior statistics.
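The three transformation examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prefix-to-region table, the sensor offset, the record fields, and every function name are hypothetical, and a production system would use a full IP geolocation database and a real calibration model.

```python
from collections import Counter

# Hypothetical prefix-to-region table (assumption for this sketch only).
IP_PREFIX_TO_REGION = {
    "10.1.": ("Beijing", "Beijing"),
    "10.2.": ("Guangdong", "Shenzhen"),
}

# Assumed constant calibration offset applied to every sensor reading.
SENSOR_OFFSET = -0.5


def ip_to_region(ip):
    """Return (province, city) by prefix lookup, or ("unknown", "unknown")."""
    for prefix, region in IP_PREFIX_TO_REGION.items():
        if ip.startswith(prefix):
            return region
    return ("unknown", "unknown")


def transform_log(record):
    """Replace the raw "ip" field of a log record with province and city."""
    province, city = ip_to_region(record["ip"])
    enriched = {key: value for key, value in record.items() if key != "ip"}
    enriched["province"] = province
    enriched["city"] = city
    return enriched


def correct_sensor(reading):
    """Apply the assumed calibration offset to one sensor reading."""
    return round(reading + SENSOR_OFFSET, 2)


def count_actions_per_user(records):
    """Count how many log records each user produced."""
    return Counter(record["user"] for record in records)


logs = [
    {"user": "u1", "ip": "10.1.0.7", "action": "search"},
    {"user": "u2", "ip": "10.2.3.4", "action": "click"},
    {"user": "u1", "ip": "10.1.0.7", "action": "click"},
]
transformed = [transform_log(record) for record in logs]
stats = count_actions_per_user(transformed)
```

Keeping each step as a small pure function makes it straightforward to run the same logic inside a batch job or a streaming job later.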

Analysis: With the data organized, analyze what happened, why it happened, what is happening now, and what will happen, and keep asking questions like these to help companies make decisions.

Baidu Open Cloud Big Data Stack

Advantages

Relying on Baidu technology. Baidu Search indexes more than one trillion web pages worldwide and handles billions of requests from Chinese internet users every day. Its big data technology supports more than 20 products with over 100 million users and millions of corporate customers. In 2013, Baidu built the world's largest Hadoop cluster. In 2014, BaiduSort, Baidu's big data processing system, won the championship in the international sorting competition.

Open source. Baidu Open Cloud offers hosted open source products or products with fully compatible interfaces, which eases migration for Internet companies and traditional enterprises; users need not worry about being locked in to a specific platform or technology.

Advanced products. Open source products are hardened to be more stable, efficient, and secure, which greatly improves their maturity. Cloud-based hosting lets users focus on their business rather than on fixing defects and running operations. The products have been proven inside Baidu and are suitable for enterprise production deployments.

BMR

BMR is a hosted Hadoop/Spark service, and the first big data service in China that is fully compatible with open source Hadoop, so big data can be processed with MapReduce, Spark, HBase, Hive, Pig, Kafka, and so on. A cluster can be created in a few minutes without worrying about node allocation, deployment, or tuning, and the rich examples and scenario tutorials help users get started quickly and reach their business goals. Clusters can be large or small and support dynamic scaling, which avoids wasted resources. Computing and storage can be separated: the cluster handles processing while data is stored on the BOS cloud storage service. Because BMR is fully compatible with the open source community versions of Hadoop and Spark, customers can write jobs against the standard open source APIs and migrate to the cloud without modification. Key components in the cluster, such as Hadoop, Spark, and HBase, support high availability to keep the service available.

Applicable business scenarios include log analysis, data organization, and real-time stream processing.
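As an illustration of the log analysis scenario, the following is a minimal local sketch of the MapReduce model in plain Python. It does not use any BMR or Hadoop API; the three-field log format (client IP, path, status code) and all function names are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (status_code, 1) for one access-log line; skip malformed lines."""
    parts = line.split()
    if len(parts) < 3:
        return []
    return [(parts[2], 1)]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one status code."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Assumed log format: "<client ip> <path> <status code>".
log_lines = [
    "10.1.0.7 /index.html 200",
    "10.2.3.4 /missing 404",
    "10.1.0.7 /search 200",
    "malformed",
]
status_counts = run_job(log_lines)
```

On a real cluster the map and reduce functions would run in parallel across nodes and the shuffle would be handled by the framework; only the two user-defined functions carry the business logic.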

Palo

Palo is a PB-level online analytical processing (OLAP) engine that provides online reporting and multi-dimensional analysis services with the advantages of stability, efficiency, and low cost. It has an industry-leading MPP query engine with column storage, intelligent indexing, and vectorized execution; it is highly compatible with the SQL standard and provides advanced analysis features such as in-database analysis and window functions. Data and metadata are stored in multiple replicas, so query services are unaffected by downtime and replicas on failed machines are migrated automatically. Materialized views can be created and table schemas changed without stopping the service, and flexible, efficient data recovery is supported. Cluster management is visual, data import is convenient, and standard SQL operations are supported.

Applicable business scenarios include online analysis, multidimensional analysis, and online reporting.
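To make "multidimensional analysis with standard SQL" concrete, here is a small sketch that aggregates one measure along two different dimensions. It uses Python's built-in sqlite3 module purely as a local stand-in, not Palo itself; the page_views table, its columns, and the views_by helper are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for an OLAP engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (province TEXT, device TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("Beijing", "mobile", 120),
        ("Beijing", "desktop", 80),
        ("Guangdong", "mobile", 200),
        ("Guangdong", "desktop", 50),
    ],
)

ALLOWED_DIMENSIONS = ("province", "device")


def views_by(dimension):
    """Total views grouped by one dimension, sorted by the dimension value."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unsupported dimension: {dimension}")
    # The column name is checked against a whitelist before interpolation.
    query = (
        f"SELECT {dimension}, SUM(views) FROM page_views "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(query).fetchall()


by_province = views_by("province")
by_device = views_by("device")
```

The same fact table answers both questions; an OLAP engine is designed to keep such GROUP BY queries fast at much larger scale through techniques like the column storage and materialized views mentioned above.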

BML

BML is a cloud-hosted distributed machine learning platform for massive data that helps customers apply cutting-edge machine learning technology to gain big data prediction and analysis capabilities. Built on the machine learning algorithm library (including deep learning) that Baidu has accumulated internally over the years, it is the first machine learning service in China. It covers the entire workflow of feature engineering, model training, model evaluation, and prediction services, with drag-and-drop operation. Distributed, fully in-memory clusters provide powerful computing capacity, so massive data can be processed with ease. It ships with a range of classification, clustering, regression, topic model, recommendation, and deep learning algorithms, and it provides complete solutions for digital advertising marketing, recommendation systems, text analysis, fault detection, and more, so that users can quickly apply machine learning to their business systems.

Applicable business scenarios include digital advertising marketing, product and merchant recommendations, and topic and summary extraction.

It is now difficult for many startups to build public cloud big data platforms. A public cloud combines data and applications and carries server costs, network costs, and so on; it is technically demanding and is essentially a game for the first tier of players. Organizations within the system, such as government and state enterprises, will of course have their own public cloud and do not use BAT. When companies the size of BAT build public clouds or public cloud big data platforms, what they stake is all the data. Data may increasingly become an asset in the future, and its role will become more and more important. Everyone may have their own data: every small restaurant or small company has its own customer data and operational data, all of which can be exchanged and put to use.

Guo Jiangliang believes that public cloud big data platforms hold many potential opportunities, and Baidu's data and public cloud teams are currently working on this. However, because Baidu is an integrated information marketplace, it is still lacking on the application side.
