Guo Jiangliang, R&D manager of the Baidu Open Cloud Big Data Platform, gave a talk titled "Big Data Platform in Baidu Private Cloud and Open Cloud" at the "CTO Training Camp Fourth Class Baidu Technology Special" hosted by 51CTO Gaozhao. The talk covered the large-scale distributed computing technology in Baidu's private cloud, the big data products and technical architecture of Baidu Open Cloud, and Baidu's current thinking on open cloud + big data + industry.

Baidu has accumulated product ideas and experience in combining cloud computing and big data with industries such as finance and medical care. It first worked on distributed storage and, in more recent years, on distributed computing such as Hadoop. In 2014 Baidu launched a public cloud business for external customers: an open cloud offering enterprise-level services similar to Alibaba Cloud and AWS. The public cloud is Baidu's internal risk control architecture extended to provide services to external enterprises.

Distributed computing in private cloud

Private cloud distributed computing technology stack

Distributed computing is built on top of Matrix's resource scheduling. Between distributed computing and the underlying resources sits what amounts to an adapter; taken together, these two correspond to what the community provides. Above them are the computing engines, covering offline computing and real-time computing. The two underlying resources are real-time computing platforms. In the middle is DCE, a computing engine similar to Hadoop. Next to it is the ELF platform, and on the far right is Spark.

All of Baidu's machines have been pooled as resources, and all offline resources are managed this way. Baidu is now gradually bringing some online business in as well. Because online and offline business differ, Baidu's data centers are also separated into online and offline data centers.

Why did Baidu develop all of this in-house?
Because Baidu had its own internal demands, and other issues were involved as well. Baidu is a big data company built on search, not just a data company. The data challenges it faces are enormous and exceed the problems the community faces. Baidu borrowed ideas from the community at first, but its demand grew so large and so fast that the community could not keep up, so it took the road of in-house development. This is, in effect, Baidu's entire technology stack. Apart from Google's, Baidu's is probably the world's largest offline computing cluster, or offline computing platform. It started from Hadoop and gained many C++ extensions along the way, because many performance problems had to be solved.

Baidu Offline Computing
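To make the Hadoop-style offline computing model mentioned above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the job. It is only an illustration in plain Python: the function names are hypothetical, and it does not represent DCE or any other Baidu implementation.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data platform", "open cloud big data"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run in parallel on many machines and the shuffle moves data across the network; the sketch keeps only the data flow.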
Baidu Real-time Computing
Another consideration is that building these platforms requires technical experts, cluster resources, networking, and so on, so the cost is relatively high. If you do not want to build them yourself, you can choose a public cloud such as Baidu Open Cloud.

Open cloud and big data platform

Baidu Open Cloud can be used for data applications to support R+ user data. The main targets are apps such as Baidu Mobile and Baidu Maps. All data is collected and processed in a unified way, so it is backed by multiple products and professional technical experts.

Baidu Open Cloud Product Overview

Big Data Processing

When there is a data requirement, a complete process is planned, from data collection to storage, possibly with a transmission step in between. Collection, transmission, storage, processing and transformation, and finally analysis and application form one complete flow. Data today is different from before, though. For example, CRM now has more and more data types as the Internet moves to the mobile Internet, and the timeliness requirements are high, so collecting and transmitting data quickly is itself a problem.

Collection: Baidu faces raw data that varies in format, location, storage, timeliness, and so on. It collects data from heterogeneous sources and converts it into the appropriate format for processing.

Storage: there are many requirements. Some industries have special needs, such as the genetics industry with genomic big data and sequencing. Sequencing one person's genetic data means uploading many GB, which is a large amount. There is also a timeliness requirement. For example, radio and television broadcasters depend on the network, and they also have networks of their own, all of which are online. In addition, there is hard disk delivery. For existing data, a hard disk is the more convenient way.
Of course, this involves data security, encryption, and even protocols. It may be done by shipping hard disks by courier. Once all the existing data has been uploaded, subsequent incremental data can trickle in over the public network, and the upload can continue slowly and steadily even after a power interruption. This is storage. Because this is big data processing, the data must be stored first. Collected data is placed in suitable storage according to requirements of cost, format, query, business logic, and so on, to facilitate further analysis.

Transformation: raw data must be transformed and enriched before it is suitable for analysis. For example, IP addresses in web logs are replaced with provinces and cities, sensor data is corrected, user behavior statistics are computed, and so on.

Analysis: organize the data to analyze what happened, why it happened, what is happening, and what will happen, and keep asking questions like these to help companies make decisions.

Baidu Open Cloud Big Data Stack

Advantages

Relying on Baidu technology: Baidu Search indexes more than one trillion web pages worldwide and handles billions of requests from Chinese netizens every day. Its big data technology supports more than 20 products with more than 100 million users, and millions of corporate customers. In 2013, Baidu built the world's largest Hadoop cluster. In 2014, BaiduSort, Baidu's big data processing capability, won the championship in the international sorting competition.

Open source: offering hosted open source products, or products with fully compatible interfaces, makes migration smooth for Internet companies and traditional enterprises, and users need not worry about being locked in to a specific platform or technology.

Advanced products: we strengthen open source products to make them more stable, efficient, and secure, greatly improving their maturity.
Cloud-based hosting lets users focus on their business rather than on fixing defects and operations. Our products have been proven inside Baidu and are suitable for enterprise deployment in production environments.

BMR

BMR is a hosted Hadoop/Spark service. It is the first big data service in China that is fully compatible with open source Hadoop, making it easy to process big data with MapReduce, Spark, HBase, Hive, Pig, Kafka, and so on. A cluster can be created in a few minutes with no need to worry about node allocation, deployment, or tuning, and rich examples and scenario tutorials help users get started quickly and reach their business goals. Clusters can be large or small and support dynamic scaling, which avoids wasted resources. Computing and storage are separated, so a cluster can process data that is stored on the BOS cloud storage service. BMR is fully compatible with the open source community versions of Hadoop/Spark: customers write jobs against the standard open source APIs and migrate to the cloud without any modification. Key components in the cluster such as Hadoop, Spark, and HBase support high availability to keep the service available. Applicable business scenarios include log analysis, data organization, and real-time stream processing.

Palo

Palo is a PB-level online analytical processing (OLAP) engine that provides online reporting and multidimensional analysis services with stability, efficiency, and low cost. It offers an industry-leading MPP query engine, columnar storage, intelligent indexing, and vectorized execution; it is highly compatible with the SQL standard and provides advanced analysis features such as in-database analysis and window functions. Data and metadata are stored in multiple replicas, so query service is not affected by downtime, and replicas on failed machines are migrated automatically.
Materialized views can be created and table schemas changed without stopping the service, and flexible, efficient data recovery is supported. Palo also offers visual cluster management, convenient data import, and standard SQL operations. Applicable business scenarios include online analysis, multidimensional analysis, and online reporting.

BML

BML is a cloud-hosted distributed machine learning platform for massive data that helps customers easily use cutting-edge machine learning technology to gain big data prediction and analysis capabilities. Built on the machine learning algorithm library (including deep learning) that Baidu has accumulated internally over the years, it is the first machine learning service in China. It covers the whole workflow of feature functions, model training, model evaluation, and prediction services, with drag-and-drop operation. Distributed, fully in-memory clusters provide powerful computing capacity, so massive data can be processed easily. It includes a range of classification, clustering, regression, topic model, recommendation, and deep learning algorithms. It provides complete solutions for digital advertising marketing, recommendation systems, text analysis, fault detection, and more, which let users apply machine learning to their business systems quickly. Applicable business scenarios include digital advertising marketing, product and merchant recommendations, and topic and summary extraction.

It is currently hard for many startups to build public cloud big data platforms. Public cloud combines data and applications and carries server costs, network costs, and so on; it is technically difficult and is basically limited to the first batch of entrants. Of course, those within the system, such as government and enterprises, will have their own public cloud and do not use BAT. For companies the size of BAT that build public clouds or public cloud big data, the cost lies entirely in the data.
Data may increasingly become an asset in the future, and its role will become more and more important. Everyone may have their own data; every small restaurant or small company has its own customer data and operational data, all of which can be exchanged to create value. Guo Jiangliang believes that public cloud big data platforms hold many potential opportunities, and Baidu's data and public cloud teams are currently working on this. However, because Baidu is an integrated information marketplace, it still falls short on the application side.
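The transformation step described under Big Data Processing (replacing IP addresses in web logs with provinces and cities), followed by a simple analysis, could look roughly like the sketch below. The lookup table, record fields, and function names are all assumptions made for illustration, and the address ranges are reserved documentation ranges rather than real geographic mappings.

```python
import ipaddress
from collections import Counter

# Hypothetical lookup table: network prefix -> (province, city).
# These are reserved documentation ranges, not real mappings.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), ("Beijing", "Beijing")),
    (ipaddress.ip_network("198.51.100.0/24"), ("Guangdong", "Shenzhen")),
]


def locate(ip_text):
    """Return (province, city) for an IP string, or 'unknown' values."""
    ip = ipaddress.ip_address(ip_text)
    for network, place in GEO_TABLE:
        if ip in network:
            return place
    return ("unknown", "unknown")


def transform(record):
    """Return a copy of a log record with 'ip' replaced by province and city."""
    province, city = locate(record["ip"])
    enriched = {key: value for key, value in record.items() if key != "ip"}
    enriched["province"] = province
    enriched["city"] = city
    return enriched


def visits_by_province(records):
    """Analysis step: count transformed records per province."""
    return Counter(transform(record)["province"] for record in records)


if __name__ == "__main__":
    logs = [
        {"ip": "192.0.2.10", "path": "/"},
        {"ip": "198.51.100.7", "path": "/maps"},
        {"ip": "192.0.2.99", "path": "/search"},
    ]
    print(visits_by_province(logs))
```

A production pipeline would use a maintained IP geolocation database and run the transformation as a distributed job; the sketch only shows the shape of the step.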