【51CTO.com original article】

1. Big Data Framework Structure and Panoramic Overview

It seems that overnight, big data has become the most fashionable word in the IT industry.
Big Data

First of all, big data is not something completely new. Google's search service is a typical big data application: according to each user's needs, Google quickly finds the most likely answers in real time from the world's massive digital assets (or digital garbage) and presents them to you. This is the most typical big data service. It is just that in the past, data processing and commercially valuable applications at this scale were too rare for the concept to form in the IT industry. Now, with global digitalization, broadband networks, and the Internet reaching every industry, the accumulated volume of data keeps growing. More and more companies, industries, and countries have found that they can use similar technologies to serve customers better, discover new business opportunities, expand into new markets, and improve efficiency. Only then did the concept of big data gradually take shape.

Two examples help to understand big data:

1. Stock speculation. In 2011, Hollywood released Limitless, a film about superhuman intelligence. It tells the story of a down-and-out writer (Bradley Cooper's character) who takes a magical blue drug that rapidly boosts his intelligence, then uses his new IQ to play the stock market. How does he do it? In a short time he masters countless companies' information and backgrounds: he digs through the massive data that already exists in the world (company financial reports, TV news, two or three decades of newspapers, the Internet, gossip, and so on) and connects it all together; he even mines the huge volume of social data on Facebook and Twitter to gauge the public's sentiment toward a given stock. Through the mining and analysis of massive information, no inside information is inside information anymore.

2. Flight delays. As everyone knows, flight delays are common in China, while flights in the United States are far more punctual by comparison. Is that because China has more smog than the United States and many more days of bad weather? Of course not. One good practice of the US aviation authorities has played a positive role. It is simple to describe: the United States regularly publishes each airline's and each flight's delay rate and average delay time over the past year, so when customers buy tickets they naturally choose flights with high punctuality rates, which drives every airline to improve punctuality through market forces. This simple method is more direct and effective than any administrative measure (such as the Chinese government's macro-control measures).

Judging from these cases, big data is nothing magical. In the management of enterprises, industries, and countries, usually less than 20% of data (often far less) is effectively used. If the value of the remaining 80% of dormant data were awakened, what would the world become? Better, of course, and beyond your imagination. An individual data point has no value, but as more and more data accumulates, quantitative change leads to qualitative change. Yet no matter how much data there is, if it sits siloed and unused it is worthless. We therefore need to integrate and connect massive amounts of data to extract enormous commercial value from it. Big data is the next wave of applications in the deepening of the Internet and a natural extension of its development.
At present, it is fair to say that the development of big data has reached a critical point, which is why it has become one of the hottest words in the IT industry. Next, let's talk about the technical architecture of big data. When talking about big data, we have to talk about Hadoop, although big data is certainly not just Hadoop. Let's take a closer look at the overall framework and panorama of big data.

Hadoop is a software framework for the distributed processing of large amounts of data. It originally grew out of Google's MapReduce programming model: the MapReduce framework decomposes an application into many parallel computing tasks and runs them over very large data sets across a large number of computing nodes. A typical example of using this framework is a search algorithm running over web data. Hadoop was originally tied to web-page indexing and quickly developed into a leading platform for analyzing big data.

Figure 1: File system - data management - business computing - analysis tools

This picture is analyzed from bottom to top.

A. Basic file system

All cluster servers run the Ubuntu distribution of Linux, with Ext4 and NFS as the default file systems. Distributed file management uses HDFS from the Hadoop framework, the standard configuration of big data systems, and will not be introduced in detail here.

B. Data management

In terms of form, big data sources are mainly structured and unstructured (mostly text). Several systems are used to manage and retrieve all of this data.

Cassandra: handles storage, retrieval, and computing support for all structured big data (the basic source data), and can be easily scaled out to support hundreds of billions of records in the future. Compared with the common HBase solution, we chose Cassandra because it has advantages in reliability (a masterless architecture with no central node), community support, and cooperation with Spark, which suits HCR's business conditions better. A minimal sketch of querying Cassandra from Python appears after this list.

PostgreSQL/MySQL: open-source relational databases that store intermediate statistical results and business data. Even with Cassandra available, traditional relational databases remain important in the data system: the large volume of intermediate calculations and statistical results that researchers need is better suited to relational storage, and multi-field retrieval (difficult in Cassandra) is essential for multi-dimensional analysis. Cluster deployment plus partitioning makes billions of rows easy to handle.

Infobright: a structured data-warehouse solution with powerful data compression and aggregate statistics. The free community edition performs well up to tens of billions of records and suits multi-dimensional statistical analysis and deep drill-down over structured big data.

Elasticsearch: manages and retrieves all unstructured data (mainly unstructured business data and Internet data). Its distributed architecture handles data sets of tens of billions of documents well, it is easy to manage and use, and it has a rich ecosystem of extensions (such as Cassandra plug-ins).
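To make the data-management layer concrete, here is a minimal sketch of querying structured data in Cassandra from Python using the DataStax cassandra-driver. The contact points, keyspace, table, and column names are hypothetical placeholders, not details from the article.

```python
# A minimal sketch: reading structured records from Cassandra in Python.
# Requires the DataStax driver: pip install cassandra-driver
from cassandra.cluster import Cluster

# Cassandra is masterless, so any node can serve as a contact point.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])      # hypothetical node addresses
session = cluster.connect("hcr_data")            # hypothetical keyspace

# Retrieval is by partition key; ad-hoc multi-field filtering is hard in
# Cassandra, which is why the stack pairs it with a relational database.
rows = session.execute(
    "SELECT user_id, event_time, payload FROM events WHERE user_id = %s",
    ("u12345",)                                  # hypothetical table and key
)
for row in rows:
    print(row.user_id, row.event_time)

cluster.shutdown()
```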
C. Business computing

Business computing is the core of the big data technology system and supports all business-logic computation and analysis. There is a lot here, but I will focus on two parts.

Distributed computing system: Spark is used (rather than Hadoop MapReduce). Compared with MapReduce, Spark is more modern, lightweight, and efficient (especially when the business involves a lot of machine learning), code development is faster, and the skill requirements for staff are more uniform, all of which matter to us. Its Spark SQL module quickly implements SQL-like retrieval and analysis over big data, with stronger performance and functionality than Hive. The machine learning algorithms in the companion library MLlib are widely used for mining in the business and run much faster than Mahout on Hadoop. Together they effectively support business processing and analysis. A minimal Spark SQL sketch follows below.
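As an illustration of the SQL-like retrieval Spark SQL provides, here is a minimal PySpark sketch. The input path, view name, and columns are hypothetical, and a modern SparkSession is assumed rather than whatever version the original deployment used.

```python
# A minimal sketch of SQL-style retrieval over big data with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSqlSketch").getOrCreate()

# Load structured source data from HDFS (hypothetical path and schema).
df = spark.read.json("hdfs:///data/events.json")

# Register the DataFrame as a temporary view, then query it in plain SQL.
df.createOrReplaceTempView("events")
top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS event_count
    FROM events
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
""")
top_users.show()

spark.stop()
```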
Data-stream support tools: Kettle is a classic ETL tool used for fast ETL processing when various source data are brought in; its visual interface is easy to use. Kafka's publish/subscribe mechanism uniformly meets the need of multiple upper-level business models to share topic data streams. Storm is a stream-computing framework reserved for future real-time analysis needs; for now there are not many practical examples of it online.

D. Analysis tools

Analysis tools form the top layer of the HCR big data technology system and consist of various tools that let data researchers at all levels explore big data quickly.

As for the panoramic overview of big data, the picture needs little commentary: the IaaS, PaaS, and SaaS layers of the cloud are further decomposed and reconstructed.

Figure 2: Big Data Panorama Platform

2. Enterprise Big Data Scenarios and Integration of Different Data Sources

Analyzing and solving big data problems is often complex. If you have spent time researching enterprise big data solutions, you know it is no simple task. To tame this complexity, we generally classify the various components by their data source, which gives a logically clear architecture for the layers and high-level components involved in any big data solution. The following is an overall structure diagram of a platform-based enterprise moving from data sources, through cleaning and integration, to data analysis and processing, and on to data application services.

Figure 3: Platform-based Enterprises

From the red column on the left we can see that, in addition to internal structured data from production, sales, service, after-sales, and so on, today's enterprise data sources also include internal unstructured data such as social-media data sets, plus data and log information generated by external rich media. So how are different data sources integrated across different application scenarios?

Figure 4: Metadata

As the figure above shows:

The first step is to establish big data standards (standard definitions of data, encoding, and the attributes of business information).

Second, handle the exchange and integration of heterogeneous data according to agreed specifications.

Then share and publish data resources in a hierarchical, decentralized manner, opening compliant data for external use while keeping core data shared and unified.

Finally, make big data part of corporate assets and realize the application value of the data.

Collection and integration of different data sources: there are many data-collection systems on the market; the most widely used are Sqoop, Logstash, and Flume. Sqoop is generally used to import data from relational databases into HDFS; Logstash is generally used together with Elasticsearch and Kibana; and the most widely used and powerful is Flume. Flume is a distributed, reliable, and highly available system that efficiently collects, aggregates, and moves large amounts of data from different sources into centralized data storage, using transaction-based delivery to guarantee the reliability of event delivery.

Now let's look at how the collected data is used in a customer churn analysis system for the mobile business.

Figure 5
Figure 6

After data enters the Kafka cluster from the access system, it flows into a JStorm cluster for real-time processing and into YARN and HDFS clusters for offline processing. Real-time processing demands high stability and fast response, so we built a separate JStorm cluster: a dedicated cluster is easier to maintain and avoids resource contention destabilizing the real-time system, and JStorm also supports real-time computation at any time granularity we need. For the offline platform we chose YARN and HDFS, building support for different computing engines on top of YARN, including Spark, MapReduce, and Kylin for OLAP. Combining these engines meets all our data-processing needs. Finally, by analyzing users' mobile Internet behavior, a detailed profile of user interests and hobbies is formed. This supports churn behavior analysis, personalized recommendation, and precise real-time advertising, continuously reducing customer annoyance and complaints and increasing customer stickiness. A minimal sketch of consuming the Kafka stream appears below.
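To show what "entering the pipeline from Kafka" looks like in code, here is a minimal Python sketch using the kafka-python client. The topic name, broker addresses, and consumer group are hypothetical; the article does not specify them.

```python
# A minimal sketch: subscribing to a topic data stream from Kafka.
# Requires: pip install kafka-python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user_events",                                    # hypothetical topic
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    group_id="realtime-processing",                   # hypothetical group
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Independent consumer groups can read the same topic, which is how one
# Kafka stream can feed both real-time and offline processing paths.
for message in consumer:
    event = message.value
    print(message.topic, message.partition, message.offset, event)
```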
At the end of this chapter, let me summarize my views on big data.

First, big data enables enterprises to truly change from being self-centered to being customer-centered. Enterprises exist for customers, and their purpose is to earn profits for shareholders; only by serving customers well can profits be made. In the past, many enterprises could not be customer-centric because they held too little information about individual customers, mined it too shallowly, and lacked system support. Big data lets an enterprise's view of its business objects move from rough aggregates of customers (the so-called refined, summarized "customer groups") back to real, individual customers, so the business becomes targeted, service improves, and investment is more efficient.

Second, big data will, to a certain extent, overturn traditional enterprise management. Modern enterprise management works top-down, relying on hierarchical organizations and strict processes: information is collected and funneled upward so that correct decisions can be made, decisions are then transmitted and decomposed through the organization, and standardized processes ensure that decisions are executed, that every business activity has quality assurance, and that risk is avoided to some degree. This is useful but clumsy. In the era of big data, we may reconstruct how enterprises are managed: through the analysis and mining of big data, many business decisions can be made where the work happens, without relying on bloated organizations and complex processes.

Third, another major role of big data is to change business logic and offer the possibility of getting answers directly from other perspectives. Today, people's thinking and corporate decision-making are driven by a kind of logical force: we investigate, collect data, summarize, and finally form our own inferences and decisions, a business-logic process of observation, thinking, reasoning, and decision-making. Building such logic in people and organizations takes enormous learning, training, and practice. But is this the only way? Big data gives us another option: use the power of the data to get answers directly.

Fourth, through big data we may gain a completely new perspective for discovering business opportunities and reconstructing business models. Today we look at the world the way we judge spoiled food at home: mainly with our eyes and our experience. With a microscope, we could see the bad bacteria at once, and the analysis would be completely different. Big data is our microscope; it lets us discover new business opportunities from a completely new perspective and possibly rebuild business models.

Returning to the churn example: the quality of a churn model depends on customer attributes (customer master data such as birthday, gender, location, and income) together with the customer's social behavior and usage preferences. First, Flume is chosen as the collection system for the different data sources; Flume collects the data, writes it into multiple storage components simultaneously, and provides it to the computing layer as a data source. A hedged sketch of training such a churn model follows below.
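The article does not give the churn model's implementation, so the following is only a plausible sketch: a logistic-regression classifier in Spark MLlib (which the stack already uses), with hypothetical feature columns standing in for the customer attributes and behavior statistics described above.

```python
# A hedged sketch of a churn model in Spark MLlib (assumed approach; the
# article does not specify the algorithm). All columns and paths are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ChurnModelSketch").getOrCreate()

# Customer master data joined with behavior/usage statistics (hypothetical).
data = spark.read.parquet("hdfs:///data/churn_features")

# MLlib expects the features in a single vector column.
assembler = VectorAssembler(
    inputCols=["age", "income", "monthly_usage", "complaint_count"],
    outputCol="features",
)
dataset = assembler.transform(data)

train, test = dataset.randomSplit([0.8, 0.2], seed=42)

# 'churned' is a hypothetical 0/1 label marking customers who left.
lr = LogisticRegression(featuresCol="features", labelCol="churned")
model = lr.fit(train)

predictions = model.transform(test)
predictions.select("churned", "prediction", "probability").show(5)

spark.stop()
```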
3. Big Data Learning and High-Paying Job Hunting

Personal skills:

Familiarity with Linux, shell, Python, C/C++, and Java.
Familiarity with the relevant algorithms and technologies: common machine learning models, natural language processing, and data mining methods.
A love of learning and thinking, and the persistence to give it your all.

High-paying job search:

1. Experience is important. When asked what companies require when hiring IT staff with Hadoop skills, HR answered: "Experience is the most important." When you are eager to find a data-related job, you have already entered the big data job market; you will accumulate experience over time, but what you can show now matters most. If you have the necessary experience, present it appropriately.

2. Understand the common terms used in recruitment. In resumes, CVs, and other application documents, certain professional phrases attract the attention of recruiters and HR managers. Three representative examples: designing and building scalable distributed data stores, systems, and pipelines at scale; implementing a Hadoop cluster with xxxx nodes; building from scratch or from the ground up.

3. Keep following the evolving industry environment. In China, Hadoop is a relatively young technology, as is the whole big data industry, so following the latest industry trends and changes in a timely manner is especially important if you want to develop well in big data or a related field. Keep an eye on data conferences, such as the WOT Data Summit and the Hadoop Technology Conference. These conferences are very helpful for understanding cutting-edge technology, following the latest developments, and clarifying your personal direction.

Q&A

1. Hunan-Xiao Liu-Java: How much is the monthly salary for working in big data?
Teacher Sun: Starting at 15K per month, and 40K-50K for advanced roles.

2. Anhui-Mei Xuan-Python: Teacher, I want to find a job in machine learning, but I have no work experience and I haven't graduated yet, so finding a job in this field directly is a bit difficult. Should I first go elsewhere to accumulate work and project experience?
Teacher Sun: Yes. You need to take part in a complete project; then experience is easy to accumulate.
Anhui-Mei Xuan-Python: I learned Python first, but it didn't get me anywhere. Then I learned machine learning algorithms, which didn't either. Then I got familiar with cluster distribution and so on. I wanted to use the TensorFlow framework, but found my computer couldn't handle it.

3. Zhengzhou-Li Sai: Does self-directed practice count as experience? After all, it is hard for an individual to get a large amount of data, and a small amount is not convincing.
Teacher Sun: Yes, although without real scenarios you may not run into many real problems. Personal big data is indeed limited, but if you have 5,000 friends on WeChat, managing and mining that information every day can also count as personal big data.

4. Guangzhou-Diqin dQ-PHP: The first four viewpoints on big data are all about corporate business operations; for an individual there seems to be little connection. Where should we start in practice?
Teacher Sun: Start with Hadoop, understand its core, and then expand to the others.

5. Hunan-Xiao Liu-Java: Are there version requirements for the Python used in big data work? Is it Python 2.x or 3.x?
Teacher Sun: Versions move quickly; it is now basically 3.x. TensorFlow is a machine learning platform released by Google last year; it has higher machine-configuration requirements, so a high-end laptop is recommended.

6. Guangzhou-Diqin dQ-PHP: Any suggestions or study routes? Can Figure 1 be used as a reference for drawing up a study plan?
Teacher Sun: Yes, that diagram is very comprehensive. Start with Linux to lay the foundation and work upward from there.

7. Guangzhou-Fatty-Database: To learn big data well, you need to be proficient in Java and databases, as well as Linux and shell. And just knowing those four is not enough; the more solid your foundation, the more it will pay off later.

8. Guangzhou-Fatty-Database: Looking horizontally, each layer is an industry or a position.
Guangzhou-Diqin dQ-PHP: Judging from that picture, you can't claim competence without a few years of hard work... It seems these are not positions you are hired into directly; you have to work hard in your current position for a few years before you get the chance.

9. Guangzhou-yuliya-Operations: How do we obtain the underlying data, with secondary screening based on search?
Teacher Sun: There are many ways to obtain underlying data, such as network sniffing, protocol packet capture, and function callbacks, with secondary screening then applied on top of search.

[51CTO original article; when reprinting on partner sites, please indicate the original author and the source as 51CTO.com]