I have to point out that people who work with big data also need to build data warehouses. Whether you are in a traditional industry or at an Internet company, you need to pay attention to the data warehouse, rather than dismissing it just because you "do big data". A data warehouse is, above all, a way of managing and using data: a complete methodology covering ETL, scheduling, modeling and more. What we now call big data is largely an increase in data volume and a refresh of tooling; the two do not conflict, and in fact they combine well. Put another way, simply using Hadoop, Spark and Flume to process data just means learning a few new tools. That is tool-level development, and it covers only the ETL part of a data warehouse. Of course, a technology refresh can mark the start of an era, as the birth of Hadoop did, and studying a big data component in depth takes a great deal of time and energy. But once the hype cools down, we should think about how to better manage and use our data. For data practitioners it is important to keep up with technological change, but remember: data is king. When pushing technology to its limits, don't forget that we work in data.

Article topic

Enough preamble. This article focuses on one important part of the data warehouse: how to design the data layers. For other data warehouse topics, please refer to my other articles. The discussion of layering here fits the following scenarios; if you are interested in scenarios beyond this scope, or are already an experienced data warehouse practitioner, there is no need to spend time on this article.

- Data construction has just started, and most data is exposed to the business directly after only rough ingestion.
- Data construction has reached a certain stage, and you find that data usage is chaotic, with all kinds of business calculations run directly against the raw data.
- Repeated calculations are seriously wasting compute resources, and performance optimization is needed.
Article structure

When I first started working on data warehouses I ran into many pitfalls. With my limited resources, I felt there were few publicly shared, successful data warehouse experiences in the Internet industry, practical material was hard to find online, and the classic books were too theoretical; it was rough going. Fortunately I am past that hurdle now, so I am taking the time to organize my thoughts and help others avoid some of those pitfalls. This article covers:

- Why layer at all? Several people have questioned this, so the value of layering needs to be made clear.
- Let me share the classic data layering model, as well as the function of each layer of data and how it is processed.
- This article shares two data layer designs and uses these two practical examples to illustrate how to store data in each layer.
- Here are some suggestions, not the best, but they can be used as a reference.
Why layer?

One of the main reasons we layer data is that we want clearer control over it as we manage it. In more detail, the reasons are:

- Clear data structure: each layer has its own scope of responsibility, so tables are easier to locate and understand when we use them.
- Data lineage tracking: put simply, what we ultimately hand to the business is a table it can use directly, but that table is built from many sources. If a source table has a problem, we want to locate it quickly and accurately and understand its blast radius.
- Reduce duplicate development: Standardizing data stratification and developing some common middle-layer data can greatly reduce duplicate calculations.
- Simplify complex problems: break a complex task into multiple steps, with each layer handling a single step, which is simpler and easier to understand. It also makes it easier to maintain data accuracy: when a data problem occurs, you do not need to repair everything, only the problematic step onward.
- Shield downstream use from anomalies in the raw data.
- Shield against business changes, so that data does not have to be re-ingested every time the business changes.
The dependencies between tables in a data system are like wiring: we would all like it to stay neat and easy to manage, but the end result is usually the tangled mess rather than the tidy bundle.

How to layer: the theory

From a theoretical point of view, we can divide the data warehouse into three layers: the data operation layer, the data warehouse layer and the data product layer.

- ODS stands for Operational Data Store.
The data operation layer, also called the ODS layer, is the layer closest to the data sources. Data from the sources is extracted, cleaned and transmitted, that is, put through the legendary ETL, and then loaded into this layer. Data here is generally organized the way the source business systems classify it. For example, this layer might contain a series of raw business tables: a population table (ID number, name, address, etc.), airline boarding records (passenger ID number, flight number, boarding date, departure city, etc.), a UnionPay card-swipe table (bank card number, swipe location, swipe time, swipe amount, etc.), a bank account table (bank card number, cardholder ID number, etc.), and so on. As you can see, data at this level still carries clear traits of the business databases, and may even retain some of the normalized organization of a relational database. However, this layer is not simply the raw data. When source data is loaded here, a number of tasks are performed: denoising (for example, removing bank card records whose amounts deviate wildly from normal levels), deduplication (for example, both the bank account table and the public security bureau's population table contain people's names, but only one copy is kept), extracting dirty data (for example, if someone's card is stolen and there are two swipe records ten minutes apart in China and Japan, that is dirty data), business extraction, unit standardization, field pruning (for example, dropping fields that only support the front-end system and are not needed for data mining), business labeling, and so on. A minimal cleaning sketch follows below.
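To make those cleaning steps concrete, here is a minimal PySpark sketch over a hypothetical card-swipe source table. The table name, column names and the amount threshold (src.raw_card_swipes, card_no, the 10,000,000 cap) are all made up for illustration and are not from the original design.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ods_card_swipes").enableHiveSupport().getOrCreate()

raw = spark.table("src.raw_card_swipes")

ods = (
    raw
    # denoising: drop swipes whose amount is clearly outside a sane range
    .filter((F.col("amount") > 0) & (F.col("amount") < 10000000))
    # deduplication: keep a single record per card number and swipe time
    .dropDuplicates(["card_no", "swipe_time"])
    # field pruning: keep only the columns downstream work actually needs
    .select("card_no", "swipe_location", "swipe_time", "amount")
)

# land the cleaned data as Parquet under a per-day directory in the ODS area
ods.write.mode("overwrite").parquet("/warehouse/ods/card_swipes/dt=2017-07-20")
```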
- The data warehouse layer (DW) is the main body of the data warehouse.
Here, data obtained from the ODS layer is used to build data models organized by theme. For example, in a data set whose theme is studying people's travel and spending, the airlines' boarding records and the UnionPay swipe records can be combined and analyzed to produce one model. At this layer we need to understand four concepts: dimension, fact, metric and granularity. A small join sketch follows below.
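As a rough illustration of a theme-oriented model, the sketch below joins hypothetical ODS tables for boarding records, card swipes and bank accounts. Every table and column name here is an assumption, and the join is only one plausible way to define the fact's grain.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dw_travel_spend").enableHiveSupport().getOrCreate()

boarding = spark.table("ods.boarding_records")   # id_no, flight_no, boarding_date, depart_city
swipes   = spark.table("ods.card_swipes")        # card_no, swipe_location, swipe_time, amount
accounts = spark.table("ods.bank_accounts")      # card_no, holder_id_no

# grain: one row per card swipe, enriched with any flight the cardholder boarded that same day
travel_spend = (
    swipes
    .join(accounts, "card_no")
    .join(
        boarding,
        (accounts["holder_id_no"] == boarding["id_no"])
        & (F.to_date(F.col("swipe_time")) == boarding["boarding_date"]),
        "left",
    )
)

travel_spend.write.mode("overwrite").saveAsTable("dw.fact_travel_spend")
```

The choice of one row per card swipe is exactly the kind of granularity decision mentioned above.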
- The data product layer (APP) provides result data for data products.
Here the data mainly serves data products and data analysis. For online systems it is typically stored in Elasticsearch or MySQL; for data analysis and data mining it often sits in Hive or Druid. The report data we often talk about, or the big wide tables, usually live here.

Technical practice

The three layers above are fairly coarse-grained, and we will break them down in more detail later. Before that, let's look at how data typically flows between layers. Here we only briefly introduce a few commonly used tools, focusing on the open source mainstream.

- Data source layer -> ODS layer
This is the main battlefield where our big data technology comes into play. Our data comes mainly from two sources:
- Business databases: here we often use Sqoop to extract them, for example once a day. For real-time needs, you can consider using Canal to monitor the MySQL binlog and ingest changes as they happen.
- Tracking (buried-point) logs: the online systems write out all kinds of logs, generally saved as files. We can use Flume to collect them periodically, or use Spark Streaming or Storm for real-time ingestion; Kafka, of course, plays a key role here. A small ingestion sketch follows this list.
- Other data sources are more diverse, which is related to specific business and will not be elaborated here.
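As one possible shape for the log-ingestion path just described, here is a hedged Spark Structured Streaming sketch that reads a Kafka topic and lands the raw lines as Parquet in an ODS directory. The broker addresses, topic name and paths are placeholders, and Flume or Storm could fill the same role.

```python
from pyspark.sql import SparkSession

# requires the spark-sql-kafka connector on the classpath when the job is submitted
spark = SparkSession.builder.appName("tracking_log_ingest").getOrCreate()

logs = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "app_tracking_log")
    .load()
    .selectExpr("CAST(value AS STRING) AS raw_line", "timestamp AS recv_time")
)

query = (
    logs.writeStream
    .format("parquet")
    .option("path", "/warehouse/ods/tracking_log")
    .option("checkpointLocation", "/warehouse/ods/tracking_log/_checkpoints")
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()
```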
Note: this stage is not just simple data ingestion; it also involves cleaning, such as handling abnormal fields, standardizing field names, unifying time fields, and so on. These steps are easy to overlook, but they are crucial, and they pay off later when we automatically generate features (there will be a separate article on that). Moving the data onward from here, processing falls mainly into two types:

- Daily scheduled tasks: for example, the typical nightly job that computes the previous day's data in the early morning so the reports can be read in the morning. These tasks are usually written with Hive, Spark or MapReduce, and the results are written to Hive, HBase, MySQL, Elasticsearch or Redis. A batch sketch follows this list.
- Real-time data: this part mainly serves various real-time systems, such as real-time recommendation and real-time user profiling. We generally compute with Spark Streaming, Storm or Flink, and the results land in Elasticsearch, HBase or Redis.
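A minimal sketch of the daily batch pattern above, assuming hypothetical dwd.orders and dw.daily_order_summary tables partitioned by dt; in practice a scheduler would trigger the job in the early morning with the previous day's date.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily_order_report").enableHiveSupport().getOrCreate()

# recompute yesterday's figures and overwrite only that day's partition of the result table
spark.sql("""
    INSERT OVERWRITE TABLE dw.daily_order_summary PARTITION (dt = '2017-07-19')
    SELECT user_id,
           COUNT(*)    AS order_cnt,
           SUM(amount) AS order_amount
    FROM   dwd.orders
    WHERE  dt = '2017-07-19'
    GROUP  BY user_id
""")
```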
A worked example

The original design was divided into six layers in total; after setting metadata aside, five remain. Below is a walkthrough of the original design, layer by layer.

Buffer

- Concept: also known as the interface layer (stage), it stores daily incremental data and change data, such as the business change logs received from Canal.
- Data generation method: source data is received directly from Kafka, and a daily incremental table is generated for each business table that has update, delete and insert changes; insert-only business tables skip this layer and load directly into the detail layer.
- Proposed plan: only the Canal logs go directly into the buffer layer; if other businesses also have zipper (chain-style) data, it goes into the buffer layer as well.
- Log storage method: Use Impala external table and parquet file format to facilitate reading of data that needs to be processed by MR.
- Log deletion method: could be kept long-term, or trimmed to only the last few days. Proposed plan: simply keep it long-term.
- Table schema: Partitions are generally created by day.
- Naming of the library and table. Library name: buffer; table name: the initially proposed format is buffer_<date>_<business table name>, to be finalized. A rough DDL sketch for such a table follows this list.
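The DDL sketch below, run through Spark SQL, shows one plausible shape for such a day-partitioned external Parquet table; the database, table, columns and HDFS path are illustrative rather than taken from the original design.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("buffer_ddl").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS buffer.buffer_orders (
        op_type  STRING COMMENT 'insert / update / delete, as reported by canal',
        op_ts    STRING COMMENT 'change timestamp taken from the binlog',
        payload  STRING COMMENT 'raw change record'
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION '/warehouse/buffer/orders'
""")
```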
Detail layer (ODS, Operational Data Store; DWD, data warehouse detail)

- Concept: the detailed data layer of the warehouse. It consolidates the stage-layer data and reduces the complexity of later extraction. The ODS/DWD information model is organized mainly along the lines of the enterprise's business transaction processing, bringing the various domains' data together; its granularity is the same as the stage layer's, and it serves as a shared resource for analysis.
- Data generation method: some data comes directly from Kafka; other data is synthesized from interface-layer data combined with historical data. How to synthesize tables from the Canal logs is still being worked out.
- Proposed plan for Canal data: every day, merge yesterday's new data with the full detail-layer data from the day before to produce a new full table that overwrites the old one. At the same time, keep historical mirrors by snapshotting the new table on a weekly/monthly/yearly basis. A merge sketch follows this list.
- Log storage method: data landed directly uses Impala external tables in the parquet file format; data synthesized from the Canal logs is derived data, so internal tables are recommended for it. The layers below this one are all generated within Impala, so internal tables with static/dynamic partitions are recommended there.
- Log deletion method: long-term storage.
- Table schema: Generally, partitions are created by day. If there is no concept of time, partition fields are selected according to specific business.
- Naming of the library and table. Library name: ods; table name: the initially proposed format is ods_<date>_<business table name>, to be finalized.
- Old data update method: direct overwrite.
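As promised above, here is a hedged sketch of the "prior full table plus yesterday's changes" merge for Canal data. The key column (order_id), the op_type/op_ts fields and the table names are assumptions about how the change records might look; the delta is assumed to carry the same business columns as the full table plus the two change-tracking fields.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("ods_orders_merge").enableHiveSupport().getOrCreate()

full_prev = spark.table("ods.orders_full")       # full data as of the day before yesterday
delta     = spark.table("buffer.buffer_orders")  # yesterday's insert/update/delete changes

# for every primary key keep only the newest change from the delta
w = Window.partitionBy("order_id").orderBy(F.col("op_ts").desc())
latest_delta = (
    delta.withColumn("rn", F.row_number().over(w))
         .filter("rn = 1")
         .drop("rn")
)

# rows untouched by the delta stay as they were; changed rows are replaced, deleted rows are dropped
unchanged = full_prev.join(latest_delta.select("order_id"), "order_id", "left_anti")
new_full = unchanged.unionByName(
    latest_delta.filter("op_type <> 'delete'").drop("op_type", "op_ts")
)

# write the new full table, then swap it into place; periodic snapshots become the historical mirrors
new_full.write.mode("overwrite").saveAsTable("ods.orders_full_new")
```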
Light aggregation layer (MID or DWB, data warehouse basis)

- Concept: the light summary layer is a transitional layer between DWD and DM in the warehouse. It lightly integrates and summarizes the DWD layer's production data (this can include complex cleaning and processing, such as generating session data from PV logs). The main difference between this layer and DWD is their fields of application: DWD's data comes from the production systems and is not pre-aggregated for particular analytical needs, whereas the light summary layer performs fine-grained statistics and accumulation specifically for analytical applications.
- Data generation method: the detail layer produces light summary tables according to specific business needs. Data in the detail layer that requires complex cleaning or MR-style processing is also processed here and lands in the light summary layer. A small aggregation sketch follows this list.
- Log storage method: internal table, parquet file format.
- Log deletion method: long-term storage.
- Table schema: Generally, partitions are created by day. If there is no concept of time, partition fields are selected according to specific business.
- Naming of the library and table. Library name: dwb; table name: the initially proposed format is dwb_<date>_<business table name>, to be finalized.
- Old data update method: direct overwrite.
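The sketch below shows the kind of light aggregation this layer performs, rolling hypothetical DWD page-view logs up into one row per session; all names (dwd.pv_log, session_id, dwb.session_summary) are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dwb_sessions").enableHiveSupport().getOrCreate()

pv = spark.table("dwd.pv_log").filter("dt = '2017-07-19'")

# roll page views up to one row per session: a light, reusable aggregation
sessions = (
    pv.groupBy("dt", "user_id", "session_id")
      .agg(
          F.count(F.lit(1)).alias("pv_cnt"),
          F.min("event_time").alias("session_start"),
          F.max("event_time").alias("session_end"),
      )
)

# in a real job you would overwrite only this day's partition rather than the whole table
sessions.write.mode("overwrite").saveAsTable("dwb.session_summary")
```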
Theme layer (DM, data mart, or DWS, data warehouse service)

- Concept: also called the data mart or wide-table layer. Divided along business lines such as traffic, orders and users, it produces wide tables with many fields to support subsequent business queries, OLAP analysis, data distribution, and so on.
- Data generation method: computed from light summary layer and detail layer data. A wide-table sketch follows this list.
- Log storage method: Use Impala internal table and Parquet file format.
- Log deletion method: long-term storage.
- Table schema: Generally, partitions are created by day. If there is no concept of time, partition fields are selected according to specific business.
- Naming of the database and table. Database name: dm; table name: the initially proposed format is dm_<date>_<business table name>, to be finalized.
- Old data update method: direct overwrite.
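As a small illustration of producing a mart-layer wide table, the sketch below joins a hypothetical light-summary table with user details from the detail layer; every table and column name is invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dm_orders_wide").enableHiveSupport().getOrCreate()

# widen the order summary with user attributes so downstream queries need no further joins
wide = spark.sql("""
    SELECT o.dt,
           o.user_id,
           u.user_name,
           u.register_city,
           o.order_cnt,
           o.order_amount
    FROM   dwb.daily_order_summary o
    LEFT   JOIN ods.user_info u ON o.user_id = u.user_id
    WHERE  o.dt = '2017-07-19'
""")

wide.write.mode("overwrite").saveAsTable("dm.orders_wide")
```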
Application layer (APP)

- Concept: the application layer holds results computed from the previous three layers of data according to business needs. It can be served directly for query and display, or imported into MySQL for use (an export sketch follows this list).
- Data generation method: generated from the detail layer, light summary layer and data mart layer; as a general rule the data should come mainly from the mart layer.
- Log storage method: Use Impala internal table and Parquet file format.
- Log deletion method: long-term storage.
- Table schema: Generally, partitions are created by day. If there is no concept of time, partition fields are selected according to specific business.
- Naming of the library and table. Library name: tentatively apl; depending on the business, more than one library is allowed.
- Old data update method: direct overwrite.
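A minimal sketch of pushing an application-layer result into MySQL over JDBC for a report system; the JDBC URL, credentials and table names are placeholders, and the MySQL JDBC driver would need to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app_export").enableHiveSupport().getOrCreate()

report = spark.table("apl.daily_gmv_report").filter("dt = '2017-07-19'")

# append yesterday's report rows into the MySQL table the report system reads from
(report.write
    .format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/report")
    .option("dbtable", "daily_gmv_report")
    .option("user", "report_rw")
    .option("password", "******")
    .mode("append")
    .save())
```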
How to be more elegant

The design above is quite detailed, but it arguably has one layer too many, and it can be confusing to decide which layer a given table belongs in. In this section we redesign the hierarchy, adding dimension tables and some temporary tables to make the solution more elegant. Compared with the previous section we make a few small changes: the Buffer layer is removed, the data mart layer and the light summary layer are placed at the same level, and dimension tables and temporary tables are split out on their own. The roles of DWS, DWD, DIM and TMP are as follows.

- DWS: the light aggregation layer. It makes a preliminary summary of user behavior on top of the ODS layer, abstracts out some common dimensions such as time, IP and ID, and computes statistics over those dimensions, for example the number of products a user bought from each login IP in each time period (a sketch follows this list). A light aggregation layer here makes computation much more efficient: building only 7-day, 30-day and 90-day figures on top of it is far faster than going back to the raw data. We hope 80% of the business can be computed from the DWS layer rather than from ODS.
- DWD: this layer mainly addresses data quality and data completeness issues. For example, user information comes from many different tables and often suffers from delays or missing data. To make the data easier for everyone to use, we can shield those problems at this layer.
- DIM: this layer is relatively simple; reference data such as country codes and country names, geographic locations, Chinese names, national flag images and the like live in the DIM layer.
- TMP: Each layer of calculation will have many temporary tables. A DWTMP layer is specially set up to store the temporary tables of our data warehouse.
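To tie the DWS description above to something concrete, here is a rough sketch of a per-day summary by user and login IP, from which 7/30/90-day figures could later be rolled up; every table and column name is assumed rather than taken from the article.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dws_user_buy").enableHiveSupport().getOrCreate()

orders = spark.table("dwd.order_events").filter("dt = '2017-07-19'")

# summarize purchases per user, login IP and day; longer windows can be built from this table
dws = (
    orders.groupBy("dt", "user_id", "login_ip")
          .agg(
              F.count(F.lit(1)).alias("buy_cnt"),
              F.sum("amount").alias("buy_amount"),
          )
)

dws.write.mode("overwrite").saveAsTable("dws.user_buy_daily")
```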
Summary

Data layering is a very important part of the data warehouse. It is not just a question of where to draw the layers; it directly affects a whole series of later work such as lineage analysis, automated feature generation and metadata management, so it is worth thinking about as early as possible. Also, don't worry too much about what each layer is called; name the layers however you prefer. This article shares the author's own understanding of and thinking about data warehouses and may not be entirely accurate; if you have questions, feel free to reach out. There should be roughly three more data warehouse topics to share: lineage analysis, automated feature generation and metadata management. Once those are done, the data warehouse series will come to an end.

References:
1. Data Warehouse
2. The Data Warehouse Toolkit
3. Guidance from Winston and Ruby