An article to help you understand big data mining technology!

For big data to generate value, the way it is processed matters a great deal, and big data analysis and big data mining are the two most important parts of that processing. Earlier articles in this series introduced big data analysis. This one explains big data mining technology, so that you can easily understand what it is.

What is big data mining?

Data mining is the process of extracting implicit, unknown but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy and random data.

Data mining objects

Classified by how the information is stored, the objects that can be mined include relational databases, object-oriented databases, data warehouses, text data sources, multimedia databases, spatial databases, temporal databases, heterogeneous databases, and the Internet.

Data Mining Process

Define the problem: State the business problem clearly and determine the purpose of the data mining.

Data preparation: This covers two steps. Data selection extracts the target data set for mining from large databases and data warehouses. Data preprocessing cleans that data: checking its integrity and consistency, removing noise, filling in missing fields, deleting invalid records, and so on.

Data mining: Select an algorithm suited to the type of mining function and the characteristics of the data, then run it on the cleaned and transformed data set.

Result analysis: Interpret and evaluate the mining results and convert them into knowledge that users can ultimately understand.
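The preprocessing part of data preparation can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the field names (id, age, income), the rule that a record without an id is invalid, and the choice to fill missing values with the column mean are all hypothetical and do not come from the article.

```python
from statistics import mean

def prepare(records, numeric_fields=("age", "income")):
    """Drop invalid records, then fill missing numeric fields with the column mean."""
    # Delete invalid data: in this sketch a record without an id counts as invalid.
    valid = [dict(r) for r in records if r.get("id") is not None]
    for field in numeric_fields:
        known = [r[field] for r in valid if r.get(field) is not None]
        fill = mean(known) if known else None
        # Fill in missing fields with the mean of the known values.
        for r in valid:
            if r.get(field) is None:
                r[field] = fill
    return valid

raw = [
    {"id": 1, "age": 30, "income": 5000},
    {"id": 2, "age": None, "income": 7000},    # missing age
    {"id": None, "age": 41, "income": 6500},   # invalid: no id
    {"id": 3, "age": 50, "income": None},      # missing income
]
clean = prepare(raw)
for row in clean:
    print(row)
```

The function copies each record before changing it, so the raw data stays available for checking the result.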

Data mining classification

Direct data mining: The goal is to use the available data to build a model that describes one specific variable (which can be understood as an attribute of a database table, i.e. a column) for the remaining data.

Indirect data mining: No specific variable is singled out as the target for the model to describe; instead, relationships are established among all the variables.

Data Mining Methods

Neural Network Methods

Neural networks have attracted more and more attention in recent years because of their good robustness, self-organizing adaptability, parallel processing, distributed storage and high fault tolerance, which makes them very suitable for solving data mining problems.

Genetic Algorithms

A genetic algorithm is a randomized search algorithm modeled on biological natural selection and genetic mechanisms; it is a bionic global optimization method. Genetic algorithms are implicitly parallel and combine easily with other models, which is why they are widely used in data mining.
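To make the idea concrete, here is a small sketch of a genetic algorithm on a toy problem (OneMax: maximize the number of 1 bits in a bit string). The problem, the function name genetic_onemax, and all parameter values are assumptions chosen for illustration; the article does not prescribe any particular selection, crossover, or mutation scheme.

```python
import random

def genetic_onemax(n_bits=20, pop_size=30, generations=60, p_mut=0.02, seed=0):
    """Evolve bit strings toward all ones; returns the best individual found."""
    rng = random.Random(seed)
    fitness = sum  # OneMax: fitness is the number of 1 bits
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def select(population):
        # Tournament selection: keep the fitter of two random individuals.
        a, b = rng.choice(population), rng.choice(population)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            p1, p2 = select(pop), select(pop)
            cut = rng.randint(1, n_bits - 1)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best

best = genetic_onemax()
print(sum(best), "ones out of", len(best))
```

Because of elitism, the best fitness never decreases from one generation to the next, which is a common but optional design choice.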

Decision Tree Method

A decision tree is an algorithm commonly used in predictive models. It finds valuable, latent information by purposefully classifying large amounts of data. Its main advantages are that it is simple to describe and fast at classification, which makes it particularly suitable for large-scale data processing.
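The core step of building a decision tree is choosing a split. The sketch below picks the best threshold on a single numeric attribute using Gini impurity. Gini impurity is one common criterion and is an assumption here, as are the sample data and the function names; the article does not say which criterion it has in mind.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # Every distinct value except the largest is a candidate threshold.
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: customer age and whether the customer bought a product.
ages = [22, 25, 30, 35, 40, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, buys)
print(threshold, impurity)  # → 30 0.0
```

A full tree would apply the same search recursively to each side of the split until the groups are pure or too small.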

Rough Set Method

Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. The rough set method has several advantages: no additional information is required; the expression space of input information is simplified; the algorithm is simple and easy to operate. The object of rough set processing is an information table similar to a two-dimensional relational table.
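As an illustration of how rough sets work on such an information table, the sketch below computes the lower and upper approximations of a target set. The table contents, the attribute names, and the function name are hypothetical and used only for this example.

```python
from collections import defaultdict

def approximations(table, condition_attrs, target):
    """Return (lower, upper) approximations of the target set of object ids.

    table maps each object id to a dict of attribute values.
    """
    # Objects with identical values on the condition attributes are
    # indiscernible and fall into the same equivalence class.
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in condition_attrs)].add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # class lies entirely inside the target
            lower |= block
        if block & target:    # class overlaps the target
            upper |= block
    return lower, upper

table = {
    "p1": {"headache": "yes", "temp": "high"},
    "p2": {"headache": "yes", "temp": "high"},
    "p3": {"headache": "no", "temp": "high"},
    "p4": {"headache": "no", "temp": "normal"},
}
flu = {"p1", "p3"}
lower, upper = approximations(table, ["headache", "temp"], flu)
print(sorted(lower), sorted(upper))  # → ['p3'] ['p1', 'p2', 'p3']
```

Here p1 and p2 cannot be told apart by the two attributes, yet only p1 is in the target set, so both land in the upper approximation but not in the lower one.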

Covering positive examples and excluding negative examples

This method searches for rules by covering all positive examples while excluding all negative examples. First, pick a seed from the set of positive examples and compare it with the negative examples one by one. Selectors formed from field values that are compatible with a negative example are discarded; the others are retained. Looping over all positive seeds in this way yields the rules for the positive examples (conjunctions of selectors).

Statistical analysis methods

There are two kinds of relationship between the fields of a database: functional relationships and correlations. Both can be analyzed with statistical methods, that is, by applying statistical principles to the information in the database. Typical techniques include common summary statistics, regression analysis, correlation analysis, and difference analysis.
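Two of the techniques named above, correlation analysis and regression analysis, can be sketched with the standard formulas for the Pearson correlation coefficient and a least-squares line. The column names and sample values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical fields: advertising spend and resulting sales.
spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
r = pearson(spend, sales)
slope, intercept = linear_fit(spend, sales)
print(round(r, 6), round(slope, 6), round(intercept, 6))
```

A coefficient near 1 or -1 points to a strong linear correlation, while a value near 0 suggests no linear relationship between the two fields.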

Fuzzy set methods

These methods use fuzzy set theory to perform fuzzy judgment, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis on practical problems. The more complex a system is, the fuzzier it is. Fuzzy set theory generally uses membership degrees to capture the "both this and that" nature of fuzzy things.
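A membership degree is simply a number between 0 and 1. The sketch below uses a triangular membership function, one common shape, to show how a single temperature can belong to two fuzzy sets at once. The sets "warm" and "hot" and their boundaries are assumptions made for the example.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy sets over temperature in degrees Celsius.
def warm(t):
    return triangular(t, 15, 25, 35)

def hot(t):
    return triangular(t, 25, 35, 45)

print(warm(30), hot(30))  # → 0.5 0.5
print(warm(25), hot(25))  # → 1.0 0.0
```

At 30 degrees the temperature is partly warm and partly hot, which is exactly the kind of overlap a crisp yes-or-no set cannot express.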

Data mining tasks

Association analysis

When there is some regularity between the values of two or more variables, this is called an association. Data association is an important kind of knowledge that can be discovered in databases. Associations are divided into simple associations, temporal associations, and causal associations. The purpose of association analysis is to find the hidden network of associations in a database. The relevance of an association rule is generally measured with two thresholds, support and confidence. Additional measures such as interestingness and correlation have also been introduced so that the mined rules better match what users need.
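The two measures described above, support and confidence (the latter is sometimes translated as credibility), are easy to compute directly. The sketch below uses a tiny, made-up list of shopping transactions; the item names and function names are illustrative only.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = count(transactions, antecedent)
    if base == 0:
        return 0.0
    return count(transactions, set(antecedent) | set(consequent)) / base

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
print(support(transactions, {"bread", "milk"}))       # → 0.6
print(confidence(transactions, {"bread"}, {"milk"}))  # → 0.75
```

A rule such as "bread implies milk" would be kept only if both values reach the minimum thresholds the analyst has chosen.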

Cluster analysis

Clustering groups data into categories according to similarity: items in the same category are similar to each other, and items in different categories are dissimilar. Cluster analysis can establish macro-level concepts and reveal the distribution patterns of the data and possible relationships between data attributes.
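One widely used clustering technique is k-means, shown here in a deliberately simple one-dimensional form. The article does not name a specific clustering algorithm, so the choice of k-means, the sample points, and the starting centers are all assumptions for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Very small k-means for numbers on a line; returns (centers, clusters)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)   # → [1.5, 10.0]
print(clusters)  # → [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

Points in the same cluster end up close to each other and far from the other cluster, which matches the definition of clustering given above.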

Classification

Classification finds a conceptual description of a category. The description captures the overall information of that class of data, that is, its intensional description, and is used to construct a model, generally expressed as rules or as a decision tree. The classification rules are obtained by applying an algorithm to a training data set. Classification can be used both for describing rules and for prediction.

Prediction

Prediction uses historical data to find patterns of change, builds a model from them, and uses that model to predict the types and characteristics of future data. Prediction is concerned with accuracy and uncertainty, which are usually measured by the prediction variance.

Temporal patterns

A temporal pattern is a pattern with a high probability of recurrence that is found by searching time series. Like regression, it uses known data to predict future values; the difference is that the data points are distinguished by the time at which the variable was observed.

Deviation Analysis

Deviations contain a lot of useful knowledge. The data in a database holds many anomalies, and discovering them is very important. The basic method of deviation detection is to look for differences between the observed results and a reference.
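A simple way to compare observations with a reference is to use the mean of the data as the reference and flag values that lie too many standard deviations away from it. This z-score approach, the threshold of 2.0, and the sample readings are assumptions for this sketch; other references and tests are equally possible.

```python
from statistics import mean, pstdev

def deviations(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one obvious anomaly.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(deviations(readings))  # → [50]
```

Note that a single large anomaly also inflates the mean and the standard deviation, so more robust references such as the median are often preferred on real data.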

This article is compiled and published by (APP Top Promotion). Please indicate the author information and source when reprinting!