An article to help you understand big data mining technology!

For big data to generate value, how it is processed matters a great deal, and big data analysis and big data mining are the two most important parts of that processing. Earlier issues in this popular science series introduced big data analysis. This issue explains big data mining technology, so that everyone can easily understand what it is.

What is big data mining?

Data mining is the process of extracting implicit, unknown but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy and random data.

Data mining objects

According to the information storage format, the objects used for mining include relational databases, object-oriented databases, data warehouses, text data sources, multimedia databases, spatial databases, temporal databases, heterogeneous databases, and the Internet.

Data Mining Process

Define the problem: Clearly define the business problem and determine the purpose of data mining.

Data preparation: This step has two parts. Data selection extracts the target data set for mining from large databases and data warehouses. Data preprocessing cleans that data: checking its integrity and consistency, removing noise, filling in missing fields, deleting invalid data, and so on.

Data mining: Select an algorithm suited to the type of mining task and the characteristics of the data, then run it on the cleaned and transformed data set.

Result analysis: Interpret and evaluate the mining results and convert them into knowledge that users can ultimately understand.
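To make the data preparation step more concrete, here is a minimal Python sketch. The records, the field names (`age`, `spend`) and the rule that a negative `spend` is invalid are all made up for illustration; real preprocessing depends on the data at hand.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing field.
raw = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},
    {"id": 3, "age": 29, "spend": -5.0},  # invalid: negative spend
    {"id": 4, "age": 41, "spend": 200.0},
]

def prepare(records):
    """Delete invalid rows, then fill missing ages with the mean known age."""
    valid = [r for r in records if r["spend"] is not None and r["spend"] >= 0]
    known_ages = [r["age"] for r in valid if r["age"] is not None]
    fill = mean(known_ages)
    return [
        {**r, "age": r["age"] if r["age"] is not None else fill}
        for r in valid
    ]

clean = prepare(raw)
```

Filling with the mean is only one possible choice; dropping the record or using a median are equally reasonable depending on the business problem.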

Data mining classification

Direct data mining: The goal is to use the available data to build a model that describes one specific variable (which can be understood as an attribute of a table in a database, i.e. a column) in terms of the remaining data.

Indirect data mining: No specific target variable is singled out for the model to describe; instead, relationships are established among all the variables.

Data Mining Methods

Neural Network Methods

Neural networks have attracted more and more attention in recent years because of their good robustness, self-organizing adaptability, parallel processing, distributed storage and high fault tolerance, which makes them very suitable for solving data mining problems.

Genetic Algorithms

A genetic algorithm is a randomized search algorithm based on natural selection and genetic mechanisms in biology, and is a bionic global optimization method. The implicit parallelism of genetic algorithms, and the ease of combining them with other models, have made them widely used in data mining.
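As a rough illustration of selection, crossover and mutation, here is a small Python sketch. The bit-string encoding, the population size, the one-point crossover and the single-bit mutation are assumptions chosen for brevity, not something the article prescribes.

```python
import random

def evolve(fitness, length=12, pop_size=20, generations=40, seed=0):
    """Toy genetic algorithm over bit strings (all parameters are illustrative)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]      # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(length)       # mutation: flip one random bit
            child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Fitness here is simply the number of 1 bits in the string.
best = evolve(sum)
```

Because the fitter half always survives unchanged, the best fitness in the population never decreases from one generation to the next.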

Decision Tree Method

A decision tree is an algorithm commonly used in predictive models. It classifies large amounts of data according to a goal in order to find valuable, latent information. Its main advantages are that it is simple to describe and fast at classification, which makes it particularly suitable for large-scale data processing.
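The article does not say how a tree chooses its splits. One common approach, assumed here, is to pick the attribute with the highest information gain, as in ID3-style trees. The sketch below uses a made-up four-row table and only selects the root split rather than building a full tree.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy reduction obtained by splitting rows on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training data.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
```

In this toy table `outlook` separates the classes perfectly while `windy` tells us nothing, so `outlook` would become the root of the tree.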

Rough Set Method

Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. The rough set method has several advantages: no additional information is required; the expression space of input information is simplified; the algorithm is simple and easy to operate. The object of rough set processing is an information table similar to a two-dimensional relational table.
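The core idea can be sketched with the lower and upper approximations of a target set: objects that certainly belong to it, and objects that possibly belong to it, given only the attributes in the information table. The table and attribute names below are invented for illustration.

```python
from collections import defaultdict

def approximations(table, attrs, target_ids):
    """Return (lower, upper) approximations of target_ids under attrs."""
    # Objects with identical attribute values are indiscernible
    # and fall into the same equivalence class.
    classes = defaultdict(set)
    for obj_id, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj_id)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target_ids:   # class lies entirely inside the target
            lower |= block
        if block & target_ids:    # class overlaps the target
            upper |= block
    return lower, upper

# Hypothetical information table: object id -> attribute values.
table = {
    1: {"fever": "yes", "cough": "yes"},
    2: {"fever": "yes", "cough": "yes"},
    3: {"fever": "no", "cough": "yes"},
    4: {"fever": "no", "cough": "no"},
}
flu = {1, 3}
lower, upper = approximations(table, ["fever", "cough"], flu)
```

Objects 1 and 2 look identical in the table but only one of them is in the target set, so they appear in the upper approximation and not in the lower one. That gap is exactly the "rough" part.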

Covering positive examples and excluding negative examples

This method searches for rules that cover all positive examples and exclude all negative examples. First, pick a seed from the positive example set and compare it against the negative examples one by one. A selector that is compatible with the field value is discarded; otherwise it is kept. Looping over all positive example seeds in this way yields the rules for the positive examples (conjunctions of selectors).

Statistical analysis methods

There are two types of relationships between database field items: functional relationships and correlation relationships. Both can be analyzed with statistical methods, that is, by applying statistical principles to the information in the database. Typical techniques include common descriptive statistics, regression analysis, correlation analysis, and difference analysis.
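Correlation analysis and regression analysis can be illustrated with two small functions. This is only a sketch: the Pearson coefficient and a least-squares line are assumed here as the standard textbook choices, and the `ad_spend` and `sales` figures are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical columns from a database table.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
r = pearson(ad_spend, sales)
slope, intercept = fit_line(ad_spend, sales)
```

Here the sample data lie exactly on a straight line, so the correlation is essentially 1 and the fitted line recovers the underlying relationship.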

Fuzzy set methods

This method uses fuzzy set theory to perform fuzzy judgment, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis on practical problems. The more complex a system is, the fuzzier it is. Fuzzy set theory generally uses the degree of membership to characterize the "both this and that" nature of fuzzy things.
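Membership degree is easier to see with numbers. The sketch below assumes a triangular membership function, one common shape among many, and made-up temperature ranges for the fuzzy sets "warm" and "hot".

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at or outside a and c, rising to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy sets over temperature in degrees Celsius.
def warm(t):
    return triangular(t, 15.0, 25.0, 35.0)

def hot(t):
    return triangular(t, 25.0, 35.0, 45.0)

t = 30.0
degrees = {"warm": warm(t), "hot": hot(t)}
```

At 30 degrees the temperature belongs partly to "warm" and partly to "hot" at the same time, which is something a crisp yes-or-no set cannot express.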

Data mining tasks

Association analysis

When there is some regularity between the values of two or more variables, it is called an association. Data association is an important type of discoverable knowledge in databases. Associations are divided into simple associations, temporal associations, and causal associations. The purpose of association analysis is to find hidden association networks in the database. Generally, two thresholds, support and confidence (sometimes translated as credibility), are used to measure how relevant an association rule is. Further parameters such as interestingness and relevance continue to be introduced so that the mined rules better match actual needs.
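The two thresholds can be computed directly from transaction data. In this sketch the shopping baskets are invented, and the second measure is the one usually called confidence in the data mining literature.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing antecedent, the share that also contain consequent."""
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )

s = support({"bread", "milk"}, baskets)
c = confidence({"bread"}, {"milk"}, baskets)
```

Three of the five baskets contain both bread and milk, and three of the four baskets with bread also contain milk. A rule such as "bread implies milk" is kept only if both values clear the chosen thresholds.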

Cluster analysis

Clustering groups data into several categories according to similarity. Data in the same category are similar to one another, while data in different categories differ. Cluster analysis can establish macro-level concepts and discover data distribution patterns and possible relationships between data attributes.
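The article does not name a clustering algorithm. As one familiar example, here is a k-means style sketch on one-dimensional data, with the starting centers supplied by hand so that the result is repeatable. The data points are made up.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical measurements that visibly fall into two groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(points, centers=[1.0, 12.0])
```

Similarity here is simply distance on the number line; with real records the distance measure has to be chosen to fit the attributes.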

Classification

Classification finds a conceptual description of a category, one that captures the overall information of that type of data (the intensional description of the category), and uses this description to construct a model, generally expressed as rules or a decision tree. The classification rules are obtained by applying an algorithm to a training data set. Classification can be used for both rule description and prediction.

Prediction

Prediction uses historical data to find patterns of change, builds a model from them, and uses that model to predict the types and characteristics of future data. Prediction is concerned with precision and uncertainty, which are usually measured by the prediction variance.

Temporal patterns

A temporal pattern is a pattern with a high probability of recurrence, found by searching a time series. Like regression, it uses known data to predict future values, but the data here are distinguished by the time at which each value was observed.

Deviation Analysis

Deviations contain a lot of useful knowledge. The data in a database contain many anomalies, and discovering them is very important. The basic method of deviation detection is to look for differences between observed results and a reference.
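A simple way to compare observations with a reference is sketched below. It assumes the mean of the data as the reference and flags values more than a chosen number of standard deviations away from it. The threshold of 2.0 and the order counts are illustrative only.

```python
from statistics import mean, pstdev

def deviations(values, threshold=2.0):
    """Return values lying more than threshold standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily order counts with one unusual day.
daily_orders = [100, 102, 98, 101, 99, 100, 180]
anomalies = deviations(daily_orders)
```

In practice the reference might instead be a forecast, a previous period, or a business target, and the anomaly itself inflates the mean and standard deviation, so more robust references are often preferred.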

This article is compiled and published by (APP Top Promotion). Please indicate the author information and source when reprinting!
