A brief talk on the routines of data mining competitions and the limitations of deep learning

Preface

During the summer vacation I took part in the Zillow Prize competition on Kaggle, which completed my transformation from a complete novice into a Level-1 player in data mining and machine learning. I would like to use the Zhihu platform to summarize my competition experience. Let me state up front that this is not one of those articles that merely lists buzzwords: every point below comes with an explanation. Please feel free to use it and to discuss it.

Secondly, I want to emphasize that this article does not promise to carry you into the top 1% of Kaggle, nor does it promise that you will understand data mining and machine learning after reading it. This summary is only for readers with the following problems:

  • I have read plenty of strategy articles on the Internet, so why do I still charge in as fiercely as a tiger, only to end up with a 0:5 scoreline?

  • Deep learning has achieved remarkable results in recent years, so why did my score fall to 0:6 after I applied deep learning to certain problems?

These two questions will be addressed as we walk through the entire process, so let us first summarize the general workflow, which is nothing more than:

  • Data preprocessing

  • Feature Engineering

  • Model training and selection (the possible limitations of deep learning will be discussed here)

  • Model Fusion

I discuss each of these below.

Data preprocessing

I will not discuss specific techniques here, because those terms are old hat: normalization and standardization are probably the first lesson in any data mining course. Here we will only discuss two questions: do you know when to use which preprocessing technique, and do you know what the goal of preprocessing is?

In my own understanding, the data itself determines the upper limit of the model's performance; the model and algorithm merely approximate that upper limit. This corresponds to the "garbage in, garbage out" (GIGO) principle found in most data mining textbooks. Therefore, the purpose of data preprocessing (and feature engineering) is to raise the upper limit that the model can approach and to improve the model's robustness.

Once you understand these two goals, it is easy to infer what needs to be done in the preprocessing stage.

How to improve the model's approximation limit?

  1. Extract better features (covered in the next section on feature engineering)

  2. Filtering and denoising are also required in certain areas

  3. Throw away outliers that interfere with model fitting

How to improve the robustness of the model?

  • Missing values in numerical data can be handled in various ways, including filling with a default value, the mean, the median, or linear interpolation (see the sketch after this list).

  • Text data can be processed at different granularities such as characters, words, and sentences.

  • It is also common to rotate images by 90°, 180°, or 270° (or flip them) before feeding them into the model.

In general, what is needed here is to describe the data from different angles and slightly augment the original data.
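As a minimal sketch of the numerical missing-value options above (the DataFrame and column name are hypothetical, not from the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with gaps
df = pd.DataFrame({"tax_value": [120_000, np.nan, 98_000, np.nan, 150_000]})

# Several ways to fill the same column; pick one per feature
filled_default = df["tax_value"].fillna(0)                          # default value
filled_mean    = df["tax_value"].fillna(df["tax_value"].mean())     # mean
filled_median  = df["tax_value"].fillna(df["tax_value"].median())   # median
filled_interp  = df["tax_value"].interpolate(method="linear")       # linear interpolation
```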

Feature Engineering

Feature engineering includes two parts: feature extraction and feature selection. There is plenty of material about feature selection on the Internet, so here we only discuss feature extraction. Why? Because even a skilled cook cannot make a meal without rice: without features, what is there to select from?

  • The first method is called extraction based on business common sense.

To put it simply, this means encoding prior knowledge about the field you are classifying or predicting in. For example, to distinguish boys from girls, using the sex feature directly will certainly give higher accuracy than other data such as height and weight. Of course, nobody can be proficient in every domain, so when facing an unfamiliar business field it is recommended to first extract features of the form X1/X2 (X1 and X2 here are not necessarily single variables; they may also be formulas), because traditional statistics is particularly fond of using X1/X2 to form indicators called "*** rate", and these ratios can sometimes work wonders.
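As a minimal illustration of such ratio features (the column names below are hypothetical, not taken from any particular competition):

```python
import pandas as pd

# Hypothetical housing-style data
df = pd.DataFrame({
    "tax_amount":  [3200.0, 4100.0, 2800.0],
    "tax_value":   [260_000.0, 310_000.0, 240_000.0],
    "living_area": [120.0, 150.0, 95.0],
    "lot_area":    [300.0, 280.0, 400.0],
})

# "*** rate"-style ratio features of the form X1/X2
df["tax_rate"]       = df["tax_amount"] / df["tax_value"]    # effective tax rate
df["building_ratio"] = df["living_area"] / df["lot_area"]    # share of the lot that is built on
```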

  • The second method is called extracting nonlinear features.

A linear model has the advantages of simplicity and speed, but its disadvantage is also obvious: it can only express linear relationships, and real problems rarely contain such simple linear relationships. The solution is therefore to use nonlinear features inside a linear model.

Some students may be a little confused. What is a linear model with nonlinear features? After reading the following example, it will be clear:

Assuming a regression prediction problem, the solution is y=f(x), then there are two forms of expression:

y = w0 + w1·x …… (1)

y = w0 + w1·x + w2·x^2 + … + wn·x^n …… (2)

Formula (1) is a linear regression, which can only express linear relationships, while formula (2) is a polynomial regression, which can express nonlinear relationships. Now we set:

t1 = x, t2 = x^2, t3 = x^3, …, tn = x^n

Now, looking back at equation (2), we get

y = w0 + w1·t1 + w2·t2 + … + wn·tn …… (3)

Some of you may have realized it already: yes, the trick to using nonlinear features in a linear model is to record the nonlinear features as new variables t and then throw them into a linear model. What you get is a model that is linear in t but can express a nonlinear relationship in x.
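A minimal sketch of this substitution with scikit-learn (assuming scikit-learn and NumPy are available; the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 3 - 2.0 * x[:, 0] + rng.normal(scale=0.5, size=200)

# Plain linear regression on x: underfits the cubic shape
linear = LinearRegression().fit(x, y)

# The same linear model, but fed t1=x, t2=x^2, t3=x^3 as features
poly = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                     LinearRegression()).fit(x, y)

print("R^2 linear:", linear.score(x, y))
print("R^2 poly  :", poly.score(x, y))
```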

As for extracting nonlinear features, besides manually constructing such polynomial terms, the intermediate outputs of nonlinear models such as deep networks and SVMs can also be used as nonlinear features. These nonlinear models essentially project the data from a low-dimensional space into a higher-dimensional one before it is fed to a linear model. Why use nonlinear models for extraction? First, it saves manual effort; second, a model can often learn features that human common sense is unaware of. And why not simply use the nonlinear models end to end? Because each nonlinear model extracts nonlinear features in a different way, so the extracted features are naturally different and can be combined.
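One common instance of this idea (not necessarily what was used in the competition, just an illustration) is to take the leaf indices produced by a gradient-boosted tree model as nonlinear features, one-hot encode them, and feed them to a linear model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Nonlinear feature extractor: each sample is described by which leaf it lands in
gbdt = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0).fit(X_tr, y_tr)
enc = OneHotEncoder(handle_unknown="ignore").fit(gbdt.apply(X_tr))

# Linear model trained on the one-hot leaf indicators
linear = Ridge().fit(enc.transform(gbdt.apply(X_tr)), y_tr)
print("R^2 on held-out data:", linear.score(enc.transform(gbdt.apply(X_te)), y_te))
```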

One thing to note is that the decision to use models to extract nonlinear features must take the actual situation into account: once the nonlinear features of several models are combined, or model fusion techniques are applied, the amount of computation or the latency may become unacceptable, and some approaches may not be feasible in engineering practice at all (even in a distributed computing environment), not to mention that most people who play Kaggle are individuals with only a desktop or a laptop.

Model training and selection

Splitting the data sensibly into training and test sets, balancing samples, cross-validation and the like are old hat, and there is plenty of material about them online, so this article will not dwell on them.

Let's talk about model selection first. Machine learning is often jokingly called metaphysics. At first I thought this was because people did not understand the principles behind the algorithms, but later I found that even when you do understand the principles, it is still metaphysics.

Why do I say that? Consider how other kinds of program errors are solved: print statements? Binary search to narrow down the bug? Whichever method you use, as long as the bug can be reproduced, you can always locate the error from the program's behavior. Machine learning is not like that. Apart from overfitting, which can be detected through certain indicators so that the corresponding parameters can be adjusted, other problems cannot be located from the symptoms alone.

For example, if the prediction is poor, I know I need to add features, but which features? Does the feature require a new data dimension, or can it be derived from the existing data? Or has the value of the current data already been squeezed dry? Put another way, if you compute the correlation coefficient between a feature and the target, is a feature with a low correlation necessarily useless? Obviously not, because you are only measuring single variables, not combinations of features. And can you really evaluate all those combinations? There is no guidance that is 100% accurate for such questions (perhaps not even 90% accurate).

Of course, there is a workaround: keep piling up features and models, do a round of selection and fusion at the end, and hope that different models learn different aspects of the data and complement one another.

When it comes to action, it’s just like the old saying: make bold assumptions and carefully verify them.

Having said all that, the point is simply this: machine learning may be theoretical science on paper, but in practice it is experimental science. And since it is experimental, we must obey one principle: simple first, complex later. Premature optimization is the root of all evil, and mindless optimization is a waste of time. "Simple" refers to the model: for numerical data you can start with plain linear regression (in the image domain you can pick a relatively basic DL model, such as a pre-trained VGG). This way you get results far faster than with messy, complex models. Fast results mean you can settle on an appropriate offline validation scheme and a baseline as early as possible, quickly verify your other ideas, and then slowly optimize and tune parameters.
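A minimal sketch of such a baseline (assuming scikit-learn; the file name and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical: features already preprocessed into a numeric table
train = pd.read_csv("train_preprocessed.csv")          # hypothetical file
X, y = train.drop(columns=["target"]), train["target"]  # hypothetical target column

# Simple, fast baseline: linear regression scored with 5-fold cross-validation
baseline = LinearRegression()
scores = cross_val_score(baseline, X, y, cv=5, scoring="neg_mean_absolute_error")
print("baseline MAE: %.4f (+/- %.4f)" % (-scores.mean(), scores.std()))
```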

At the end of this section, let’s talk about my thoughts on using deep learning in this competition.

It is generally agreed that deep learning is unsuitable when the amount of data is small, but the amount of data provided to contestants is usually quite considerable, so the data-volume condition should be satisfied here.

The second generally recognized unsuitable case is data without local correlation. In fields such as images and natural language, the data does have local correlation: a single pixel carries too little information, but a patch of pixels can tell whether it is a puppy or a kitten; in language, sentences are composed of words, and once their order is scrambled, the overall meaning is scrambled too. Such locally correlated data can be exploited by a suitable network topology, and depth can then be used to extract hierarchical features, yielding excellent results. Comparing a three-layer MLP with a CNN on MNIST handwritten digit recognition illustrates this point perfectly (see the sketch below). For data without local correlation, no specialized network topology can capture its structure; within deep learning only an MLP can be used, and an MLP is generally weaker than traditional models such as GBDT and RandomForest.
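A minimal sketch of that comparison, assuming TensorFlow/Keras is available (this is not the author's code, just one way to reproduce the observation):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr = x_tr.astype("float32") / 255.0
x_te = x_te.astype("float32") / 255.0

# Three-layer MLP: ignores the 2D layout of the pixels
mlp = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Small CNN: exploits local correlation between neighbouring pixels
cnn = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Reshape((28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

for name, model in [("MLP", mlp), ("CNN", cnn)]:
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_tr, y_tr, epochs=3, batch_size=128, verbose=0)
    print(name, "test accuracy:", model.evaluate(x_te, y_te, verbose=0)[1])
```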

Compared with traditional methods, the biggest advantage of deep learning is automatic feature extraction. However, according to experience shared by other practitioners, once manual feature engineering reaches a certain level, traditional models can surpass deep learning; for details, see the video by the yesofcourse team on the Quora similar-text matching competition, which compares deep learning with traditional machine learning models. In addition, some have pointed out that deep learning learns very well when the relationship between x and y involves a single step of reasoning, but with two or more steps of reasoning it is at a clear disadvantage compared with traditional models, although deep learning combined with knowledge graphs can mitigate this problem.

(My personal guess is that the "reasoning" referred to here means logical relationships, such as working out what you should call your father's mother's second uncle's wife's sister, rather than the mathematical relationship y = f(x).)

The actual situation in this competition is more interesting. After basic processing of the data set (filling missing values and removing outliers), without any feature extraction, I fed it into xgboost and into a 3-layer, 128-unit MLP. The leaderboard (LB) and offline results of the two models were very close; the differences only showed up around the sixth or seventh decimal place. But once I manually extracted a few features, the two models began to diverge: offline there was not much difference, but on the LB xgboost pushed my ranking up significantly, while the MLP made my LB score drop very noticeably! At first I thought it was overfitting, so I tried adding dropout and regularization terms, with no improvement. I also tried using an autoencoder for feature extraction and feeding the result into traditional models, but the effect was unsatisfactory. In the end I decided to drop the DL approach and use traditional machine learning models plus manual feature extraction instead. But before giving up on DL I had already wasted too much time on it (because I had blind confidence in DL and kept assuming the poor results were due to my own hyperparameters). There is no silver bullet; every situation must be analysed on its own terms. That is something to keep in mind in the future.

In fact, through analysis I found that the reason DL produced much larger errors is probably that the labels are roughly half positive and half negative, while the model's outputs were almost all positive; even when it did output a negative value, its magnitude was tiny. I have not found a suitable fix for this.

By contrast, both xgboost and linear regression produce predictions that are roughly 50% positive and 50% negative. So the current solution is to take a weighted average of the final outputs of the two models. At the moment (2017.7.22) my ranking has reached 342/1702. Following the usual routine, the next step would be a round of model fusion, and I am confident it would lift my position further, but because I have to prepare for the postgraduate entrance examination this year, and I feel I have gained enough from this competition, I have no plans to continue. If I do continue, it will be after the exam.
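A minimal sketch of that weighted average (the weight and prediction arrays below are placeholders, not the values actually used):

```python
import numpy as np

# Predictions from the two models on the same test set (hypothetical arrays)
pred_xgb = np.array([0.012, -0.034, 0.051, -0.008])
pred_lr  = np.array([0.015, -0.021, 0.047, -0.011])

# Blend with a hand-chosen weight; w would normally be tuned on a validation set
w = 0.7
pred_blend = w * pred_xgb + (1 - w) * pred_lr
print(pred_blend)
```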

Model Fusion

Regarding model fusion, there are again plenty of methods online, and there has not been much innovation in recent years. The main point to make is that you should plot an error curve before fusing models, to determine whether fusion is worthwhile at all. For example, if one model is better than another on essentially every sample, the value of fusing them is very low.
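A minimal sketch of such a check, assuming validation-set predictions from two models are available (all arrays here are hypothetical):

```python
import numpy as np

# Hypothetical validation targets and predictions from two models
y_val  = np.array([0.10, -0.20, 0.05, 0.30, -0.15])
pred_a = np.array([0.12, -0.25, 0.00, 0.28, -0.05])
pred_b = np.array([0.05, -0.18, 0.06, 0.20, -0.16])

err_a = np.abs(pred_a - y_val)
err_b = np.abs(pred_b - y_val)

# If model A wins on (almost) every sample, fusion has little to offer;
# if the wins are split, the errors are complementary and blending may help.
share_a_wins = np.mean(err_a < err_b)
print("fraction of samples where A beats B:", share_a_wins)
print("error correlation:", np.corrcoef(err_a, err_b)[0, 1])
```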

Afterword

Reading ten thousand books is not as good as travelling ten thousand miles; travelling ten thousand miles is not as good as meeting countless people; meeting countless people is not as good as being guided by an expert. Taking part in competitions open to the whole of society is, frankly, more meaningful than taking part in certain school competitions I would rather not name. Although it is hard to win prizes, you get exposed to processes close to the real world and to people of different levels and industry backgrounds, and each friend you make introduces you to even more impressive friends. These opportunities are of great benefit to your knowledge, skills and network, and if you are still in school the potential value doubles.

Finally, because this competition is still ongoing, I will not share my code, but if you really just want something ready-made, there are plenty of kernels on Kaggle that rank higher than mine. If you just want to discuss the problem, you are welcome to leave a comment or send a private message; text should be enough.

Finally, since I do not know much about the machine learning or deep learning research level of Chinese universities other than a few top schools (and I am not too picky about the sub-direction), candidates in the Guangdong area are welcome to recommend schools and laboratories. Thank you in advance; I promise not to eat you for the recommendation. :P
