Aiti Tribe Story Collection (33): "Xiaobai" will show you how to play data analysis in minutes

[51CTO.com original article] Data "Xiaobai" started out as a workplace rookie, became a "cousin" (Chinese slang for a spreadsheet expert, punning on 表), and is still trekking through the data field today. What draws her so strongly to data? How did she become a senior data analysis engineer? Read on to find out.

Xiaobai is a girl born after 1985: frank, optimistic, and curious. Because her surname is Bai, her friends all call her Xiaobai. It was that curiosity that set her off on her journey as a data novice.

Xiaobai·Senior Data Analysis Engineer

Xiaobai graduated from a second-tier university with a major in applied mathematics and statistics. At school she picked up the fundamentals of statistics, read widely in the field, and published two papers in statistics journals without much difficulty. After graduation she entered the workplace with her heart set on data and joined the ranks of "Beijing drifters". Her first job was at an education company with a strong research atmosphere, doing data analysis for the education industry. After that she worked in e-commerce, education, consulting, and traditional manufacturing. From workplace rookie, to "cousin", to big data products, she is still climbing mountains and crossing ridges in the data field today.

How can a newbie become stronger when entering the workplace?

Once, Xiaobai received a call from a headhunter recommending an algorithm engineer position at an Internet finance company. Her restless heart began to stir, so she made an appointment and walked in, where she met an unkempt, unruly interviewer. Most of his questions left her confused, but one caught her interest (a caveat: Xiaobai's algorithm skills are weak; she can do little more than call packaged algorithms in R). It was the question everyone wrestles with: how simple or complex should a model be, and how do you improve its accuracy while dealing with the overfitting that complexity brings? Taking that question as a starting point, she would like to share some of her modest understanding.

Model complexity and overfitting

On model simplicity versus complexity: a model that is too simple may classify or predict inaccurately. In today's environment, where the quality of big data is generally poor, some people chase model complexity to compensate for defects in data quality, which ultimately drives up the complexity of the algorithm. But is a more complex model necessarily a better one?

[Figure: prediction error versus model complexity, from the Statistics City Forum]

The horizontal axis of the figure is model complexity; the vertical axis is prediction error. It clearly shows that as model complexity increases, the prediction error on the training set steadily falls toward 0 (a seemingly perfect fit: cue the flowers and applause). The error on the test set, however, is anything but perfect; why is it so high? This is the so-called "overfitting" phenomenon. So the choice of model is not "the more complex the better": the goal is to choose the best model (within a given model family), and the best model is the one that performs well on indicators such as prediction error on new data.
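The U-shaped test-error curve described above is easy to reproduce. The sketch below is a hypothetical illustration (it is not from the original article, and it uses NumPy polynomial fitting rather than R): it fits polynomials of increasing degree to noisy samples of a sine curve, and training error keeps falling with degree while test error typically bottoms out and then climbs again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of sin(2*pi*x): the "truth" the model should recover.
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

def errors(degree):
    """Fit a polynomial of the given degree to the training set;
    return (train RMSE, test RMSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_rmse = np.sqrt(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_rmse = np.sqrt(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    return train_rmse, test_rmse

for d in (1, 3, 9, 15):
    tr, te = errors(d)
    print(f"degree={d:2d}  train RMSE={tr:.3f}  test RMSE={te:.3f}")
```

Degree 1 underfits (both errors high); around degree 3 the test error is lowest; by degree 15 the training error is tiny while the test error has worsened again.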

How do you deal with the model's "overfitting" problem? Regularization! Regularization! Regularization! (Important things are said three times.) And one more truism: having more high-quality data with wider dimensions beats a clever model. In the era of big data, data is king!
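As a concrete sketch of what regularization does, the snippet below implements closed-form ridge regression on deliberately over-parameterised polynomial features (a hypothetical illustration, not the method from the article): increasing the penalty λ shrinks the weight vector, which is exactly what tames the wild oscillations of an overfit curve.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

# Degree-12 polynomial features: deliberately over-parameterised for 15 points.
X = np.vander(x, 13, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Larger lambda -> smaller weight norm -> smoother, less overfit curve.
for lam in (1e-8, 1e-3, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:g}  ||w|| = {np.linalg.norm(w):.2f}")
```

With an almost-zero penalty the coefficients blow up to fit the noise; as λ grows, the weight norm shrinks monotonically, trading a little training error for much better behavior on new data.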

Common pitfalls: data definition and data understanding

Understanding the data is an essential skill for every data analyst, but under the loose data management of the past, formal data definitions are often missing. If an analyst then relies on experience alone to interpret the data, no matter how rigorous or advanced the subsequent analysis and algorithm models are, the results will be heavily discounted, or even worthless.

A story first: NASA launched the Mars Climate Orbiter in 1998, and contact was lost in 1999. The cause was human error. The ground software that computed thruster impulse produced its output in the imperial unit pound-force seconds, while the navigation software consuming those numbers assumed the metric unit newton-seconds. The mismatch sent the probe into the atmosphere at the wrong altitude, where it broke apart.
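The failure mode is easy to reproduce in miniature. The toy sketch below (invented numbers and function names, not NASA's actual software) shows how an unchecked unit assumption at an interface silently scales every value by a factor of about 4.45.

```python
# Toy reconstruction of the Mars Climate Orbiter failure mode: one
# component reports thruster impulse in pound-force seconds, the consumer
# assumes newton-seconds, and nothing at the interface checks the units.
LBF_TO_NEWTON = 4.448222  # 1 pound-force = 4.448222 newtons

def ground_impulse_lbf_s() -> float:
    """Ground-side calculation, in imperial units (hypothetical value)."""
    return 100.0  # pound-force seconds

def navigation_update(impulse_n_s: float) -> float:
    """Consumer that silently assumes the input is already in SI units."""
    return impulse_n_s

# The bug: the imperial value is passed through as if it were SI, so the
# modelled impulse is ~4.45x smaller than what the thrusters really did.
modelled = navigation_update(ground_impulse_lbf_s())
actual = ground_impulse_lbf_s() * LBF_TO_NEWTON
print(f"modelled: {modelled:.1f} N*s, actual: {actual:.1f} N*s "
      f"({actual / modelled:.2f}x discrepancy)")
```

The defence is the same as for any data-definition gap: make the unit part of the interface (in the name, the type, or a validation check) instead of leaving it to convention.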

Now an example from my own work. A business scenario required visualizing some business indicators. Midway through the data pipeline, the leader suddenly discovered something was wrong: why were my performance indicators missing? Who takes the blame? The data people! You can imagine what came next: everyone with "data" in their title started checking. The logic? No problem. The storage tasks? No problem. The data synchronization time point? No problem. The leader rolled his eyes: it must be wrong; if nothing is broken, how can the data be wrong? In his heart he was probably already questioning both your EQ and your IQ. The actual cause? A business field in the production database had been lengthened: imagine a 15-digit ID number quietly growing into an 18-digit one.
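A cheap defence against this kind of silent schema drift is to validate field shapes at the pipeline boundary. The sketch below is a hypothetical illustration (the field name and expected length are invented, not the actual production check): it flags any record whose field no longer matches the documented definition.

```python
# Minimal sketch of catching silent schema drift before it poisons
# downstream reports, using the ID-number analogy from the text.
EXPECTED_LENGTHS = {"member_id": 15}  # hypothetical field definition

def validate_row(row: dict) -> list:
    """Return a list of schema-drift warnings for one record."""
    problems = []
    for field, expected in EXPECTED_LENGTHS.items():
        value = str(row.get(field, ""))
        if len(value) != expected:
            problems.append(
                f"{field}: expected length {expected}, got {len(value)}"
            )
    return problems

rows = [
    {"member_id": "110101850101123"},     # old 15-digit format: passes
    {"member_id": "110101198501011234"},  # new 18-digit format: flagged
]
for row in rows:
    for warning in validate_row(row):
        print("schema drift:", warning)
```

Run at ingestion time, a check like this turns a day of finger-pointing into one log line naming the exact field that changed.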

How do you climb out of this pit? The high-end route is a discipline now taking shape: data governance (DAMA certification is well worth having). The low-end route is for a humble data analyst to chase the business staff around and ask questions shamelessly. Don't think your questions are stupid; you were probably once just as clueless.

Another pitfall: predicting individuals from group-level conclusions

The previous pitfall was the analyst misunderstanding the business; the next is the business side misunderstanding the analysis conclusions.

Another example. About two years ago, big data was treated as something godlike. Now opinions have begun to diverge. Some people underestimate its power; why? Plenty of companies have spent the money and heard nothing back. Some are still infatuated with big data research, while others have begun to realize, more soberly, that its power does not come overnight: you do not master big data by setting up a "Big Data XXX" department and hiring a few engineers. Today I will not dwell on big data itself, but offer one data analysis example from Xiaobai's own project experience.

Project purpose: analyze the repeat-purchase behavior of online-education students (e-commerce remarketing is famously successful, and retaining an old customer costs far less than acquiring a new one). A very meaningful project, but note: this was academic online education.

Project process: the data experts all showed off their skills at cleaning data, building models, testing them, and evaluating accuracy. After this series of moves, the conclusion was: test accuracy above 90%. But historical data shows that more than 90% of students never re-register, so the model barely beats always predicting "won't return", and the available feature indicators explained only about 10% of the outcome. What does that mean? Which indicators should be added? With no clear direction, the effort stalled and was abandoned.
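A quick way to see why 90% accuracy impressed nobody here: when more than 90% of students never return, the trivial rule "predict that no one returns" already scores about 90%. A hypothetical sketch with invented numbers:

```python
# With a 90% majority class, the do-nothing baseline already reaches
# 90% accuracy, so a 90%-accurate model adds essentially nothing.
labels = [0] * 90 + [1] * 10  # 0 = never re-registers, 1 = returns

baseline_predictions = [0] * len(labels)  # always predict "won't return"
accuracy = sum(p == y for p, y in zip(baseline_predictions, labels)) / len(labels)
print(f"majority-class baseline accuracy: {accuracy:.0%}")
```

This is why, for imbalanced outcomes, accuracy alone is the wrong yardstick; the interesting question is whether the model finds the rare returners better than chance.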

Late stage of the project: the leader reappears. "The project has reached its conclusion, so predict which students will come back to study. Just tell me who will come." Isn't the data analyst a little speechless? Do it or not, the blame lands on them either way. How to get out of this pit? Let the torrent of time wash it all away!

【Written at the end】

The above is some of my own modest understanding. Data exploration is sometimes tedious, but seeing objective reality through the data is exciting and fulfilling. I hope to keep growing together with everyone tirelessly exploring the fields of data and technology, and I wish 51CTO ever greater success.

If you are also willing to share your story, please join the 51CTO developer QQ exchange group 669593076 and contact the group owner Xiaoguan. We look forward to your wonderful story!

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
