"Google's self-driving cars and robots get a lot of attention, but our real future lies in the technology that makes computers smarter and more human, machine learning." — Eric Schmidt (CEO of Google) As computing moves from mainframes to personal computers to the cloud, we are probably in the most critical period in human history, not because of what has been achieved, but because of the progress and achievements we will make in the next few years. For me, what is most exciting nowadays is the popularization of computing technology and tools, which has brought about the spring of computing. As a data scientist, I can build a data processing system to perform complex algorithm calculations and earn a few dollars per hour. But learning these algorithms took me countless days and nights. So who can benefit most from this article? This article may be the most valuable one I have ever written. The purpose of writing this article is to help those who are interested in data science and machine learning to avoid the hassles of learning algorithms. I will give some examples of machine learning problems in the article, and you can also get inspiration in the process of thinking about solving these problems. I will also write down some personal understanding of various machine learning algorithms and provide R and Python execution codes. After reading this article, readers can at least take action and try to write a machine learning program by hand. However, this article does not explain the statistical principles behind these algorithms. Sometimes it is also a good way to learn from practice. If you want to understand these statistical principles, then the content of this article may not be suitable for you. Generally speaking, there are three types of machine learning algorithms: 1. Supervised Learning Supervised learning algorithms consist of a target variable (dependent variable) and predictor variables (independent variables) that are used to predict the target variable. With these variables, we can build a model so that for a known value of the predictor variable, we can get the corresponding value of the target variable. The model is repeatedly trained until it reaches a predetermined accuracy on the training data set. Algorithms that belong to supervised learning include: regression model, decision tree, random forest, K-nearest neighbor algorithm, logistic regression, etc. 2. Unsupervised Learning Unlike supervised learning, in unsupervised learning we do not have a target variable that needs to be predicted or estimated. Unsupervised learning is used to classify objects in general. It is widely used to classify customers according to a certain indicator. Algorithms that belong to unsupervised learning include: association rules, K-means clustering algorithm, etc. 3. Reinforcement Learning This algorithm can be used to train a program to make a decision. The program tries all possible actions in a situation, records the results of different actions and tries to find the best one to make a decision. Algorithms that belong to this category include Markov decision processes. Common Machine Learning Algorithms The following are the most commonly used machine learning algorithms that can solve most data problems: 1. Linear Regression 2. Logistic Regression 3. Decision Tree 4. Support Vector Machine (SVM) 5. Naive Bayes 6. K-nearest neighbor (KNN) 7. K-means algorithm 8. Random Forest 9. Dimensionality Reduction Algorithms 10. Gradient Boost and Adaboost algorithms 1. 
1. Linear Regression
Linear regression uses continuous variables to estimate actual values (such as house prices, number of calls, or total sales). We use the linear regression algorithm to find the best linear relationship between the independent variable and the dependent variable, which corresponds to the best straight line on a graph. This best straight line is the regression line, and the relationship can be expressed as Y = aX + b.

We can imagine a scenario to understand linear regression. Suppose you ask a fifth grader to line up the classmates in the class from lightest to heaviest without asking them how much they weigh. What will the child do? He will probably line them up by looking at everyone's height and build. This is linear regression! The child is assuming that height and build are somehow related to weight, and that relationship is like the relationship between Y and X above.

In the formula Y = aX + b:
Y is the dependent variable
a is the slope
X is the independent variable
b is the intercept

a and b are obtained by minimizing the sum of squared differences between the observed and predicted values of the dependent variable (the least squares method). In the figure below, the fitted regression equation is y = 0.2811x + 13.9; with this equation we can estimate a person's weight from his height. (A short worked example of the least squares computation follows the code below.)

There are two main types of linear regression: simple linear regression, with a single independent variable, and multiple linear regression, with several independent variables. When fitting a regression you can also use polynomial or curvilinear terms to get the best fit.

Python code
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
# Equation coefficients and intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
# Predict output
predicted = linear.predict(x_test)

R code
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
# Predict output
predicted <- predict(linear, x_test)
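As a worked example of the least squares step, the sketch below computes the slope a and intercept b directly from data using a = cov(X, Y) / var(X) and b = mean(Y) - a * mean(X). The height and weight numbers are made up for illustration; they are not the data behind the article's y = 0.2811x + 13.9 figure.

import numpy as np

# Made-up heights (cm) and weights (kg), for illustration only
heights = np.array([150, 155, 160, 165, 170, 175, 180])
weights = np.array([52, 55, 58, 62, 66, 70, 75])

# Least squares estimates: a = cov(X, Y) / var(X), b = mean(Y) - a * mean(X)
a = np.cov(heights, weights, bias=True)[0, 1] / np.var(heights)
b = weights.mean() - a * heights.mean()
print(f"weight = {a:.4f} * height + {b:.2f}")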
2. Logistic Regression
Don't be confused by its name: logistic regression is a classification algorithm, not a regression algorithm. It uses known independent variables to predict a discrete dependent variable (such as the binary values 0/1, yes/no, true/false). In simple terms, it predicts the probability of an event by fitting the data to a logit function, so its output is a probability and naturally lies between 0 and 1.

Again, an example helps. Suppose a friend asks you to solve a problem, and there are only two possible outcomes: you solve it or you don't. To find out which subject areas you are best at, you try problems from many different fields. The result of this study might look like this: for a tenth-grade trigonometry problem you have a 70% chance of solving it, but for a fifth-grade history question the probability might be only 30%. Logistic regression gives you probability results of this kind.

Back to the math: the log odds of the outcome are modeled as a linear combination of the predictor variables:

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

Here, p is the probability of the event we are interested in. The parameters are estimated by choosing the values that maximize the likelihood of the observed sample, rather than by minimizing the sum of squared errors as in ordinary regression.

You may ask why we take logarithms. The short answer is that the log of the odds is one of the best mathematical ways to replicate a step function. Since that is not the topic of this article, I will not go into further detail.

Python code
#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Equation coefficients and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
# Predict output
predicted = model.predict(x_test)

R code
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
# Predict output
predicted <- predict(logistic, x_test)

Going further, here are some ideas you can try to improve the model: add interaction terms, remove features, apply regularization, or use a nonlinear model.

3. Decision Tree
This is my favorite and most frequently used algorithm. It belongs to supervised learning and is often used to solve classification problems; surprisingly, it works for both categorical and continuous variables. The algorithm splits a population into two or more groups, based on the most significant feature variables/independent variables that distinguish the population. For more details, you can read the article Decision Tree Simplified.

In the figure above, the population is ultimately divided into four groups according to whether people will play or not, based on several characteristic variables. Several criteria are used to decide the splits, such as Gini, information gain, Chi-square, and entropy (a short sketch computing Gini and entropy appears at the end of this section).

The best way to get a feel for how decision trees work is to play Jezzball, a classic Microsoft game (see the picture below). The goal is to build walls in a room with moving balls so as to partition off as much ball-free space as possible. Every time you build a wall, you split the room into two parts. Decision trees split a population in a similar way, into groups that are as distinct from each other as possible.
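As a hedged illustration of the splitting criteria mentioned above, the sketch below computes the Gini impurity and the entropy of a single node; the class counts (9 versus 5) are hypothetical.

import numpy as np

def gini(counts):
    # Gini impurity: 1 - sum(p_i^2)
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    # Entropy: -sum(p_i * log2(p_i)), ignoring empty classes
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical node with 9 "play" and 5 "don't play" observations
print(gini([9, 5]))     # about 0.459
print(entropy([9, 5]))  # about 0.940

A split is considered good when it produces child nodes whose impurity (by either measure) is lower than that of the parent node.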
Further reading: Simplified Version of Decision Tree Algorithms

Python code
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create tree object for classification; the criterion can be 'gini' or 'entropy' (information gain), the default is gini
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)

R code
library(rpart)
x <- cbind(x_train, y_train)
# Grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# Predict output
predicted <- predict(fit, x_test)

4. Support Vector Machine (SVM)
This is a classification algorithm. We plot each data point in an n-dimensional space (n being the number of features), with each feature value giving the corresponding coordinate. For example, with two features, a person's height and hair length, we plot the two variables in a two-dimensional space where every point has two coordinates. (The points that end up closest to the separating boundary are the support vectors that give the algorithm its name.)

We then look for the straight line that best separates the two groups of points: the line for which the closest points from each group are as far from it as possible (the maximum margin). In the figure above, the black line is the best dividing line, because its distance to the closest points of the two groups, point A and point B, is the largest; any other line would be closer to one of these points. We can then classify new data according to which side of the line it falls on. (A small sketch recovering such a dividing line with scikit-learn appears after the code below.)

More reading: Simplified Version of Support Vector Machine

We can think of this algorithm as playing JezzBall in n-dimensional space, with a few changes:
You can draw dividing lines/planes at any angle (in the classic game, walls are only vertical or horizontal).
The aim of the game is now to sort balls of different colors into different regions.
The balls do not move.

Python code
#Import Library
from sklearn import svm
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create SVM classification object; there are various options associated with it, this is a simple classification setup
model = svm.SVC()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)

R code
library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
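To connect the geometric picture with code, here is a minimal sketch on made-up two-feature data (stand-ins for height and hair length). It uses a linear kernel, an assumption added here so that the separating boundary is a straight line w1*x1 + w2*x2 + b = 0 that can be read off from coef_ and intercept_.

import numpy as np
from sklearn import svm

# Made-up data: each row is (height, hair length); two groups of three points
X = np.array([[165, 35], [160, 40], [158, 45], [175, 5], [180, 8], [178, 10]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel, so the decision boundary is a straight line
model = svm.SVC(kernel="linear")
model.fit(X, y)

w = model.coef_[0]
b = model.intercept_[0]
print(f"boundary: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")
print("support vectors:", model.support_vectors_)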
5. Naive Bayes
This is a classification method based on Bayes' theorem, with the assumption that the independent variables are independent of each other. In short, Naive Bayes assumes that the presence of one feature is unrelated to the presence of any other. For example, if a fruit is red, round, and about 7 cm in diameter, we might guess it is an apple. Even if these features are in fact related, Naive Bayes treats red, round, and the diameter as contributing independently to the probability that the fruit is an apple.

Naive Bayes models are easy to build and very efficient on large amounts of data. Although the model is simple, it often outperforms much more complicated classification methods.

Bayes' theorem tells us how to compute the posterior probability P(c|x) from the prior probability P(c), the prior P(x), and the conditional probability P(x|c):

P(c|x) = P(x|c) * P(c) / P(x)

P(c|x) is the posterior probability of class c given feature x.
P(c) is the prior probability of class c.
P(x|c) is the likelihood, the probability of observing feature x given class c.
P(x) is the prior probability of feature x.

Example: the following training set contains weather variables and the target variable "whether to go out and play". We want to classify whether people play or not based on the weather conditions. The process is as follows:
Step 1: Build a frequency table from the known data.
Step 2: Compute the probability of each case and build a likelihood table. For example, the probability of Overcast is 0.29, and the overall probability of playing is 0.64.
Step 3: Use the Naive Bayes equation to compute the posterior probability of playing and of not playing for each weather condition. The class with the larger posterior probability is the prediction.

Question: people will play when the weather is sunny. Is this statement correct? We can answer it with the method above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny). Here, P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64. Then P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is greater than 0.5, so playing is the more probable outcome. (The short sketch after this section's code reproduces this arithmetic.) With multiple classes and multiple features, the prediction method is similar. Naive Bayes is often used for text classification and problems with multiple classes.

Python code
#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create a Gaussian Naive Bayes object; there are other variants for multinomial classes, such as Bernoulli Naive Bayes
model = GaussianNB()
# Train the model using the training sets
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R code
library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
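The sketch below simply reproduces the worked P(Yes | Sunny) calculation from the weather example, using the article's own counts (3/9, 9/14, 5/14); it is plain arithmetic rather than a fitted model.

# Worked Bayes computation from the weather example above
p_sunny_given_yes = 3 / 9    # P(Sunny | Yes)
p_yes = 9 / 14               # P(Yes)
p_sunny = 5 / 14             # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6, so "play" is the more probable outcome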
6. K-Nearest Neighbors (KNN)
This algorithm can solve both classification and regression problems, but in industry it is used more often for classification. KNN stores all available data, then uses a distance function to find the K records closest to a new observation, and finally predicts the new observation from the most common class among those K neighbors. The distance function can be the Euclidean, Manhattan, Minkowski, or Hamming distance; the first three are used for continuous variables, and Hamming distance is used for categorical variables. If K = 1, the problem reduces to classifying by the single nearest record. Choosing the value of K is often the key step in KNN modeling.

KNN is widely used in everyday life: if you want to learn about someone you do not know, you might ask his close friends and the circles he moves in.

Before using KNN you need to consider:
KNN is computationally expensive.
All features should be normalized to a similar scale, otherwise features with larger ranges will dominate the computed distances.
Preprocess the data before running KNN, for example by removing outliers and noise.

Python code
#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # the default value for n_neighbors is 5
# Train the model using the training sets
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R code
library(class)
# knn() from the class package classifies the test points directly from the training data
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)

7. K-Means
This is an unsupervised learning algorithm for clustering. It partitions a given data set into a chosen number of clusters (say K clusters): data points within the same cluster are similar to one another, and points in different clusters are dissimilar. Remember working out shapes from ink blots? K-means is somewhat similar: you look at the shape and spread of the blobs to judge how many clusters there are.

How does K-means form the clusters?
1. Pick K points as the initial centroids, one for each cluster.
2. Assign each data point to the closest centroid, producing K clusters.
3. Recompute the centroid of each cluster; these become the new centroids.
4. Repeat steps 2 and 3 until the result converges, that is, until the centroids stop changing.

How to choose K: for each cluster, compute the sum of squared distances from its points to its centroid, then add these sums over the clusters to get the total within-cluster sum of squares for that solution. As the number of clusters increases this total always decreases, but if you plot it against K you will see that it drops quickly up to a certain value of K and much more slowly afterwards. That value is the optimal number of clusters (a short sketch of this "elbow" check follows the code below).

Python code
#Import Library
from sklearn.cluster import KMeans
#Assumed you have X (attributes) for the training data set and x_test (attributes) of the test dataset
# Create KMeans object
k_means = KMeans(n_clusters=3, random_state=0)
# Train the model using the training set
k_means.fit(X)
# Predict output
predicted = k_means.predict(x_test)

R code
library(cluster)
fit <- kmeans(X, 3)  # 3-cluster solution
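Here is a minimal sketch of the "elbow" check described above, on made-up data with three blobs; the data and the range of K values are illustrative assumptions. The total within-cluster sum of squares is exposed by scikit-learn as the inertia_ attribute.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Made-up data: three loose blobs in two dimensions
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 5, 10)])

# Total within-cluster sum of squares (inertia) for each candidate K
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# Choose the K where the drop in inertia levels off (the "elbow"), here K = 3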
8. Random Forest
Random forest is the name for an ensemble of decision trees. A random forest contains many decision trees (hence the "forest"). To classify a new observation, each tree gives its own classification, and the forest chooses the classification with the most votes as the result.

How each tree is grown:
If the training set contains N observations, N samples are drawn at random with replacement; this bootstrap sample is the training set for that tree.
If there are M feature variables, a number m << M is chosen; at each node, m features are selected at random from the M and the best split on those m features is used to split the node. m is held constant while the forest grows.
Each tree is grown as deep as possible; there is no pruning.

For more details on the algorithm, on comparing decision trees with random forests, and on tuning the model parameters, I recommend reading these articles:
Introduction to Random Forest – Simplified
Comparing a CART model to Random Forest (Part 1)
Comparing a Random Forest to a CART model (Part 2)
Tuning the parameters of your Random Forest model

Python code
#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create Random Forest object
model = RandomForestClassifier()
# Train the model using the training sets
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R code
library(randomForest)
x <- cbind(x_train, y_train)
# Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)

9. Dimensionality Reduction Algorithms
In the past four or five years, the amount of available data has grown almost exponentially. Companies, government agencies, and research organizations not only have more data sources but also capture more dimensions of information. For example, e-commerce companies hold increasingly detailed information about customers: personal details, web browsing history, likes and dislikes, purchase records, feedback, and so on. They pay attention to your individual characteristics and may know you better than the clerk at the supermarket you visit every day.

As data scientists, we often have data with a very large number of features. Although this sounds promising for building more powerful and accurate models, it can be a big problem in modeling: how do we find the most important variables among 1,000 or 2,000? This is where dimensionality reduction helps, together with techniques such as decision trees, random forests, PCA, factor analysis, correlation matrices, and the missing value ratio. For more information, read the Beginners Guide To Learn Dimension Reduction Techniques.

Python code (more information here)
#Import Library
from sklearn import decomposition
#Assumed you have training and test data sets as train and test
# Create PCA object
pca = decomposition.PCA(n_components=k)  # the default value of k is min(n_samples, n_features)
# For factor analysis:
# fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)
# Reduce the dimension of the test dataset
test_reduced = pca.transform(test)

R code
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)

10. Gradient Boosting and AdaBoost
GBM and AdaBoost are both boosting algorithms that improve prediction accuracy when plenty of data is available. Boosting is an ensemble learning method: it improves accuracy by combining, in sequence, the estimates of several weaker classifiers/estimators. These boosting algorithms have performed well in data science competitions such as Kaggle, the AV Hackathon, and CrowdAnalytix.
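The article's own code below uses GradientBoostingClassifier; as a hedged companion sketch for AdaBoost, which this section also names, the snippet fits scikit-learn's AdaBoostClassifier on made-up data (the data and parameter values are illustrative assumptions, not part of the original).

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Made-up stand-ins for the X, y, x_test placeholders used elsewhere in this article
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
x_test = rng.normal(size=(10, 4))

# Each weak learner is reweighted toward the examples the previous ones got wrong
model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0)
model.fit(X, y)
predicted = model.predict(x_test)
print(predicted)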
More reading: Know about Gradient and AdaBoost in detail

Python code
#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R code
library(caret)
x <- cbind(x_train, y_train)
# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
predicted <- predict(fit, x_test, type = "prob")[,2]

GradientBoostingClassifier and random forests are two different ensembles of classification trees, and people often ask about the difference between them: gradient boosting builds its trees sequentially, with each tree correcting the errors of the previous ones, while a random forest builds its trees independently and combines their votes.

◆ ◆ ◆

Conclusion
By now you should have a working understanding of the commonly used machine learning algorithms. I wrote this article and provided the R and Python code precisely so that you can start right away. Practice, deepen your understanding of how these algorithms work, and apply them. You will come to love machine learning!

Compiled by: @Jiujiu  Author: Big Data Digest