In this article, we cover the mathematical details of decision trees, along with various Python examples, and their strengths and weaknesses. Decision trees are simple and easy to understand, but they are generally not competitive with the best supervised learning methods. To overcome their shortcomings, we then focus on Bootstrap Aggregating (bagging) and Random Forests, again with Python examples. Another widely used method, boosting, will be discussed separately later. Each of these methods generates multiple trees that are combined into a single consistent prediction, which often leads to a significant improvement in prediction accuracy.

Decision Tree

A decision tree is a supervised learning algorithm. It works with both categorical and continuous input (feature) and output (response) variables. Tree-based methods divide the feature space into a set of rectangles and then fit a simple model (such as a constant) to each rectangle. Conceptually, they are simple yet effective.

Let's first understand decision trees with an example, and then analyze the process of creating them more formally. Consider a simple dataset of customers of a lending company. We are given the account balance, credit history, length of tenure, and previous loan status of all customers, and the task is to predict whether a customer is a good credit risk. This problem can be solved with a decision tree that splits on these features one at a time.

Classification and regression trees (CART for short) is a term introduced by Leo Breiman to refer to decision tree algorithms used for classification or regression predictive modeling. In scikit-learn, CART-style trees are built with sklearn.tree.DecisionTreeClassifier and sklearn.tree.DecisionTreeRegressor for classification and regression respectively.

CART Model

The CART model involves selecting input variables and split points on those variables until a suitable tree is constructed. A greedy algorithm chooses the input variables and split points so as to minimize a cost function. Tree construction ends when a predefined stopping criterion is met, such as a minimum number of training samples assigned to each leaf node. Other decision tree algorithms include ID3, C4.5, C5.0, and CHAID.
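Since these two scikit-learn classes are used throughout the article, here is a minimal, self-contained sketch of fitting a classification tree and a regression tree. The synthetic data and parameter values below are purely illustrative and are not the lending example above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                              # 200 samples, 3 features
y_class = (X[:, 0] + X[:, 1] > 1).astype(int)     # a simple binary label
y_reg = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Classification tree
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y_class)
print(clf.score(X, y_class))                      # mean accuracy on the training data

# Regression tree
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y_reg)
print(reg.score(X, y_reg))                        # R^2 on the training data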
Regression Tree

Let's now look at the details of the CART algorithm for regression trees. In short, building a regression tree consists of two steps:

1. Divide the predictor space, i.e. the set of possible values of X_1, X_2, ..., X_p, into J distinct and non-overlapping regions R_1, R_2, ..., R_J.
2. For every observation that falls into region R_j, make the same prediction, namely the mean of the response values of the training observations in R_j.

To create the J regions R_1, R_2, ..., R_J, the predictor space is divided into high-dimensional rectangles, or boxes. The goal is to find boxes R_1, R_2, ..., R_J that minimize the RSS,

\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2,

where yhat_Rj is the mean response of the training observations in the jth box. Since considering every possible partition of the feature space is computationally infeasible, we use a greedy approach known as recursive binary splitting. It is greedy because at each step of the tree-building process the best split for that step is chosen, rather than looking ahead and picking a split that would lead to a better tree in some future step. Note that all resulting regions R_j are rectangular.

To perform recursive binary splitting, we first select the predictor X_j and the cut point s that split the predictor space into the regions R_1(j, s) = {X | X_j < s} and R_2(j, s) = {X | X_j >= s} so as to minimize

\sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2,

where yhat_R1 is the mean response of the training observations in R_1(j, s) and yhat_R2 is the mean response of the training observations in R_2(j, s). This process is repeated, searching for the best predictor and split point to further partition the data so as to minimize the RSS within each sub-region. However, instead of splitting the entire predictor space each time, we only split one of the two previously identified regions. The process continues until a stopping criterion is reached; for example, we can require that each region contain no more than m observations. Once the regions R_1, R_2, ..., R_J have been created, we predict the response of a test observation as the mean response of the training observations in the region it falls into.

Classification Tree

Classification trees are very similar to regression trees, except that they predict a qualitative response rather than a quantitative one. As seen above, the value predicted by a regression tree for an observation is the mean response of the training observations in the same leaf node. For a classification tree, the predicted class is instead the most common class among the training observations in that region, i.e. the modal response of the training observations. In practice, the model often does not return only a single class but also the class probabilities within a leaf.

Classification trees are grown much like regression trees: we again use recursive binary splitting. However, RSS cannot serve as the splitting criterion. Instead we define a node impurity measure Q_m, which quantifies how homogeneous the target variable is within the regions R_1, R_2, ..., R_J. Consider node m, representing a region R_m that contains N_m observations.
The proportion of observations of the kth class in the mth region can be expressed as

\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k),

where I(y_i = k) is the indicator function, equal to 1 if y_i = k and 0 otherwise.

A natural way to measure the impurity Q_m is the classification error rate, the fraction of training observations in a region that do not belong to the most common class:

E = 1 - \max_k(\hat{p}_{mk}).

However, this function is not differentiable, so it cannot be optimized numerically, and it is not sufficiently sensitive to changes in the node class probabilities, which makes it a poor criterion for growing trees. In practice, the Gini index and the cross entropy are used instead. The Gini index measures the total variance across the K classes and is defined as

G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}).

A small Gini index indicates that the node contains observations predominantly from a single class. In information theory, cross entropy measures the disorder of a system. For a two-class node, its value is zero if all observations belong to one class, and it reaches its maximum (1, with base-2 logarithms) when the two classes occur equally often. Like the Gini index, cross entropy can therefore be used to measure node impurity:

S = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.

As with G, smaller values of S indicate that the node contains observations predominantly from a single class.

Common Parameters and Concepts of Decision Trees

To understand decision trees mathematically, we first need the general concepts behind trees and tree learning algorithms. Familiarity with the common tuning parameters, such as the maximum tree depth (max_depth), the minimum number of samples required to split a node (min_samples_split), the minimum number of samples per leaf (min_samples_leaf), and the impurity criterion (Gini or entropy), will also help us tune the model later on.
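As a quick illustration of the impurity measures above and of where these tuning parameters appear in scikit-learn, here is a minimal sketch; the class proportions and parameter values below are made up for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Impurity of a node with class proportions p (illustrative values)
p = np.array([0.7, 0.3])
gini = np.sum(p * (1 - p))            # Gini index G
entropy = -np.sum(p * np.log2(p))     # cross entropy S (base-2 logarithms)
error = 1 - p.max()                   # classification error rate
print(gini, entropy, error)

# The same criteria and the main tuning parameters in scikit-learn
tree = DecisionTreeClassifier(
    criterion="gini",       # or "entropy"
    max_depth=8,            # maximum depth of the tree
    min_samples_split=16,   # minimum samples needed to split an internal node
    min_samples_leaf=8,     # minimum samples required at a leaf node
)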
Implementation of Classification Tree

To demonstrate the decision tree models discussed above, we will use the US adult income dataset, which can be downloaded from Kaggle.com. The following code imports the required libraries and reads the data:

import pandas as pd
import numpy as np
from plotnine import *
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

training_data = './adult-training.csv'
test_data = './adult-test.csv'

columns = ['Age','Workclass','fnlgwt','Education','EdNum','MaritalStatus','Occupation','Relationship','Race','Sex','CapitalGain','CapitalLoss','HoursPerWeek','Country','Income']

df_train_set = pd.read_csv(training_data, names=columns)
df_test_set = pd.read_csv(test_data, names=columns, skiprows=1)
df_train_set.drop('fnlgwt', axis=1, inplace=True)
df_test_set.drop('fnlgwt', axis=1, inplace=True)

In the code above, we first import all the required libraries and modules, then read the data into a training set and a test set. We also drop the fnlgwt column because it is not useful for training the model. The following statement shows the first five rows of the training data:

df_train_set.head()

The output shows that some data cleaning is still needed: the ' ?' placeholders should be replaced, and spaces and '.' characters should be removed from all string columns.

# replace the special character ' ?' with 'Unknown'
for i in df_train_set.columns:
    df_train_set[i].replace(' ?', 'Unknown', inplace=True)
    df_test_set[i].replace(' ?', 'Unknown', inplace=True)

# strip spaces and '.' from all non-numeric columns
for col in df_train_set.columns:
    if df_train_set[col].dtype != 'int64':
        df_train_set[col] = df_train_set[col].apply(lambda val: val.replace(" ", ""))
        df_train_set[col] = df_train_set[col].apply(lambda val: val.replace(".", ""))
        df_test_set[col] = df_test_set[col].apply(lambda val: val.replace(" ", ""))
        df_test_set[col] = df_test_set[col].apply(lambda val: val.replace(".", ""))

There are two columns describing an individual's education: Education and EdNum. We assume that these two features are highly correlated, so we can remove the Education column. We also assume that the Country column plays little role in predicting income, so we remove it as well.

df_train_set.drop(["Country", "Education"], axis=1, inplace=True)
df_test_set.drop(["Country", "Education"], axis=1, inplace=True)

The Age and EdNum columns are numeric. We can bin these continuous values into a more compact representation, grouping age into 10-year bins and years of education into 5-year bins.
The implementation code is as follows:

colnames = list(df_train_set.columns)
colnames.remove('Age')
colnames.remove('EdNum')
colnames = ['AgeGroup', 'Education'] + colnames

labels = ["{0}-{1}".format(i, i + 9) for i in range(0, 100, 10)]
df_train_set['AgeGroup'] = pd.cut(df_train_set.Age, range(0, 101, 10), right=False, labels=labels)
df_test_set['AgeGroup'] = pd.cut(df_test_set.Age, range(0, 101, 10), right=False, labels=labels)

labels = ["{0}-{1}".format(i, i + 4) for i in range(0, 20, 5)]
df_train_set['Education'] = pd.cut(df_train_set.EdNum, range(0, 21, 5), right=False, labels=labels)
df_test_set['Education'] = pd.cut(df_test_set.EdNum, range(0, 21, 5), right=False, labels=labels)

df_train_set = df_train_set[colnames]
df_test_set = df_test_set[colnames]

Now that the data is cleaned, the following statement gives an overview of the class distribution:

df_train_set.Income.value_counts()

In both the training and test sets, the <=50K class is about three times as frequent as the >50K class, so the dataset is imbalanced. To keep things simple, however, we treat it here as an ordinary classification problem.

EDA

Now, let's look at the distribution and interdependence of different features in the training data graphically. First, let's see how the Relationship and MaritalStatus features are related to each other.

(ggplot(df_train_set, aes(x="Relationship", fill="MaritalStatus"))
 + geom_bar(position="fill")
 + theme(axis_text_x=element_text(angle=60, hjust=1)))

Next, let's look at the effect of education (measured in years of schooling) on income across age groups.

(ggplot(df_train_set, aes(x="Education", fill="Income"))
 + geom_bar(position="fill")
 + theme(axis_text_x=element_text(angle=60, hjust=1))
 + facet_wrap('~AgeGroup'))

There has been a lot of discussion lately about the impact of gender on the income gap. We can look at the effect separately for men and women, by education level and by race.

(ggplot(df_train_set, aes(x="Education", fill="Income"))
 + geom_bar(position="fill")
 + theme(axis_text_x=element_text(angle=-90, hjust=1))
 + facet_wrap('~Sex'))

(ggplot(df_train_set, aes(x="Race", fill="Income"))
 + geom_bar(position="fill")
 + theme(axis_text_x=element_text(angle=-90, hjust=1))
 + facet_wrap('~Sex'))

So far we have only looked at the interdependence of non-numeric features. Now let's look at the relationship of CapitalGain and CapitalLoss with Income.

(ggplot(df_train_set, aes(x="Income", y="CapitalGain"))
 + geom_jitter(position=position_jitter(0.1)))

(ggplot(df_train_set, aes(x="Income", y="CapitalLoss"))
 + geom_jitter(position=position_jitter(0.1)))

Tree Classifier

Now that we understand some of the relationships in the data, we can create a simple tree classifier model using sklearn.tree.DecisionTreeClassifier. However, to use this model we first need to convert all non-numeric data into numeric data. This is easy to do directly on the Pandas data frame with the sklearn.preprocessing.LabelEncoder class and the sklearn_pandas module.
mapper = DataFrameMapper([
    ('AgeGroup', LabelEncoder()),
    ('Education', LabelEncoder()),
    ('Workclass', LabelEncoder()),
    ('MaritalStatus', LabelEncoder()),
    ('Occupation', LabelEncoder()),
    ('Relationship', LabelEncoder()),
    ('Race', LabelEncoder()),
    ('Sex', LabelEncoder()),
    ('Income', LabelEncoder())
], df_out=True, default=None)

cols = list(df_train_set.columns)
cols.remove("Income")
cols = cols[:-3] + ["Income"] + cols[-3:]

df_train = mapper.fit_transform(df_train_set.copy())
df_train.columns = cols
df_test = mapper.transform(df_test_set.copy())
df_test.columns = cols

cols.remove("Income")
x_train, y_train = df_train[cols].values, df_train["Income"].values
x_test, y_test = df_test[cols].values, df_test["Income"].values

Now that the training and test data are in the right form, we can create our first model:

treeClassifier = DecisionTreeClassifier()
treeClassifier.fit(x_train, y_train)
treeClassifier.score(x_test, y_test)

This simplest, completely untuned classifier model achieves an accuracy of 83.5%. For classification problems, the confusion matrix is a good way to assess model accuracy. Using the following code we can plot the confusion matrix of any tree-based model.

import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes, normalize=False):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cmap = plt.cm.Blues
    title = "Confusion Matrix"
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = np.around(cm, decimals=3)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Now we can look at the confusion matrix of the first model:

y_pred = treeClassifier.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10, 6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

We find that the accuracy for the majority class (<=50K) is 90.5%, while the accuracy for the minority class (>50K) is only 60.8%.

Let's look at ways to tune this simple classifier. We can use GridSearchCV() with 5-fold cross-validation to tune the important parameters of the tree classifier.

from sklearn.model_selection import GridSearchCV

After running the grid search (the fitted estimator is stored in clf), the accuracy increases to 85.9%, and the search also reports the parameters of the optimal model. Now let's look at the confusion matrix of the optimized model:

y_pred = clf.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10, 6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

After optimization, prediction accuracy improves for both classes.

Limitations of Decision Trees

Decision trees have many advantages: they are simple to understand and interpret, they require relatively little data preparation, and they can handle both numerical and categorical features.
But these models are rarely used directly in practice. Common flaws of decision trees include: they tend to overfit the training data and therefore have high variance; small changes in the data can produce a very different tree; the greedy splitting algorithm does not guarantee a globally optimal tree; and a single tree is generally not competitive in accuracy with the best supervised learning methods.
Most of these limitations can be addressed by improving on plain decision trees. Below we introduce several related concepts, focusing on bagging and random forests.

Pruning

Since decision trees tend to overfit the data, a smaller tree with fewer splits (i.e. fewer regions R_1, ..., R_J) has slightly higher bias but lower variance and better interpretability. One way to achieve this is to build a tree in which a split is only made if it reduces the node impurity Q_m by more than some (high) threshold. However, because of the greedy nature of the splitting algorithm, this is short-sighted: a seemingly useless split early in the tree may enable an excellent split later that reduces Q_m substantially. A better strategy is therefore to grow a very large tree T_0 and then prune it back to obtain a subtree.

There are many pruning strategies. Cost complexity pruning, also known as weakest link pruning, is one of the most effective. Rather than considering every possible subtree, it considers a sequence of trees indexed by a nonnegative tuning parameter α. For each value of α there is a subtree T ⊂ T_0 that minimizes

\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|,

where |T| is the number of leaf nodes of T, R_m is the rectangle (a subset of the predictor space) corresponding to the mth leaf node, and yhat_Rm is the predicted value in R_m, i.e. the mean response of the training observations in R_m (or the modal class for a classification tree). The tuning parameter α controls the trade-off between the complexity of the subtree and its fit to the training data. When α = 0 the subtree T is simply T_0. As α grows, having many leaf nodes becomes costly, so the expression above is minimized by smaller subtrees. The pruning parameter α can be chosen by cross-validation. Note that at the time of writing, the decision tree classifier (and regressor) in sklearn.tree did not support pruning; newer scikit-learn versions add cost-complexity pruning via the ccp_alpha parameter.

Bagging (Bootstrap Aggregating)

In statistics, the bootstrap is any procedure that relies on random sampling with replacement. As we saw above, decision trees suffer from high variance: if we randomly split the training data into two parts and fit a decision tree to each, the two trees can be quite different. Bootstrap aggregation, or bagging, is a general procedure for reducing the variance of a statistical learning method. Given a set of n independent observations Z_1, Z_2, ..., Z_n, each with variance σ^2, the variance of their mean is σ^2/n. In other words, averaging a set of observations reduces variance. So a natural way to reduce variance, and hence increase predictive accuracy, would be to draw many training sets from the population, build a separate model on each, and average the predictions. The problem is that we usually cannot obtain multiple training datasets. Instead, we bootstrap by drawing repeated samples with replacement from the (single) training dataset. In this approach we generate B different bootstrapped training datasets, fit a model on the bth bootstrapped training set to obtain a prediction, and then aggregate the B predictions. This is called bagging. Note that aggregation means something different for regression and for classification.
While averaging the predictions works well for regression problems, for classification we use a majority vote: the overall prediction is the class that occurs most often among the B predictions.

Out-of-Bag (OOB) Error

A major advantage of bagging is that the test error can be estimated without cross-validation. Recall that the essence of bagging is that many trees are repeatedly fit to bootstrapped subsets of the observations. On average, each bagged tree uses about 2/3 of the observations. The remaining 1/3 are called out-of-bag (OOB) observations; they are not used to fit that particular tree. We can compute a prediction for the ith observation from each tree for which that observation was out of bag, which yields roughly B/3 predictions for it. We then aggregate these exactly as in bagging (averaging for regression, majority vote for classification) to obtain a single OOB prediction for the ith observation. Doing this for all n observations gives the overall OOB MSE (for regression) or OOB classification error rate (for classification). The OOB error is a valid estimate of the test error of the bagged model, because each prediction is based only on trees that were not fit using that observation.

Feature Importance Metrics

Bagging generally improves prediction accuracy over a single tree, but the resulting model can be harder to interpret. When we bag a large number of trees, it is no longer possible to characterize the final model with a single tree, so bagging improves prediction accuracy at the expense of interpretability. Interestingly, one can still obtain an overall summary of the importance of each predictor using the RSS (for bagged regression trees) or the Gini index (for bagged classification trees). For bagged regression trees, we record the total reduction in RSS due to splits on a given predictor, averaged over all B trees; a large value indicates an important predictor. Similarly, for bagged classification trees, we sum the total reduction in the Gini index due to splits on a given predictor, averaged over all B trees. Once training is complete, the bagged tree learners in sklearn expose this feature importance information directly as attributes.

Random Forest Model

While bagging improves the predictive performance of decision trees by reducing variance, it has drawbacks of its own: bagging requires growing full trees on bootstrap samples, which multiplies the computational cost by a factor of B, and because the bagged trees are correlated with one another, the gain in predictive accuracy saturates as B grows. Random forests decorrelate the trees through random perturbation and therefore typically perform better than bagging. Unlike bagging, when building each tree a random forest considers only a random sample of the predictors at each split. Because random forests share the core idea of bagging, they also reduce variance. In addition, random forests can work with a very large number of predictors, not only because the approach reduces bias, but also because predictors that matter only for local features of the data get a chance to play a role in the tree structure. Random forests can use a huge number of predictors, even more than the number of observations.
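Before looking at random forests in more detail, here is a minimal sketch that makes bagging and the OOB estimate described above concrete, using sklearn's BaggingClassifier (which bags decision trees by default) on synthetic data; the dataset and parameter values are illustrative only.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic classification data (purely illustrative)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Bag B = 200 trees on bootstrap samples and keep the out-of-bag score
bag = BaggingClassifier(n_estimators=200, oob_score=True, bootstrap=True, random_state=0)
bag.fit(X, y)

print(bag.oob_score_)        # OOB estimate of accuracy, no cross-validation needed
print(bag.predict(X[:5]))    # aggregated (majority-vote) predictions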
A significant advantage of random forests is that, by considering many candidate predictors, they gather more information and reduce the bias of the fitted values and of the estimated splits. Usually a few predictors dominate the tree-fitting process because their average performance is consistently better than that of competing predictors, so many other predictors that capture local features of the data are never selected as split variables. Because a random forest fits enough decision trees, each predictor has at least a few chances to be the variable that defines a split. As a result, not only the dominant predictors but also the locally informative ones get to define splits of the dataset. There are three main hyperparameters to tune for a random forest: the number of trees (n_estimators), the number of predictors considered at each split (max_features), and the size of the individual trees (controlled, for example, by max_depth or min_samples_leaf).
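A minimal sketch of where these three knobs appear in scikit-learn; the values below are illustrative, not tuned.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees B
    max_features="sqrt",   # number of predictors sampled at each split
    min_samples_leaf=4,    # controls the size of each tree
    n_jobs=-1,             # grow trees in parallel
)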
Random Forest Model Example

Using the same income data as above, we now build a simple random forest classifier with 500 trees:

rclf = RandomForestClassifier(n_estimators=500)
rclf.fit(x_train, y_train)
rclf.score(x_test, y_test)

Even without any tuning, the model's test score of 85.1% is comparable to the optimized decision tree classifier. According to the confusion matrix below, the simple random forest and the optimized tree classifier perform similarly, achieving a prediction accuracy of 92.1% for the majority class (<=50K income) and 62.6% for the minority class (>50K income).

y_pred = rclf.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10, 6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

As discussed earlier, the random forest model also provides a measure of feature importance. The importances of the different features in the current model can be plotted as follows:

importances = rclf.feature_importances_
indices = np.argsort(importances)
feature_names = [cols[x] for x in indices]   # keep cols itself unchanged for later reuse
plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), feature_names)
plt.xlabel('Relative Importance')

Now we can try to optimize the random forest model using GridSearchCV() with 5-fold cross-validation:

parameters = {
    'n_estimators': (100, 500, 1000),
    'max_depth': (None, 24, 16),
    'min_samples_split': (2, 4, 8),
    'min_samples_leaf': (16, 4, 12)
}
clf = GridSearchCV(RandomForestClassifier(), parameters, cv=5, n_jobs=8)
clf.fit(x_train, y_train)
clf.best_score_, clf.best_params_

0.86606676699118579
{'max_depth': 24, 'min_samples_leaf': 4, 'min_samples_split': 4, 'n_estimators': 1000}

The tuned model is noticeably better than the previous one, with a prediction rate of 86.6%. According to the confusion matrix below, the new model significantly improves the prediction accuracy for the majority class, with only a slight drop in accuracy for the minority class. This is a common problem with imbalanced data.

rclf2 = RandomForestClassifier(n_estimators=1000, max_depth=24, min_samples_leaf=4, min_samples_split=4)
rclf2.fit(x_train, y_train)
y_pred = rclf2.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10, 6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

Finally, the feature importances for the optimized model are plotted below.

importances = rclf2.feature_importances_
indices = np.argsort(importances)
feature_names = [cols[x] for x in indices]
plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), feature_names)
plt.xlabel('Relative Importance')

Limitations of Random Forests

In addition to the general limitations of bagged tree models, random forests have some limitations of their own.
In general, random forests are less accurate than boosting methods on many tasks and take longer to run. Therefore, many models in Kaggle competitions use gradient boosted trees or other boosting methods.