From decision trees to random forests: the principle and implementation of tree-based algorithms

In this article, we will cover the mathematical details of decision trees (along with various Python examples) and their strengths and weaknesses. You will find that they are simple and easy to understand, but they are generally not competitive with the best supervised learning methods. To overcome the shortcomings of decision trees, we will then turn to concepts such as bootstrap aggregating (bagging) and random forests, again with Python examples. Another widely used method, boosting, will be discussed separately later. Each of these methods grows multiple trees that are combined into a single consensus prediction, which often leads to a significant improvement in prediction accuracy.

Decision Tree

A decision tree is a supervised learning algorithm. It works with both categorical and continuous input (predictor) and output (response) variables. Tree-based methods divide the feature space into a set of rectangles and then fit a simple model (such as a constant) in each one. Conceptually they are simple yet effective. Let's first get a feel for decision trees with an example, and then analyze the process of building them more formally. Consider a simple dataset of customers of a lending company: for every customer we know the checking account balance, credit history, years of tenure, and the status of previous loans. The task is to predict whether a customer is a good credit risk. This problem can be solved with a decision tree that splits on these features one at a time.

Classification and regression trees (CART for short) is a term introduced by Leo Breiman to refer to decision tree algorithms used for classification or regression predictive modeling problems. In scikit-learn, CART-style trees are implemented by sklearn.tree.DecisionTreeClassifier and sklearn.tree.DecisionTreeRegressor, which build classification and regression trees respectively.

CART Model

The CART model involves selecting input variables and split points on those variables until an appropriate tree is created. A greedy algorithm is used to choose which input variables and split points to use so as to minimize a cost function.

Tree construction ends when a predefined stopping criterion is met, such as a minimum number of training samples assigned to each leaf node of the tree.

Other decision tree algorithms:

  • ID3: Iterative Dichotomiser 3
  • C4.5: Improvement of ID3 algorithm
  • CHAID: Chi-squared Automatic Interaction Detector
  • MARS: An extension of decision trees to better handle numerical predictions.
  • Conditional Inference Tree

Regression Tree

Let's now look at the details of the CART algorithm for regression trees. In short, creating a decision tree consists of two steps:

1. Divide the predictor space, i.e. the set of possible values X_1, X_2, ..., X_p, into J distinct and non-overlapping regions R_1, R_2, ..., R_J.

2. Make the same prediction for every observation that falls into region R_j, namely the mean of the response values of the training observations in R_j.

To create the J regions R_1, R_2, ..., R_J, the predictor space is divided into high-dimensional rectangles, or boxes. The goal is to find the boxes R_1, R_2, ..., R_J that minimize the RSS, given by:
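RSS = Σ_{j=1}^{J} Σ_{i ∈ R_j} (y_i − yhat_Rj)^2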

where yhat_Rj is the mean response of the training observations in the jth box.

Since considering every possible partition of the space is computationally infeasible, we use a greedy approach to partition the regions, called recursive binary splitting.

It is greedy because at each step of tree construction the best split is chosen at that particular step, rather than looking ahead and picking a split that would lead to a better tree in some future step. Note that all partition regions R_j are rectangular. To perform recursive binary splitting, we first select a predictor X_j and a cutpoint s that split the predictor space into the half-planes R_1(j, s) = {X | X_j < s} and R_2(j, s) = {X | X_j >= s}, and we seek the values of j and s that minimize:
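Σ_{i: x_i ∈ R_1(j,s)} (y_i − yhat_R1)^2 + Σ_{i: x_i ∈ R_2(j,s)} (y_i − yhat_R2)^2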

Where yhat_R1 is the mean response of the training observations in region R_1(j,s), and yhat_R2 is the mean response of the training observations in region R_2(j,s). This process is repeated, searching for the best predictor and cutpoint to further split the data so as to minimize the RSS within each resulting region. However, instead of splitting the entire predictor space at each step, we split only one of the two previously identified regions. This continues until a stopping criterion is reached; for example, we can stop once no region contains more than m observations. Once the regions R_1, R_2, ..., R_J have been created, we predict the response of a test sample using the mean of the training responses in the region to which it belongs.
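Before moving on, here is a minimal scikit-learn sketch of these ideas; the toy one-dimensional data below is made up purely for illustration and is unrelated to the dataset used later in this article.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem: a noisy sine wave (hypothetical data for illustration)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# Stopping criterion: each leaf must contain at least m = 10 observations
reg_tree = DecisionTreeRegressor(min_samples_leaf=10)
reg_tree.fit(X, y)

# The prediction for a new point is the mean response of the training
# observations that fall into the same leaf (region R_j)
print(reg_tree.predict([[2.5]]))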

Classification Tree

Classification trees are very similar to regression trees, except that they predict a qualitative response rather than a quantitative one. As seen above, a regression tree predicts, for each observation, the mean response of the training observations belonging to the same leaf node. For a classification tree, in contrast, the predicted class is the most common class among the training observations in a region, that is, the mode of the training responses. In practice, a classification tree often reports not just a single class but also the class proportions (probabilities) within each region.

The generation of classification trees is very similar to that of regression trees. As with regression trees, we generally use recursive binary splitting to grow a classification tree. However, RSS cannot serve as the splitting criterion for a classification tree. Instead, we define a node impurity measure Q_m that quantifies the homogeneity of the target variable within the regions R_1, R_2, ..., R_J. For node m, representing a region R_m with N_m sample observations, the proportion of the kth class among the observations in the mth region can be written as:
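phat_mk = (1 / N_m) Σ_{x_i ∈ R_m} I(y_i = k)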

Where I(y_i=k) is the indicator function, which takes the value 1 if y_i = k and takes the value 0 otherwise.

A natural way to measure the impurity Q_m is the classification error rate, i.e. the fraction of training observations in the region that do not belong to the most common class:
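E = 1 − max_k (phat_mk)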

Because this function is not differentiable, it is hard to optimize numerically. It is also insensitive to changes in the node probabilities, which makes the classification error rate fairly ineffective for growing trees. We therefore generally use the Gini index or the cross entropy to measure node impurity.

The Gini index measures the total variance across the K classes and is generally defined as:
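G = Σ_{k=1}^{K} phat_mk (1 − phat_mk)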

A smaller Gini index indicates that the node is dominated by observations from a single class.

In information theory, cross entropy measures the disorder of a system. For a binary system, its value is zero if all observations belong to one class, and it reaches its maximum of 1 when the two classes are equally frequent. Like the Gini index, cross entropy can therefore be used to measure the impurity of a node:
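S = − Σ_{k=1}^{K} phat_mk log_2(phat_mk)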

As with G, a smaller value of S indicates that the node contains observations predominantly from a single class.
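To make these impurity measures concrete, the following small sketch (not part of the original article's code) computes the class proportions, classification error rate, Gini index, and cross entropy for the labels falling in a single node:

import numpy as np

def node_impurity(y):
    """Compute impurity measures for the labels y falling in one node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()              # class proportions phat_mk
    error = 1 - p.max()                    # classification error rate E
    gini = np.sum(p * (1 - p))             # Gini index G
    entropy = -np.sum(p * np.log2(p))      # cross entropy S (log base 2)
    return error, gini, entropy

# A perfectly mixed binary node has maximal impurity (entropy = 1)
print(node_impurity(np.array([0, 0, 1, 1])))
# A pure node has zero impurity
print(node_impurity(np.array([1, 1, 1, 1])))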

Common parameters and concepts of decision trees

If we want to understand decision trees mathematically, we first need to understand the general concepts and terminology of tree learning algorithms. Understanding the following terms will also help us tune the model; a short scikit-learn mapping of these settings is sketched after the list.

  • Root node: A parent node that represents all data samples and can be further divided into two or more child nodes.
  • Splitting: The process of dividing a node into two or more sub-nodes.
  • Decision node: When a sub-node can be further split into multiple sub-nodes, then the node is called a decision node.
  • Leaf/Terminal Node: A node that will not be split further down and represents a category in a classification tree.
  • Branch/Subtree: A part of the entire decision tree.
  • Parent node and child node: If a node splits downward, the node is called the parent node and the node split from the parent node is called the child node.
  • Minimum number of samples for node splitting: The minimum number of samples (or observations) required to split a node. This is commonly used to control overfitting: a larger value prevents the model from learning relationships that are too specific to particular samples. This hyperparameter should be tuned on a validation set.
  • Minimum number of leaf node samples: The minimum number of samples required in a leaf node. Like the minimum number of samples for splitting, this hyperparameter can be used to control overfitting. For imbalanced class problems we should choose a smaller value, because the number of samples belonging to the minority class may be very small.
  • Maximum depth of the tree (vertical depth): This hyperparameter can also be used to control overfitting. A smaller depth prevents the model from learning relationships that are too specific to particular samples. This hyperparameter also needs to be tuned on a validation set.
  • Maximum number of leaf nodes: The maximum number of leaf nodes can be used in place of the maximum depth of the tree, since a binary tree of depth n can have at most 2^n leaf nodes.
  • Maximum number of features to consider for splitting: the number of features examined when searching for the best split. A common choice is the square root of the total number of available features.
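Assuming scikit-learn is used, as in the examples below, these concepts map roughly onto the constructor parameters of DecisionTreeClassifier; the specific values here are arbitrary placeholders rather than recommendations:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical settings, only to show which argument corresponds to which concept
tree = DecisionTreeClassifier(
    min_samples_split=8,   # minimum number of samples for node splitting
    min_samples_leaf=4,    # minimum number of leaf node samples
    max_depth=16,          # maximum depth of the tree
    max_leaf_nodes=None,   # maximum number of leaf nodes (alternative to max_depth)
    max_features='sqrt'    # number of features considered at each split
)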

Implementation of classification tree

To demonstrate the different decision tree models mentioned above, we will use the US adult income dataset, which can be downloaded from Kaggle.com. The following code imports the libraries and reads the dataset:

import pandas as pd
import numpy as np
from plotnine import *
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

training_data = './adult-training.csv'
test_data = './adult-test.csv'

columns = ['Age','Workclass','fnlgwt','Education','EdNum','MaritalStatus','Occupation','Relationship','Race','Sex','CapitalGain','CapitalLoss','HoursPerWeek','Country','Income']

df_train_set = pd.read_csv(training_data, names=columns)
df_test_set = pd.read_csv(test_data, names=columns, skiprows=1)
df_train_set.drop('fnlgwt', axis=1, inplace=True)
df_test_set.drop('fnlgwt', axis=1, inplace=True)

In the above code, we first import all the required libraries and modules, and then read the data into training and test sets. We also remove the fnlgwt column because it is not important for training the model.

Enter the following statement to see the first five rows of training data:

df_train_set.head()

Looking at the output, we still need to do some data cleaning: the placeholder " ?" value should be replaced, and any spaces and "." characters should be stripped from the string columns.

# Replace the special character ' ?' with 'Unknown'
for i in df_train_set.columns:
    df_train_set[i].replace(' ?', 'Unknown', inplace=True)
    df_test_set[i].replace(' ?', 'Unknown', inplace=True)

# Strip spaces and '.' characters from all non-numeric columns
for col in df_train_set.columns:
    if df_train_set[col].dtype != 'int64':
        df_train_set[col] = df_train_set[col].apply(lambda val: val.replace(" ", ""))
        df_train_set[col] = df_train_set[col].apply(lambda val: val.replace(".", ""))
        df_test_set[col] = df_test_set[col].apply(lambda val: val.replace(" ", ""))
        df_test_set[col] = df_test_set[col].apply(lambda val: val.replace(".", ""))

As the output shows, there are two columns describing an individual's education: Education and EdNum. These two features are highly correlated, so we can remove the Education column. The Country column does not play a significant role in predicting income, so we remove it as well.

df_train_set.drop(["Country", "Education"], axis=1, inplace=True)
df_test_set.drop(["Country", "Education"], axis=1, inplace=True)

The Age and EdNum columns are numeric. We can bin these continuous values into more compact categories, for example grouping age into 10-year bins and years of education into 5-year bins. The implementation code is as follows:

colnames = list(df_train_set.columns)
colnames.remove('Age')
colnames.remove('EdNum')
colnames = ['AgeGroup', 'Education'] + colnames

labels = ["{0}-{1}".format(i, i + 9) for i in range(0, 100, 10)]
df_train_set['AgeGroup'] = pd.cut(df_train_set.Age, range(0, 101, 10), right=False, labels=labels)
df_test_set['AgeGroup'] = pd.cut(df_test_set.Age, range(0, 101, 10), right=False, labels=labels)

labels = ["{0}-{1}".format(i, i + 4) for i in range(0, 20, 5)]
df_train_set['Education'] = pd.cut(df_train_set.EdNum, range(0, 21, 5), right=False, labels=labels)
df_test_set['Education'] = pd.cut(df_test_set.EdNum, range(0, 21, 5), right=False, labels=labels)

df_train_set = df_train_set[colnames]
df_test_set = df_test_set[colnames]

Now that we have cleaned the data, the following statements show the class distribution of the target variable:

df_train_set.Income.value_counts()
<=50K 24720
>50K 7841
Name: Income, dtype: int64
df_test_set.Income.value_counts()
<=50K 12435
>50K 3846
Name: Income, dtype: int64

In both the training and test sets, the <=50K class is about three times as large as the >50K class, so the sample data is imbalanced. To keep things simple, however, we treat the dataset as an ordinary classification problem here.

EDA

Now, let's take a look at the distribution and inter-dependence of different features in the training data in the form of graphs. First, let's look at how the Relationship and MaritalStatus features are related to each other.

(ggplot(df_train_set, aes(x = "Relationship", fill = "MaritalStatus"))+ geom_bar(position="fill")+ theme(axis_text_x = element_text(angle = 60, hjust = 1)))

Next, let's look at the effect of education (measured as years of schooling) on income across age groups.

(ggplot(df_train_set, aes(x = "Education", fill = "Income"))+ geom_bar(position="fill")+ theme(axis_text_x = element_text(angle = 60, hjust = 1))+ facet_wrap('~AgeGroup'))

There has been a lot of talk lately about the impact of gender on the income gap. We can see the impact separately for men and women by education level and by race.

(ggplot(df_train_set, aes(x = "Education", fill = "Income"))+ geom_bar(position="fill")+ theme(axis_text_x = element_text(angle = -90, hjust = 1))+ facet_wrap('~Sex'))

(ggplot(df_train_set, aes(x = "Race", fill = "Income"))+ geom_bar(position="fill")+ theme(axis_text_x = element_text(angle = -90, hjust = 1))+ facet_wrap('~Sex'))

Until now, we have only looked at the correlation of non-numeric features. Now let’s look at the impact of Capital Gain and Capital Loss on Income.

(ggplot(df_train_set, aes(x="Income", y="CapitalGain"))+ geom_jitter(position=position_jitter(0.1)))

(ggplot(df_train_set, aes(x="Income", y="CapitalLoss"))+ geom_jitter(position=position_jitter(0.1)))

Tree Classifier

Now that we understand some of the relationships in our data, we can create a simple tree classifier model using sklearn.tree.DecisionTreeClassifier. However, in order to use this model, we need to convert all of our non-numeric data into numeric data. We can easily do this directly in the Pandas data frame using the sklearn.preprocessing.LabelEncoder module and the sklearn_pandas module.

mapper = DataFrameMapper([
    ('AgeGroup', LabelEncoder()),
    ('Education', LabelEncoder()),
    ('Workclass', LabelEncoder()),
    ('MaritalStatus', LabelEncoder()),
    ('Occupation', LabelEncoder()),
    ('Relationship', LabelEncoder()),
    ('Race', LabelEncoder()),
    ('Sex', LabelEncoder()),
    ('Income', LabelEncoder())
], df_out=True, default=None)

cols = list(df_train_set.columns)
cols.remove("Income")
cols = cols[:-3] + ["Income"] + cols[-3:]

df_train = mapper.fit_transform(df_train_set.copy())
df_train.columns = cols

df_test = mapper.transform(df_test_set.copy())
df_test.columns = cols

cols.remove("Income")
x_train, y_train = df_train[cols].values, df_train["Income"].values
x_test, y_test = df_test[cols].values, df_test["Income"].values

Now that we have our training and test data in the correct form, let's create our first model:

treeClassifier = DecisionTreeClassifier()
treeClassifier.fit(x_train, y_train)
treeClassifier.score(x_test, y_test)

This simplest, unoptimized tree classifier achieves about 83.5% accuracy. In classification problems, the confusion matrix is a good way to measure the performance of the model. Using the following code we can plot the confusion matrix of any tree-based model.

import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes, normalize=False):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cmap = plt.cm.Blues
    title = "Confusion Matrix"
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = np.around(cm, decimals=3)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Now, we can see the confusion matrix for the first model:

y_pred = treeClassifier.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

We found that the accuracy for the majority class (<=50K) was 90.5%, while the accuracy for the minority class (>50K) was only 60.8%.

Let's look at ways to tune this simple classifier. We can use GridSearchCV() with 5-fold cross validation to tune various important parameters of the tree classifier.

from sklearn.model_selection import GridSearchCV
parameters = {'max_features':(None, 9, 6),'max_depth':(None, 24, 16),'min_samples_split': (2, 4, 8),'min_samples_leaf': (16, 4, 12)}

clf = GridSearchCV(treeClassifier, parameters, cv=5, n_jobs=4)
clf.fit(x_train, y_train)
clf.best_score_, clf.score(x_test, y_test), clf.best_params_
(0.85934092933263717,
0.85897672133161351,
{'max_depth': 16,
'max_features': 9,
'min_samples_leaf': 16,
'min_samples_split': 8})

After optimization, we found that the accuracy increased to 85.9%. Above, we can also see the parameters of the optimal model. Now, let's take a look at the confusion matrix of the optimized model:

y_pred = clf.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

After optimization, we found that the prediction accuracy was improved in both categories.

Limitations of Decision Trees

Decision trees have many advantages, such as:

  • Easy to understand and explain
  • Visualization
  • No extensive data preparation is required. However, please note that the sklearn.tree module does not support missing values.
  • The cost of making a prediction with a tree is logarithmic in the number of training samples.

But these models are often not used directly. Some common flaws of decision trees are:

  • Trees can grow overly complex and fail to generalize well to new data (overfitting).
  • A small change in the data may result in a completely different tree being generated, so decision trees are not stable enough.
  • In practice, decision tree learning algorithms are usually based on heuristic algorithms, such as greedy algorithms, which make local optimal decisions at each node. Such algorithms cannot guarantee the return of the global optimal decision tree.
  • If some classes dominate, the decision tree built by the decision tree learner will be biased. Therefore, it is recommended to balance the dataset before fitting it with a decision tree.
  • Certain classes of functions are difficult to model using decision tree models, such as XOR, parity, and multiplexers.

Most of these limitations can be easily addressed by improving decision trees. In the following, we will introduce several related concepts, focusing on bagging and random forests.

Pruning

Since decision trees tend to overfit the data, a smaller tree with fewer splits (i.e. fewer regions R_1, ..., R_J) has slightly higher bias but lower variance and better interpretability. One approach is to grow the tree only as long as each split decreases the leaf node impurity Q_m by more than some (high) threshold. However, because the splitting algorithm is greedy, this is short-sighted: a seemingly useless split early in the tree may enable an excellent split later on that reduces Q_m substantially.

Therefore, a better strategy is to grow a very large tree T_0 and then prune it back to obtain a subtree. There are many pruning strategies; cost complexity pruning, also known as weakest link pruning, is one of the most effective. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α. Each value of α corresponds to a subtree T ⊂ T_0 for which the following quantity is as small as possible:
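Σ_{m=1}^{|T|} Σ_{i: x_i ∈ R_m} (y_i − yhat_Rm)^2 + α |T|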

Here |T| denotes the number of leaf nodes of the tree T, R_m is the rectangle (a subset of the predictor space) corresponding to the mth leaf node, and yhat_Rm is the prediction for R_m, i.e. the mean of the training responses in R_m (or the mode of the training responses for a classification tree). The tuning parameter α controls the trade-off between the complexity of the subtree and its fit to the training data. When α = 0, the subtree T is simply T_0. As α grows, having many leaf nodes becomes costly, so the quantity above is minimized by a smaller subtree. We can choose the pruning parameter α by cross-validation.

Note that older versions of sklearn.tree decision tree classifiers (and regressors) did not support pruning; since scikit-learn 0.22, cost complexity pruning is available through the ccp_alpha parameter.
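As a rough sketch of how this looks in those newer scikit-learn versions (assuming the x_train, y_train and x_test, y_test arrays prepared earlier in this article), cost_complexity_pruning_path computes the sequence of effective α values and ccp_alpha fits the corresponding pruned tree:

from sklearn.tree import DecisionTreeClassifier

# Effective alphas of the cost complexity pruning sequence for a fully grown tree
path = DecisionTreeClassifier().cost_complexity_pruning_path(x_train, y_train)

# Fit a pruned tree for a handful of alphas; in practice alpha is chosen by cross-validation
for alpha in path.ccp_alphas[::max(1, len(path.ccp_alphas) // 5)]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(x_train, y_train)
    print(alpha, pruned.get_n_leaves(), pruned.score(x_test, y_test))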

Bagging (Bootstrap Aggregating)

In statistics, bootstrap is any experiment or measure that relies on random sampling with replacement. We have seen above that decision trees suffer from high variance. This means that if we randomly split the training data into two parts and fit a decision tree to both, the results we get may be quite different. Bootstrap aggregation, or bagging, is a general procedure for reducing the variance of statistical learning methods.

Given a set of n independent sample observations Z_1, Z_2, ..., Z_n, each with variance σ^2, the variance of the mean of the sample observations is σ^2/n. In other words, averaging a set of observations reduces the variance. So a natural way to reduce the variance, and thus increase the predictive accuracy of a statistical learning method, is to take many training sets from the population, build a separate predictive model from each, and average the resulting predictions.
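A quick numerical sanity check of the σ^2/n claim (a throwaway simulation, unrelated to the income data):

import numpy as np

rng = np.random.RandomState(0)
sigma2, n = 4.0, 25

# 100,000 simulated "training sets" of n observations each, with variance sigma^2
samples = rng.normal(0, np.sqrt(sigma2), size=(100000, n))

print(samples[:, 0].var())         # variance of single observations, close to 4
print(samples.mean(axis=1).var())  # variance of the mean, close to 4 / 25 = 0.16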

The problem is that we usually cannot obtain multiple training datasets. Instead, we bootstrap by taking repeated samples with replacement from the (single) training dataset. We generate B different bootstrapped training datasets, train the method on the bth bootstrapped set to obtain a prediction, and then aggregate all B predictions.

This is called bagging. Note that the aggregation step differs between regression and classification. Averaging the predictions works well for regression; for classification we instead take a majority vote: the overall prediction is the class that occurs most often among the B predictions.
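scikit-learn packages this procedure as sklearn.ensemble.BaggingClassifier (and BaggingRegressor). A minimal sketch on the income data prepared earlier, with B = 100 bootstrapped trees combined by majority vote, might look like this:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# B = 100 trees, each fit to a bootstrap sample of the training data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, n_jobs=-1)
bag.fit(x_train, y_train)
print(bag.score(x_test, y_test))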

Out-of-Bag (OOB) Error

The biggest advantage of bagging is that we can estimate the test error without cross-validation. Recall that the essence of bagging is to repeatedly fit trees to bootstrapped subsets of the observations. On average, each bagged tree uses about 2/3 of the observations; the remaining 1/3, called out-of-bag (OOB) observations, are not used to fit that tree. For the ith observation we can compute a prediction from each tree for which it was out-of-bag, giving roughly B/3 predictions for that observation. We then aggregate these as in bagging (averaging for regression, majority vote for classification) to obtain a single OOB prediction for the ith observation. Doing this for all n observations yields the overall OOB MSE (for regression problems) or classification error rate (for classification problems). The OOB error is a valid estimate of the test error of the bagged model, because each prediction is based only on trees that were not fit using that observation.
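In scikit-learn, the OOB estimate can be requested directly by passing oob_score=True to the bagged estimator; a brief sketch, again assuming the x_train and y_train arrays from earlier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# oob_score=True evaluates each training observation only with the trees
# that did not see it in their bootstrap sample
bag_oob = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            oob_score=True, n_jobs=-1)
bag_oob.fit(x_train, y_train)
print(bag_oob.oob_score_)   # OOB estimate of accuracy, no cross-validation required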

Feature Importance Metrics

Bagging generally improves prediction accuracy over a single tree. However, interpreting the resulting model can be difficult: once we bag a large number of trees, the learned model can no longer be represented by a single tree, so bagging improves accuracy at the expense of interpretability. Fortunately, we can still obtain an overall summary of the importance of each predictor using the RSS (for bagged regression trees) or the Gini index (for bagged classification trees). For bagged regression trees, we record the total reduction in RSS due to splits on a given predictor, averaged over all B trees; a large value indicates an important predictor. Similarly, for bagged classification trees we record the total reduction in the Gini index due to splits on a given predictor, averaged over all B trees. Once training is complete, sklearn's bagged tree learners expose this feature importance data directly as an attribute.

Random Forest Model

While bagging improves on plain decision trees by reducing variance, it has other drawbacks: bagging requires growing a full tree on every bootstrap sample, which multiplies the computational cost by a factor of B. Moreover, because the bagged trees are correlated, the gain in predictive accuracy saturates as B grows.

Random forests decorrelate the trees through random perturbation, and therefore tend to perform better than bagging. Unlike bagging, when building each tree a random forest considers only a random sample of the predictors at each split. Because the core idea of random forests is the same as that of bagged trees, they also enjoy a reduction in variance. In addition, random forests can make use of a very large number of predictors, not only because this reduces bias, but also because predictors that matter only for local features of the data get a chance to play a role in the tree structure.

Random forests can use a huge number of predictors, even more than the number of observations. The most significant advantage of using random forests is that they can exploit more information to reduce the bias of the fitted values and of the estimated splits.

Usually a few dominant predictors drive the fitting of a single decision tree, because their average performance is consistently better than that of competing predictors, so many other predictors that capture local structure in the data are never selected as split variables. Because a random forest grows enough trees, every predictor gets at least a few chances to define a split. As a result, not only the dominant predictors but also the more local feature predictors get to shape the partition of the dataset.

There are three main hyperparameters to tune for random forests (a scikit-learn mapping is sketched after the list):

  • Node size: Unlike single decision trees, random forests may contain very few observations per leaf node. The goal of this hyperparameter is to keep the bias as small as possible when growing the trees.
  • Number of trees: In practice a few hundred trees is generally a good choice.
  • Number of sampled predictors: Generally speaking, if we have a total of D predictors, we can use D/3 predictors as the sample number for regression tasks and D^(1/2) predictors as the sample number for classification tasks.
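Assuming scikit-learn's RandomForestClassifier and RandomForestRegressor, these three knobs correspond roughly to min_samples_leaf, n_estimators, and max_features; the values below are placeholders for illustration only:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: sample about sqrt(D) predictors at each split
rf_clf = RandomForestClassifier(n_estimators=500, max_features='sqrt',
                                min_samples_leaf=1)

# Regression: sample about D/3 predictors at each split (given as a fraction)
rf_reg = RandomForestRegressor(n_estimators=500, max_features=1/3,
                               min_samples_leaf=5)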

Random Forest Model Example

Using the same income data as above, we now build a simple random forest classifier model with 500 trees:

rclf = RandomForestClassifier(n_estimators=500)
rclf.fit(x_train, y_train)
rclf.score(x_test, y_test)

Even without any optimization, we find that the model performance is comparable to the optimized decision tree classifier, with a test score of 85.1%. According to the confusion matrix below, the simple random forest and the optimized tree classifier perform similarly, achieving a prediction accuracy of 92.1% for the majority class (<=50K income) and 62.6% for the minority class (>50K income).

y_pred = rclf.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

As discussed earlier, the Random Forest model also provides a measure of feature importance. We can see the importance of different features in the current model in the figure below:

importances = rclf.feature_importances_
indices = np.argsort(importances)
cols = [cols[x] for x in indices]
plt.figure(figsize=(10,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), cols)
plt.xlabel('Relative Importance')

Now we can try to optimize our random forest model. We can use the GridSearchCV() operation with 5-fold cross validation to optimize the random forest as follows:

parameters = {'n_estimators':(100, 500, 1000),'max_depth':(None, 24, 16),'min_samples_split': (2, 4, 8),'min_samples_leaf': (16, 4, 12)}

clf = GridSearchCV(RandomForestClassifier(), parameters, cv=5, n_jobs=8)
clf.fit(x_train, y_train)
clf.best_score_, clf.best_params_
0.86606676699118579
{'max_depth': 24,
 'min_samples_leaf': 4,
 'min_samples_split': 4,
 'n_estimators': 1000}

We can see that the current model is significantly better than the previous one, with an accuracy of about 86.6%. According to the confusion matrix below, the new model noticeably improves the prediction accuracy for the majority class, while only slightly reducing the accuracy for the minority class. This is a common issue with imbalanced data.

rclf2 = RandomForestClassifier(n_estimators=1000, max_depth=24, min_samples_leaf=4, min_samples_split=4)
rclf2.fit(x_train, y_train)

y_pred = rclf2.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

Finally, the features that are important to the optimized model are shown below.

importances = rclf2.feature_importances_
indices = np.argsort(importances)
cols = [cols[x] for x in indices]
plt.figure(figsize=(10,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), cols)
plt.xlabel('Relative Importance')

Limitations of Random Forests

In addition to the general limitations of bagging tree models, random forests have some limitations:

  • When we need to extrapolate beyond the range of the training data for the independent or dependent variables, random forests do not do well; an algorithm such as MARS is a better choice.
  • The random forest algorithm is slow both in training and prediction.
  • If there are many categories to be distinguished, random forest will not perform well.

On many tasks, random forests are less accurate than boosting methods and take longer to run. This is why many winning models in Kaggle competitions use gradient boosted trees or other boosting methods.
