How to prevent overfitting in decision trees

I was quite interested to learn that sklearn's decision tree implementation has several parameters that help prevent overfitting. Some of these parameters are min_samples_leaf and max_depth, which work together to prevent overfitting while the data is trained. Cost complexity pruning provides another option to control the size of a tree. This pruning technique is parameterised by the cost…

Decision trees are a non-parametric supervised learning method for both classification and regression tasks, so overfitting on the training dataset is a common problem. Given the architecture of the model itself, a tree that is allowed to train to its full strength will almost always overfit the training data. Fortunately, various techniques are available to avoid and prevent overfitting in decision trees. The following are some of the most commonly used:

Pruning

Decision tree models are usually allowed to grow to their maximum depth. As discussed above, this usually causes the model to overfit, which is undesirable. To prevent it we use pruning, which refers to removing parts of a tree so that it does not grow to its full depth. This is usually achieved by tuning hyperparameters to optimal values. Two types of pruning are used in decision trees:

  – Pre-Pruning

This technique refers to an early stopping mechanism: we do not allow the training process to run to completion, which prevents the model from overfitting. It involves tuning hyperparameters such as maximum depth, minimum samples per leaf, and minimum samples per split, whose values can be chosen to achieve early stopping. Sklearn's decision tree module has these arguments built in (max_depth, min_samples_leaf, min_samples_split), and they can easily be changed and fine-tuned in experiments to achieve optimal results.
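
For example, a minimal sketch of pre-pruning with sklearn's DecisionTreeClassifier might look like the following; the Iris dataset and the specific hyperparameter values are illustrative choices, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Constrain the tree's growth up front (early stopping); the exact
# values here are illustrative and would normally be tuned.
clf = DecisionTreeClassifier(
    max_depth=3,           # cap the depth of the tree
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must contain at least 5 samples
    random_state=42,
)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
```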

  – Post-Pruning

This technique allows the decision tree to grow to its full depth during training, and then removes branches to prevent the model from overfitting. Cost complexity pruning (CCP) is one of the most prominent post-pruning techniques. The ccp_alpha parameter controls the process: as the value of ccp_alpha increases, more nodes of the tree are pruned. The process is continued until we find the optimum value of ccp_alpha, beyond which the accuracy of the tree takes a significant nosedive on the holdout dataset.
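
A minimal sketch of cost complexity pruning with sklearn follows. It uses cost_complexity_pruning_path to obtain the effective alphas at which nodes would be pruned, refits a tree per alpha, and keeps the one that scores best on a holdout set; the breast cancer dataset and the use of plain accuracy are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas at which the fully grown tree would lose nodes.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    # Larger ccp_alpha means more aggressive pruning, i.e. a smaller tree.
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_test, y_test)  # accuracy on the holdout set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.5f}, holdout accuracy={best_score:.3f}")
```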

Ensemble – Random Forest:

Random forest is an ensemble of tree-based models used for both classification and regression. It trains multiple decision trees on bootstrapped samples of the data and aggregates their predictions (bagging), which reduces variance and helps prevent the overfitting that a single fully grown tree is prone to.
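
A minimal sketch with sklearn's RandomForestClassifier; the dataset and the number of trees are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is fit on a bootstrap sample of the training data and
# considers a random subset of features at each split; predictions
# are aggregated by majority vote, which lowers variance compared
# to a single deep tree.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```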
