Big Data 201 - Decision Tree from Split to Pruning
Decision Tree
A decision tree is a tree-structured supervised learning model, commonly used for classification and regression tasks. Its basic idea is to partition the data into smaller and smaller subsets through a series of hierarchical questions, until the desired target is reached.
Local Optimum
When building a decision tree, a greedy algorithm is usually used as the core splitting strategy. The basic principle is: at each decision step, select the best split under the current conditions, without considering the impact this choice may have on subsequent steps. Each split is therefore only locally optimal, and the resulting tree is not guaranteed to be globally optimal.
Specifically, building a decision tree proceeds recursively:
- At the current node, compute the splitting criterion (e.g., information gain or Gini impurity) for all candidate features and split points
- Select the feature and split point that maximize the criterion
- Partition the dataset into subsets according to the selected split
- Recursively repeat the above process on each subset until a stopping condition is met
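The greedy step above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes numeric features, binary splits at midpoints between adjacent values, and entropy-based information gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_idx, threshold):
    """Gain from splitting rows on feature_idx <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature_idx] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature_idx] > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - child

def best_split(rows, labels):
    """Greedy step: scan all features and thresholds, keep the best gain."""
    best = (None, None, 0.0)  # (feature index, threshold, gain)
    for f in range(len(rows[0])):
        values = sorted(set(x[f] for x in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint candidate
            g = information_gain(rows, labels, f, t)
            if g > best[2]:
                best = (f, t, g)
    return best

# Toy data: feature 0 separates the classes; feature 1 is noise.
X = [[1, 7], [2, 3], [8, 5], [9, 1]]
y = [0, 0, 1, 1]
feature, threshold, gain = best_split(X, y)
```

Growing a full tree is then just calling `best_split` recursively on each resulting subset until a stopping condition is met.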
Pruning
Pruning is a technique for preventing a decision tree from overfitting. The two common approaches are pre-pruning and post-pruning.
- Pre-pruning: while the tree is being grown, estimate before each split whether splitting the current node would improve the tree's generalization performance; if not, stop splitting
- Post-pruning: first grow a complete tree, then examine the non-leaf nodes bottom-up; if replacing a node's subtree with a leaf would improve the tree's generalization ability, perform the replacement
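Both approaches are available in scikit-learn (assuming it is installed): pre-pruning via growth constraints such as `max_depth` and `min_samples_leaf`, and post-pruning via cost-complexity pruning (`ccp_alpha`). A sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: constrain the tree while it is being grown.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow a full tree, then collapse subtrees via
# cost-complexity pruning (larger ccp_alpha -> more aggressive pruning).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # pick a mid-range alpha
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
```

In practice `ccp_alpha` is chosen by cross-validating over the values returned in `path.ccp_alphas`; the mid-range pick here is just for illustration.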
Split
Splitting is the core step of decision tree construction: starting from the root node, the data are divided into child nodes according to the values of a chosen feature.
Binary Split
A binary split divides a node into exactly two child nodes each time, so the resulting model is a binary tree. Many decision tree algorithms (such as CART) are built on binary splits.
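Since scikit-learn's trees implement CART, the binary structure can be checked directly on a fitted model: in the `tree_` attribute, `children_left` and `children_right` are either both -1 (a leaf) or both valid node ids (an internal node with exactly two children).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

# Every node is either a leaf (both children -1) or has exactly two children.
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    assert (left == -1) == (right == -1)
```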
Correcting the Splitting Criterion
- Using information gain to choose splits is biased toward features with many distinct values
- Using the information gain ratio (as in C4.5) corrects this bias
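The correction can be shown in a few lines. The gain ratio divides information gain by the split information (the entropy of the feature's own value distribution), which grows with the number of distinct values. This toy example contrasts an ID-like feature (unique value per row) with a genuinely useful binary feature:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """C4.5 gain ratio: information gain divided by split information."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature)  # grows with the number of distinct values
    return gain / split_info if split_info > 0 else 0.0

labels  = [0, 0, 1, 1]
id_like = [1, 2, 3, 4]          # unique value per row: maximal raw gain
useful  = ['a', 'a', 'b', 'b']  # two values that actually separate classes
```

Both features have raw information gain 1.0 here, but the ID-like feature's split information is 2 bits versus 1 bit for the useful feature, so its gain ratio is halved.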
Continuous Variable Processing
In C4.5, if an input feature is continuous, the algorithm first sorts its values in ascending order, then takes the midpoint between each pair of adjacent values as a candidate split point. A continuous variable with N distinct values therefore yields N-1 candidate split points.
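A minimal sketch of this candidate generation:

```python
def candidate_split_points(values):
    """C4.5-style candidates: sort the distinct values ascending, then take
    the midpoint of each adjacent pair -> N-1 candidates for N values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

points = candidate_split_points([3.0, 1.0, 4.0, 2.0])
# 4 distinct values -> 3 candidate midpoints: [1.5, 2.5, 3.5]
```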
Split Criteria
- Classification trees: the Gini criterion; like information gain, Gini impurity measures how mixed the labels at a node are
- Regression trees: a common criterion is Standard Deviation Reduction (SDR), closely related to the least-squares (LS) criterion
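Both criteria are simple to compute; the following sketch uses the standard definitions (Gini impurity as 1 minus the sum of squared class proportions, SDR as the parent's standard deviation minus the weighted standard deviation of the children):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2); 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def sdr(parent, children):
    """Standard Deviation Reduction for a regression split:
    parent std minus the size-weighted std of the child nodes."""
    def std(xs):
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    n = len(parent)
    return std(parent) - sum(len(c) / n * std(c) for c in children)

g = gini([0, 0, 1, 1])                    # 0.5: a perfectly mixed binary node
r = sdr([1, 1, 9, 9], [[1, 1], [9, 9]])   # constant children -> full reduction
```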
Test Set and Validation Set
- The training, validation, and test sets are commonly split in a 6:2:2 ratio
- The validation set guides model tuning (e.g., pruning and hyperparameter selection); the test set does not participate in training or tuning, and is used only to evaluate the final model after optimization
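A 6:2:2 split can be produced with plain shuffling and slicing (the helper name `split_622` is made up for this sketch):

```python
import random

def split_622(data, seed=42):
    """Shuffle, then slice into 60% train / 20% validation / 20% test."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

train, val, test = split_622(range(100))
# len(train), len(val), len(test) -> 60, 20, 20
```

For class-imbalanced data a stratified split (e.g., scikit-learn's `train_test_split` with `stratify=`) is usually preferable to naive shuffling.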
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| High training accuracy, low online/test accuracy | Tree too deep, leaves too pure, causing overfitting | Pre-pruning, post-pruning, cross-validation |
| Features with many distinct values always selected first | Information gain is biased toward features with many values | Use gain ratio or Gini impurity |
| Continuous variable splitting very slow | Too many candidate split points (N-1 per feature) | Bin or sample the candidate points first |
| Severe data fragmentation after multi-way splits | Multi-way splits dilute the samples per node | Prefer binary splits |