Big Data 201 - Decision Tree from Split to Pruning
Decision Tree
A decision tree is a tree-structured supervised learning model, commonly used for classification and regression tasks. Its basic idea is to partition the data into smaller and smaller subsets through a series of hierarchical questions, until the desired target is reached.
Local Optimum
When building a decision tree, a greedy algorithm is usually used as the core splitting strategy. The basic principle is: at each decision step, select the best split under the current conditions, without considering the impact this choice may have on subsequent steps. Each split is therefore only locally optimal, and the resulting tree is not guaranteed to be globally optimal.
Specifically, building a decision tree proceeds recursively:
- At the current node, compute the splitting criterion (e.g., information gain or Gini impurity) for all candidate features and split points
- Select the feature and split point that maximize the criterion
- Partition the dataset into subsets according to the selected split
- Recursively repeat the above process on each subset until a stopping condition is met
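The greedy step above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes numeric features, binary splits at midpoints between adjacent values, and entropy-based information gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_idx, threshold):
    """Gain from splitting rows on feature_idx <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature_idx] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature_idx] > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - child

def best_split(rows, labels):
    """Greedy step: scan all features and thresholds, keep the best gain."""
    best = (None, None, 0.0)  # (feature index, threshold, gain)
    for f in range(len(rows[0])):
        values = sorted(set(x[f] for x in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint candidate
            g = information_gain(rows, labels, f, t)
            if g > best[2]:
                best = (f, t, g)
    return best

# Toy data: feature 0 separates the classes; feature 1 is noise.
X = [[1, 7], [2, 3], [8, 5], [9, 1]]
y = [0, 0, 1, 1]
feature, threshold, gain = best_split(X, y)
```

Growing a full tree is then just calling `best_split` recursively on each resulting subset until a stopping condition is met.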
Pruning
Pruning is a technique for preventing a decision tree from overfitting. The two common approaches are pre-pruning and post-pruning.
- Pre-pruning: while the tree is being grown, estimate before each split whether splitting the current node would improve the tree's generalization performance; if not, stop splitting
- Post-pruning: first grow a complete tree, then examine the non-leaf nodes bottom-up; if replacing a node's subtree with a leaf would improve the tree's generalization ability, perform the replacement
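Both approaches are available in scikit-learn (assuming it is installed): pre-pruning via growth constraints such as `max_depth` and `min_samples_leaf`, and post-pruning via cost-complexity pruning (`ccp_alpha`). A sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: constrain the tree while it is being grown.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow a full tree, then collapse subtrees via
# cost-complexity pruning (larger ccp_alpha -> more aggressive pruning).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # pick a mid-range alpha
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
```

In practice `ccp_alpha` is chosen by cross-validating over the values returned in `path.ccp_alphas`; the mid-range pick here is just for illustration.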
Split
Splitting is the core step of decision tree construction: starting from the root node, the data are divided into child nodes according to the values of a chosen feature.
Binary Split
A binary split divides a node into exactly two child nodes each time, so the resulting model is a binary tree. Many decision tree algorithms (such as CART) are built on binary splits.
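Since scikit-learn's trees implement CART, the binary structure can be checked directly on a fitted model: in the `tree_` attribute, `children_left` and `children_right` are either both -1 (a leaf) or both valid node ids (an internal node with exactly two children).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

# Every node is either a leaf (both children -1) or has exactly two children.
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    assert (left == -1) == (right == -1)
```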
Correcting the Splitting Criterion
- Using information gain to choose splits is biased toward features with many distinct values
- Using the information gain ratio (as in C4.5) corrects this bias
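The correction can be shown in a few lines. The gain ratio divides information gain by the split information (the entropy of the feature's own value distribution), which grows with the number of distinct values. This toy example contrasts an ID-like feature (unique value per row) with a genuinely useful binary feature:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """C4.5 gain ratio: information gain divided by split information."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature)  # grows with the number of distinct values
    return gain / split_info if split_info > 0 else 0.0

labels  = [0, 0, 1, 1]
id_like = [1, 2, 3, 4]          # unique value per row: maximal raw gain
useful  = ['a', 'a', 'b', 'b']  # two values that actually separate classes
```

Both features have raw information gain 1.0 here, but the ID-like feature's split information is 2 bits versus 1 bit for the useful feature, so its gain ratio is halved.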
Continuous Variable Processing
In C4.5, if an input feature is continuous, the algorithm first sorts its values in ascending order, then takes the midpoint between each pair of adjacent values as a candidate split point. A continuous variable with N distinct values therefore yields N-1 candidate split points.
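A minimal sketch of this candidate generation:

```python
def candidate_split_points(values):
    """C4.5-style candidates: sort the distinct values ascending, then take
    the midpoint of each adjacent pair -> N-1 candidates for N values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

points = candidate_split_points([3.0, 1.0, 4.0, 2.0])
# 4 distinct values -> 3 candidate midpoints: [1.5, 2.5, 3.5]
```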
Split Criteria
- Classification trees: the Gini criterion; like information gain, Gini impurity measures how mixed the labels at a node are
- Regression trees: a common criterion is Standard Deviation Reduction (SDR), closely related to the least-squares (LS) criterion
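Both criteria are simple to compute; the following sketch uses the standard definitions (Gini impurity as 1 minus the sum of squared class proportions, SDR as the parent's standard deviation minus the weighted standard deviation of the children):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2); 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def sdr(parent, children):
    """Standard Deviation Reduction for a regression split:
    parent std minus the size-weighted std of the child nodes."""
    def std(xs):
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    n = len(parent)
    return std(parent) - sum(len(c) / n * std(c) for c in children)

g = gini([0, 0, 1, 1])                    # 0.5: a perfectly mixed binary node
r = sdr([1, 1, 9, 9], [[1, 1], [9, 9]])   # constant children -> full reduction
```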
Test Set and Validation Set
- The training, validation, and test sets are commonly split in a 6:2:2 ratio
- The validation set guides model tuning (e.g., pruning and hyperparameter selection); the test set does not participate in training or tuning, and is used only to evaluate the final model after optimization
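A 6:2:2 split can be produced with plain shuffling and slicing (the helper name `split_622` is made up for this sketch):

```python
import random

def split_622(data, seed=42):
    """Shuffle, then slice into 60% train / 20% validation / 20% test."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

train, val, test = split_622(range(100))
# len(train), len(val), len(test) -> 60, 20, 20
```

For class-imbalanced data a stratified split (e.g., scikit-learn's `train_test_split` with `stratify=`) is usually preferable to naive shuffling.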
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| High training accuracy, low online/test accuracy | Tree too deep, leaves too pure, causing overfitting | Pre-pruning, post-pruning, cross-validation |
| Features with many distinct values always selected first | Information gain is biased toward features with many values | Use gain ratio or Gini impurity |
| Continuous variable splitting very slow | Too many candidate split points (N-1 per feature) | Bin or sample the candidate points first |
| Severe data fragmentation after multi-way splits | Multi-way splits dilute the samples per node | Prefer binary splits |