Big Data 201 - Decision Trees: From Splitting to Pruning

Decision Tree

A decision tree is a tree-structured supervised learning model, commonly used for classification and regression tasks. Its basic idea is to partition the data into smaller and smaller subsets through a hierarchy of questions, until the desired target is reached.

Local Optimum

When building a decision tree, a greedy algorithm is usually used as the core splitting strategy. The basic principle of this method is: at each step, select the split that is optimal under the current conditions, without considering the impact this choice may have on subsequent steps. The result is therefore locally optimal at each node, not necessarily a globally optimal tree.

Specifically, decision tree construction recursively performs the following steps:

  1. At the current node, evaluate all candidate features and split points under a splitting criterion (information gain, Gini impurity, etc.)
  2. Select the feature and split point that maximize the criterion, e.g. information gain
  3. Partition the dataset into subsets according to the selected split
  4. Repeat the process recursively on each subset until a stopping condition is met
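The greedy step above can be sketched as follows. This is a minimal illustration of finding one best split by entropy-based information gain, not any particular library's implementation; all function names are chosen for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedily pick the (feature_index, threshold) pair maximizing
    information gain; rows is a list of numeric feature tuples."""
    base = entropy(labels)
    best = (None, None, 0.0)  # (feature, threshold, gain)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # candidate thresholds: midpoints between adjacent distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = (base
                    - (len(left) / n) * entropy(left)
                    - (len(right) / n) * entropy(right))
            if gain > best[2]:
                best = (f, t, gain)
    return best
```

A full builder would call this at every node and recurse on the two resulting subsets, which is exactly the greedy loop described above.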

Pruning

Pruning is a technique for preventing a decision tree from overfitting. The two common approaches are pre-pruning and post-pruning.

  • Pre-pruning: while the tree is being grown, estimate before each split whether it would improve the tree's generalization performance; if not, stop splitting at that node
  • Post-pruning: first grow a complete tree, then examine its non-leaf nodes bottom-up; if replacing a node's subtree with a leaf improves the tree's generalization ability, make the replacement
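A toy sketch of one common post-pruning scheme, reduced-error pruning on a held-out validation set, is shown below. The dict-based tree encoding and the use of the validation majority class as the replacement leaf are simplifications for this example (real implementations typically store the training majority at each node).

```python
from collections import Counter

def predict(node, x):
    """Walk the tree; a leaf is represented as a bare class label."""
    while isinstance(node, dict):
        node = node['left'] if x[node['feature']] <= node['threshold'] else node['right']
    return node

def prune(node, rows, labels):
    """Bottom-up reduced-error pruning: replace a subtree with a single
    leaf whenever that does not reduce accuracy on the validation
    samples (rows/labels) that reach this node."""
    if not isinstance(node, dict) or not rows:
        return node
    go_left = [x[node['feature']] <= node['threshold'] for x in rows]
    node['left'] = prune(node['left'],
                         [x for x, g in zip(rows, go_left) if g],
                         [y for y, g in zip(labels, go_left) if g])
    node['right'] = prune(node['right'],
                          [x for x, g in zip(rows, go_left) if not g],
                          [y for y, g in zip(labels, go_left) if not g])
    majority = Counter(labels).most_common(1)[0][0]
    subtree_correct = sum(predict(node, x) == y for x, y in zip(rows, labels))
    leaf_correct = sum(y == majority for y in labels)
    # prefer the smaller tree on ties
    return majority if leaf_correct >= subtree_correct else node
```

Subtrees whose splits only fit training noise lose no validation accuracy when collapsed, so they are replaced by leaves; genuinely useful splits survive.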

Split

Splitting is the core process of building a decision tree: starting from the root node, the data are divided into child nodes according to the values of a chosen feature.

Binary Split

A binary split is a specific splitting method that divides a node into exactly two child nodes at each step, producing a binary tree structure. Many decision tree algorithms (such as CART) are built on binary splits.

Correcting the Splitting Criterion

  • Using information gain to select the splitting feature is biased toward features with many distinct values
  • Using the information gain ratio (as in C4.5) corrects this bias
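The correction can be sketched in a few lines: the gain ratio divides information gain by the "split information", i.e. the entropy of the feature's own value distribution, which is large for features that shatter the data into many small subsets. Function names here are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain ratio for one categorical feature.
    feature_values holds that feature's value for each sample."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    gain = entropy(labels) - sum((len(s) / n) * entropy(s) for s in subsets.values())
    split_info = entropy(feature_values)  # entropy of the feature itself
    return gain / split_info if split_info else 0.0
```

An ID-like feature with a distinct value per sample achieves maximal information gain, but its large split information drives the ratio down, so it no longer wins automatically.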

Continuous Variable Processing

In C4.5, if an input feature is a continuous variable, the algorithm first sorts its values in ascending order, then takes the midpoint between each pair of adjacent values as a candidate split point. A continuous variable with N distinct values therefore yields N - 1 candidate split points.
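This candidate-generation step is short enough to show directly; the function name is chosen for this example.

```python
def candidate_splits(values):
    """C4.5-style candidate thresholds for a continuous feature:
    midpoints between adjacent distinct sorted values."""
    s = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(s, s[1:])]
```

Note that duplicates collapse first, so a column with N distinct values produces exactly N - 1 candidates regardless of how many rows repeat each value.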

Split Criteria

  • Classification trees: the Gini criterion; like information gain, Gini impurity measures how impure a node is
  • Regression trees: a common criterion is Standard Deviation Reduction (SDR), similar in spirit to the least-squares (minimum mean squared error) criterion
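Both criteria named above are small formulas; a sketch of each, with illustrative function names:

```python
from collections import Counter
from statistics import pstdev

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def sdr(parent, left, right):
    """Standard Deviation Reduction for a regression split: the drop in
    (population) standard deviation of the numeric targets after splitting."""
    n = len(parent)
    return (pstdev(parent)
            - (len(left) / n) * pstdev(left)
            - (len(right) / n) * pstdev(right))
```

A pure node has Gini impurity 0; a split that separates the targets into tight groups yields a large SDR, so both criteria reward splits that make child nodes more homogeneous.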

Test Set and Validation Set

  • The split ratio of training set, validation set, and test set usually follows 6:2:2
  • The validation set is used for model selection and tuning during training; the test set does not participate in modeling or tuning, and is used only for the final evaluation of the optimized model
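A 6:2:2 split can be done with the standard library alone; the fixed seed and function name below are illustrative choices for reproducibility in this sketch.

```python
import random

def split_622(data, seed=42):
    """Shuffle indices with a fixed seed, then cut at 60% and 80%
    to obtain training, validation, and test subsets."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    a, b = int(len(data) * 0.6), int(len(data) * 0.8)
    return ([data[i] for i in idx[:a]],   # training set
            [data[i] for i in idx[a:b]],  # validation set
            [data[i] for i in idx[b:]])   # test set
```

Shuffling before cutting matters: if the data are ordered (e.g. by class or by time), contiguous slices would give the three sets systematically different distributions.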

Error Quick Reference

  • Symptom: high training accuracy, low online/test accuracy. Root cause: tree too deep, overly pure leaves (overfitting). Fix: pre-pruning, post-pruning, cross-validation
  • Symptom: a feature with many distinct values is always selected first. Root cause: information gain prefers such features. Fix: use the gain ratio or Gini impurity
  • Symptom: splitting on continuous variables is very slow. Root cause: many candidate split points (N - 1). Fix: bin or sample the candidate points first
  • Symptom: severe data fragmentation after multi-way splits. Root cause: multi-way splits dilute the samples. Fix: prefer binary splits