Big Data 199 - Decision Tree Model
Tree models are a widely used family of supervised learning algorithms that can be applied to both classification and regression problems. Tree models used for classification are often called classification trees, while those used for regression are called regression trees.
Tree models find the best splitting criteria through recursive partitioning and ultimately produce a set of rules. Although the underlying algorithm is simple, the model has a wide range of applications: it performs well on both classification and regression problems, is easy to use, requires little variable tuning or data preprocessing, and generates clear rules with strong interpretability. For these reasons it is widely applied across industries.
A decision tree is a hierarchical data structure implementing a divide-and-conquer strategy. It is an effective non-parametric learning method that can be used for both classification and regression; here we mainly discuss classification decision trees. A classification decision tree is a tree structure (binary or multi-way) that classifies instances based on their features.
Decision Tree Nodes
A decision tree consists of nodes and directed edges, and contains three types of nodes:
- Root node: contains the entire sample set; has no incoming edge but zero or more outgoing edges
- Internal node: corresponds to an attribute test condition; has exactly one incoming edge and two or more outgoing edges
- Leaf (terminal) node: corresponds to a decision result; has exactly one incoming edge and no outgoing edges
Decision Tree Basic Flow
The path from the root node to each leaf node corresponds to a test sequence. The basic flow follows a simple and intuitive divide-and-conquer strategy: local regions are determined through a small number of recursive splits, and each decision node implements an attribute test function whose discrete outputs label its branches.
Decision and Conditional Probability Distribution
A decision tree is a machine learning model grounded in probability and statistics; in essence, it models the conditional probability distribution of the class variable given the region determined by the decision nodes.
Probability distribution definition and partitioning:
- The decision tree divides the feature space X into several non-overlapping regions R1, R2, …, Rm
- A conditional probability distribution P(Y|X ∈ Rj) is defined on each region Rj
- This distribution is defined over all possible values of the class variable Y
Probabilistic interpretation of the tree structure:
- The root node corresponds to the initial, unpartitioned feature space
- Each internal decision node represents a further subdivision of the current region
- As a path extends from root to leaf, the feature space is recursively divided into smaller sub-regions
- Finally, each leaf node corresponds to a specific region Rj of the feature space
Characteristics of the conditional probabilities:
- During training, P(Y|X ∈ Rj) is estimated by counting the class distribution of the samples that fall into each region
- Typically, the conditional probabilities at a leaf node are strongly skewed, i.e., one class has a much higher probability than the others
- For example, in binary classification a leaf node might have P(Y=1|X ∈ Rj) = 0.85 and P(Y=0|X ∈ Rj) = 0.15
Classification decision process:
- For a new sample x, the decision tree routes it to the corresponding leaf node (region Rj)
- According to that node's conditional probability distribution, the sample is classified by the Maximum A Posteriori (MAP) criterion: ŷ = argmax_y P(Y=y|X ∈ Rj)
- This means that even when no class has probability 1, the decision tree still assigns the sample to the class with the highest probability
This probabilistic framework provides a theoretical basis for understanding how decision trees work. It also explains why a decision tree can easily reach 100% accuracy on the training data: it can keep subdividing until each leaf node contains samples of only a single class.
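The MAP decision at a leaf can be sketched as follows. The helper name `map_predict` and the class counts are illustrative, not from the original text; the counts 17/3 reproduce the 0.85/0.15 example above.

```python
# Hypothetical sketch: MAP classification at a single leaf node.
# The class counts below are illustrative.

def map_predict(class_counts):
    """Return the class y maximizing the estimated P(Y=y | X in Rj)."""
    total = sum(class_counts.values())
    # Conditional probabilities estimated from training counts in region Rj
    probs = {y: c / total for y, c in class_counts.items()}
    # MAP criterion: argmax_y P(Y=y | X in Rj)
    return max(probs, key=probs.get)

# Leaf with P(Y=1)=0.85, P(Y=0)=0.15 -> the sample is forced into class 1
print(map_predict({1: 17, 0: 3}))
```

Even though class 1 is only 85% likely in this region, the leaf always predicts it; the remaining 15% is the leaf's training error.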
Learning Algorithm
Decision tree learning is a machine learning method that constructs classification rules by induction over a training dataset; the process is also called "tree induction".
Model selection principle:
- For the same training dataset there may exist countless decision trees that classify it perfectly; among them we seek the "smallest" tree
- From a probabilistic perspective, decision tree learning estimates a conditional probability model
Algorithm execution process:
- The algorithm starts with the entire training dataset at the root node
- At each split step, the algorithm searches for the optimal split point of a continuous attribute; discrete attributes are split by their values
- Split criteria are usually based on: Information Gain (ID3), Information Gain Ratio (C4.5), or the Gini Index (CART)
- The process is repeated recursively on each subset until a stop condition is met
Stop conditions:
- The samples in the current subset mostly belong to the same class (e.g., more than 95%)
- No suitable features remain for further splitting
- A preset tree-depth limit has been reached
A decision tree learning algorithm comprises feature selection, tree generation, and tree pruning. Since a decision tree represents a conditional probability distribution, trees of different depths correspond to probability models of different complexity. Tree generation considers only local optimality; in contrast, tree pruning considers global optimality.
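The execution process and stop conditions above can be sketched as a small recursive function. This is an illustrative ID3-style sketch, not a production implementation: the helper names (`build_tree`, `info_gain`), the purity and depth thresholds, and the tiny dataset (the same 5 samples used in the entropy example later in this article) are all assumptions for demonstration.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features, depth=0, max_depth=3, purity=0.95):
    counts = Counter(labels)
    majority, majority_n = counts.most_common(1)[0]
    # Stop conditions: subset nearly pure, no features left, or depth limit hit
    if majority_n / len(labels) >= purity or not features or depth >= max_depth:
        return majority  # leaf: predict the majority (MAP) class

    def info_gain(f):
        # Information gain of splitting the current subset on feature f
        by_val = {}
        for row, y in zip(rows, labels):
            by_val.setdefault(row[f], []).append(y)
        remainder = sum(len(ys) / len(labels) * entropy(ys)
                        for ys in by_val.values())
        return entropy(labels) - remainder

    best = max(features, key=info_gain)          # locally optimal split
    rest = [f for f in features if f != best]
    # Partition the subset by the chosen feature's values and recurse
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[best], ([], []))
        subsets[row[best]][0].append(row)
        subsets[row[best]][1].append(y)
    return {'feature': best,
            'branches': {v: build_tree(r, ys, rest, depth + 1, max_depth, purity)
                         for v, (r, ys) in subsets.items()}}

rows = [{'Accompanied': a, 'Played Games': g}
        for a, g in [(0, 1), (0, 1), (0, 0), (1, 1), (1, 1)]]
labels = ['Yes', 'Yes', 'Not', 'Not', 'Not']
tree = build_tree(rows, labels, ['Accompanied', 'Played Games'])
print(tree)
```

Each recursive call only picks the best split for its own subset, which is exactly the "local optimum" nature of tree generation mentioned above.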
Calculation of Shannon Entropy
The key to decision tree learning is how to select the optimal splitting attribute. Generally, as splitting proceeds, we want the samples in each branch node to belong to the same class as much as possible, i.e., node purity should keep increasing.
In information theory and statistics, entropy is a measure of the uncertainty of a random variable. The entropy used here is also called Shannon entropy, named after the father of information theory, Claude Shannon.
Shannon entropy formula:
H(X) = -∑p(xi)log₂(p(xi))
For binary classification:
- If p1=1 and p2=0, entropy is 0 (certainty)
- If p1 = p2 = 0.5, entropy is 1 (maximum uncertainty)
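The two extreme cases above can be checked directly. This is a quick sketch (the function name `binary_entropy` is illustrative); it uses the convention 0·log₂(0) = 0.

```python
import math

def binary_entropy(p1):
    """H = -(p1*log2(p1) + p2*log2(p2)), treating 0*log2(0) as 0."""
    p2 = 1 - p1
    terms = [p * math.log2(p) for p in (p1, p2) if p > 0]
    return -sum(terms) + 0.0  # "+ 0.0" normalizes -0.0 to 0.0

print(binary_entropy(1.0))  # 0.0: complete certainty
print(binary_entropy(0.5))  # 1.0: maximum uncertainty
```

Any intermediate probability gives an entropy strictly between 0 and 1 bit.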
Other Impurity Measures
Besides Shannon entropy, other commonly used impurity measures:
Gini Index:
Gini = 1 - ∑p(xi)²
Misclassification Error:
Error = 1 - max(p)
All three measures reach their maximum when the class distribution is uniform (in the binary case, p = 0.5), and their minimum when all records belong to the same class (p = 0 or p = 1).
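The three measures can be compared side by side on a binary distribution. A quick sketch (the function name `impurities` is illustrative; 0·log₂(0) is treated as 0):

```python
import math

def impurities(p):
    """Return (entropy, Gini index, misclassification error) for a binary split."""
    probs = [p, 1 - p]
    entropy = -sum(q * math.log2(q) for q in probs if q > 0) + 0.0
    gini = 1 - sum(q * q for q in probs)
    error = 1 - max(probs)
    return entropy, gini, error

print(impurities(0.5))  # maxima: (1.0, 0.5, 0.5)
print(impurities(1.0))  # minima: (0.0, 0.0, 0.0)
```

All three agree at the extremes; they differ only in how steeply they penalize mixed distributions, which is why different algorithms (ID3/C4.5 vs. CART) can pick different splits.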
Python Code Implementation of Shannon Entropy
import pandas as pd
import numpy as np

row_data = {
    'Accompanied': [0, 0, 0, 1, 1],
    'Played Games': [1, 1, 0, 1, 1],
    'Bad Boy': ['Yes', 'Yes', 'Not', 'Not', 'Not']
}
dataSet = pd.DataFrame(row_data)

def calEnt(dataSet):
    # Total number of rows in the dataset
    n = dataSet.shape[0]
    # Counts of each label class (labels are in the last column)
    iset = dataSet.iloc[:, -1].value_counts()
    # Proportion of each label class
    p = iset / n
    # Shannon entropy
    ent = (-p * np.log2(p)).sum()
    return ent

calEnt(dataSet)
Higher entropy means higher impurity, i.e., more mixed data. In other words, judging from the labels alone, if you picked at random among these 5 people it would be hard to correctly guess whether a given one is a "bad boy".
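As a sanity check, the same value can be recomputed by hand from the label proportions in the dataset above (2 of 5 samples are 'Yes', 3 of 5 are 'Not'):

```python
import math

# Label proportions from the 5-row example dataset
p_yes, p_not = 2 / 5, 3 / 5
ent = -(p_yes * math.log2(p_yes) + p_not * math.log2(p_not))
print(round(ent, 3))  # 0.971
```

This should match the value returned by calEnt(dataSet): close to the maximum of 1 bit, confirming the labels are quite mixed.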
Common Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Training set accuracy extremely high, test set significantly lower | Tree too deep, leaf samples too small, continuous splitting causes overfitting | Pre-pruning (max_depth/min_samples_leaf), post-pruning |
| Split result unstable, small data perturbation causes major tree structure change | Decision tree is high variance model | Use ensemble methods (RandomForest/GBDT) or limit depth |
| Continuous feature split effect poor | Continuous variable split point selection not good | Check candidate split point enumeration logic; use impurity-based optimal split point search |
| With extreme class imbalance, leaf node probability biased obviously | Leaf node probability dominated by majority class | Cost-sensitive learning/class weights/resampling |
| Entropy calculation gets NaN/inf | Probability 0 causes log(0) | Add tiny epsilon value to probability |
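For the last row of the table, one sketch of avoiding NaN/inf is to drop zero-probability entries before taking the logarithm (equivalent to the 0·log₂(0) = 0 convention), rather than adding an epsilon. The function name `safe_entropy` is illustrative:

```python
import numpy as np

def safe_entropy(p):
    """Entropy of a probability vector without NaN/inf from log2(0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero probabilities: 0 * log2(0) is defined as 0
    return float(-(p * np.log2(p)).sum())

print(safe_entropy([0.5, 0.5, 0.0]))  # 1.0, with no warning or NaN
```

Filtering is exact, whereas adding an epsilon slightly perturbs the result; either is acceptable for split selection since only the ranking of candidate splits matters.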