Big Data 202 - sklearn Decision Tree Practice
Version Matrix
| Component | Version |
|---|---|
| scikit-learn | 1.8.0 |
| python-graphviz | 0.21 |
| System Graphviz | Latest |
1. Parameter CRITERION
The criterion parameter determines how node impurity is calculated; sklearn provides two options:
- Input "entropy": use information entropy (Entropy)
- Input "gini": use Gini impurity
Difference between Gini Impurity and Information Entropy
Information Entropy formula: H(X) = -Σ p(x) log₂ p(x)
- Involves a logarithm, so it is computationally more expensive than Gini impurity
- Penalizes impure nodes more strongly
Gini Impurity formula: Gini = 1 - Σ p(x)²
- Simpler and more direct; no logarithm needed
- Faster to compute
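As a quick sanity check on the two formulas above, here is a minimal sketch that computes both impurity measures for a class-probability distribution (the helper names `entropy` and `gini` are illustrative, not sklearn APIs):

```python
import numpy as np

def entropy(p):
    """Information entropy H = -sum(p * log2(p)); zero probabilities are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def gini(p):
    """Gini impurity = 1 - sum(p^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

# A maximally impure two-class node: 50/50 split
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.5, 0.5]))     # 0.5
```

Note that entropy ranges up to log₂(k) for k classes while Gini impurity is bounded by 1 - 1/k, which is one way to see the "stronger punishment" entropy applies to impure nodes.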
Practical Application Suggestions
- For structured data (e.g. tabular data), try Gini impurity first
- In resource-constrained real-time systems, Gini impurity is the better choice
- When data quality is good and dimensionality is moderate (e.g. UCI datasets), compare the effects of both metrics
- In ensemble learning (e.g. random forest), the difference between the two metrics is usually diluted
2. Basic Modeling Code
# Import required algorithm libraries and modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
# Load data
wine = load_wine()
# Split data
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)
# Build model
clf = tree.DecisionTreeClassifier(criterion="gini")
clf = clf.fit(Xtrain, Ytrain)
clf.score(Xtest, Ytest)
3. Draw Decision Tree (Graphviz Visualization)
import graphviz
from sklearn import tree
feature_name = ['Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids',
                'Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue',
                'OD280/OD315 of diluted wines','Proline']
dot_data = tree.export_graphviz(
clf,
out_file=None,
feature_names=feature_name,
class_names=["Gin","Sherry","Vermouth"],
filled=True,
rounded=True
)
graph = graphviz.Source(dot_data)
graph
export_graphviz Parameter Description
| Parameter | Description |
|---|---|
| feature_names | Each attribute name |
| class_names | Each dependent variable category name |
| label | Whether to show impurity/sample info on nodes: 'all' (default), 'root', or 'none' |
| filled | If True, color each node by its majority class (shade reflects purity) |
| out_file | Output dot file name; None returns the dot source as a string |
| rounded | If True, draw node boxes with rounded corners |
4. Prevent Overfitting
Causes of Overfitting
Without restrictions, a decision tree keeps growing until:
- Every leaf node's Gini impurity or information entropy reaches its minimum (usually 0)
- No features remain for further splitting
- Each leaf node contains samples of only a single class
Specific manifestations:
- May reach 100% accuracy on training set
- Significantly worse on test set
Root causes:
- Sample bias problem: Training data cannot fully represent overall data distribution
- Noise sensitivity problem: Decision tree captures random noise in training data
Solutions
- Pre-pruning:
  - Set max depth (max_depth)
  - Set minimum samples per leaf (min_samples_leaf)
- Post-pruning:
  - Use Cost Complexity Pruning (CCP)
  - Prune based on validation set performance
- Ensemble Methods:
  - Random Forest
  - Gradient Boosting Tree
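The pre-pruning options above can be sketched on the wine dataset; the specific values `max_depth=3` and `min_samples_leaf=10` are illustrative, not recommendations:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)

# Unconstrained tree: grows until leaves are pure, memorizing the training set
unpruned = DecisionTreeClassifier(random_state=420).fit(Xtrain, Ytrain)

# Pre-pruned tree: depth and leaf-size limits stop growth early
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                random_state=420).fit(Xtrain, Ytrain)

print("unpruned depth:", unpruned.get_depth(),
      "train acc:", unpruned.score(Xtrain, Ytrain))
print("pruned depth:  ", pruned.get_depth(),
      "train acc:", round(pruned.score(Xtrain, Ytrain), 3))
```

For post-pruning, `DecisionTreeClassifier` also accepts a `ccp_alpha` parameter, with candidate alphas available from `clf.cost_complexity_pruning_path(...)`.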
5. random_state
- Controls the randomness the tree uses when choosing among candidate splits
- Default is None; the randomness is more noticeable in high-dimensional data
- Passing any fixed integer always grows the same tree, making the model reproducible
6. splitter
- Input "best": when branching, the decision tree prioritizes the more important features (larger impurity reduction)
- Input "random": branching is more random; the tree tends to grow deeper and larger because it admits more uninformative splits
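A minimal sketch comparing the two splitter settings on the wine dataset (with `splitter="random"` typically, though not always, producing the deeper tree):

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

best = DecisionTreeClassifier(splitter="best",
                              random_state=0).fit(wine.data, wine.target)
rand = DecisionTreeClassifier(splitter="random",
                              random_state=0).fit(wine.data, wine.target)

# Compare how deep each strategy grows before all leaves are pure
print("splitter='best'   depth:", best.get_depth())
print("splitter='random' depth:", rand.get_depth())
```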
7. Common Problem Troubleshooting
| Symptom | Root Cause | Fix |
|---|---|---|
| ModuleNotFoundError: No module named 'graphviz' | python-graphviz package not installed | pip install graphviz |
| ExecutableNotFoundError | Only installed python package, not system Graphviz | Install system Graphviz and configure PATH |
| ValueError: Length of feature_names | feature_names length does not match X's column count | Make feature_names match the number of feature columns |
| Training 1.0, test significantly lower | Tree grows without constraint causing overfitting | Add max_depth/min_samples_leaf etc parameters |
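For the `ValueError` row in particular, a one-line check before calling `export_graphviz` catches the mismatch early:

```python
from sklearn.datasets import load_wine

wine = load_wine()
feature_name = ['Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium',
                'Total phenols','Flavanoids','Nonflavanoid phenols','Proanthocyanins',
                'Color intensity','Hue','OD280/OD315 of diluted wines','Proline']

# export_graphviz raises ValueError when these counts disagree
assert len(feature_name) == wine.data.shape[1], (
    f"{len(feature_name)} names vs {wine.data.shape[1]} feature columns")
print("feature count OK:", len(feature_name))
```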