Big Data 202 - sklearn Decision Tree Practice

Version Matrix

Component          Version
scikit-learn       1.8.0
python-graphviz    0.21
System Graphviz    Latest

1. Parameter CRITERION

The criterion parameter determines how node impurity is calculated; sklearn provides two options:

  • "entropy": use information entropy
  • "gini": use Gini impurity (the default)

Difference between Gini Impurity and Information Entropy

Information entropy formula: H(X) = -Σ p(x) log₂ p(x)

  • Involves a logarithm, so it is computationally more expensive than Gini impurity
  • Penalizes impurity more strongly

Gini impurity formula: Gini = 1 - Σ p(x)²

  • Simpler and more direct; no logarithm needed
  • Faster to compute
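The two formulas can be compared numerically. A minimal sketch (the helper names entropy and gini are mine, not part of sklearn):

```python
import numpy as np

def entropy(p):
    """Information entropy H = -sum(p * log2(p)), skipping zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def gini(p):
    """Gini impurity = 1 - sum(p^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

# A maximally impure binary node: entropy = 1.0, Gini = 0.5
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))

# A nearly pure node: both metrics shrink toward 0,
# but entropy stays larger, i.e. it punishes impurity harder
print(entropy([0.9, 0.1]), gini([0.9, 0.1]))
```

For any non-pure distribution, entropy gives a larger value than Gini impurity, which is what "stronger punishment for impurity" means in practice.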

Practical Application Suggestions

  • For structured (tabular) data, try Gini impurity first
  • In resource-constrained real-time systems, Gini impurity is the better choice
  • When data quality is good and dimensionality is moderate (e.g. UCI datasets), compare both metrics and keep the better one
  • In ensemble learning (e.g. random forests), the difference between the two metrics is usually diluted
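The suggestion to "compare both metrics" can be done in a few lines; a sketch on the wine dataset used throughout this practice (the split parameters mirror the modeling code below):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)

# Fit one tree per criterion and compare test accuracy
scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=420)
    clf.fit(Xtrain, Ytrain)
    scores[criterion] = clf.score(Xtest, Ytest)
    print(criterion, scores[criterion])
```

On a small, clean dataset like wine the two scores are usually close, which is exactly the point of the last suggestion above.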

2. Basic Modeling Code

# Import required algorithm libraries and modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load data
wine = load_wine()

# Split data into training and test sets (70/30)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)

# Build the model and report test-set accuracy
clf = tree.DecisionTreeClassifier(criterion="gini")
clf = clf.fit(Xtrain, Ytrain)
clf.score(Xtest, Ytest)
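After fitting, it is worth inspecting which features the tree actually relied on. A sketch using the fitted model's feature_importances_ attribute (the importances always sum to 1; features with 0 were never used in a split):

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)
clf = DecisionTreeClassifier(criterion="gini", random_state=420).fit(Xtrain, Ytrain)

# Pair each feature name with its importance and show the strongest few
importances = pd.Series(clf.feature_importances_, index=wine.feature_names)
print(importances.sort_values(ascending=False).head())
```

This is a quick sanity check before drawing the full tree: if one feature carries nearly all the importance, the visualization will be dominated by it.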

3. Draw Decision Tree (Graphviz Visualization)

import graphviz
from sklearn import tree

feature_name = ['Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids',
                'Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue',
                'OD280/OD315 of diluted wines','Proline']

dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=feature_name,
    class_names=["Gin","Sherry","Vermouth"],
    filled=True,
    rounded=True
)
graph = graphviz.Source(dot_data)
graph
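If the system Graphviz binary is not available, sklearn's own export_text prints the same tree as plain text with no Graphviz dependency at all. A small sketch (max_depth=2 is chosen here just to keep the printout short):

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
clf = DecisionTreeClassifier(criterion="gini", random_state=420, max_depth=2)
clf.fit(wine.data, wine.target)

# export_text needs no Graphviz installation; it prints an indented split tree
tree_text = export_text(clf, feature_names=list(wine.feature_names))
print(tree_text)
```

This is also a convenient fallback for the Graphviz-related errors listed in the troubleshooting section.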

export_graphviz Parameter Description

Parameter       Description
feature_names   Name of each feature (attribute)
class_names     Name of each target class
label           Whether to show impurity information on nodes; default 'all'
filled          Whether to color each node by its majority class
out_file        Output .dot file name; None returns the dot source as a string
rounded         Whether to draw node boxes with rounded corners; default False

4. Prevent Overfitting

Causes of Overfitting

Without restrictions, a decision tree keeps growing until:

  1. Every leaf node's Gini impurity or information entropy reaches its minimum (usually 0)
  2. No features remain for further splitting
  3. Each leaf node contains samples of only a single class

Specific manifestations:

  • May reach 100% accuracy on training set
  • Significantly worse on test set

Root causes:

  1. Sample bias problem: Training data cannot fully represent overall data distribution
  2. Noise sensitivity problem: Decision tree captures random noise in training data

Solutions

  1. Pre-pruning:
     • Set a maximum depth (max_depth)
     • Set a minimum number of samples per leaf (min_samples_leaf)
  2. Post-pruning:
     • Use cost-complexity pruning (CCP, via ccp_alpha)
     • Prune based on validation-set performance
  3. Ensemble methods:
     • Random forest
     • Gradient-boosted trees
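The pre-pruning idea can be demonstrated directly: an unrestricted tree memorizes the training set, while a constrained one trades a little training accuracy for better generalization. A sketch on the wine split used above (the specific values max_depth=3 and min_samples_leaf=5 are illustrative, not recommendations):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)

# Unrestricted tree: grows until every leaf is pure, so training accuracy hits 1.0
full = DecisionTreeClassifier(random_state=420).fit(Xtrain, Ytrain)

# Pre-pruned tree: depth and leaf-size limits restrain growth
pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=420
).fit(Xtrain, Ytrain)

print("full   train/test:", full.score(Xtrain, Ytrain), full.score(Xtest, Ytest))
print("pruned train/test:", pruned.score(Xtrain, Ytrain), pruned.score(Xtest, Ytest))
```

In practice, sweep these parameters (or ccp_alpha for post-pruning) and keep the setting with the best test or validation score.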

5. random_state

  • Controls the randomness the tree uses when selecting features at each split
  • Default is None; the randomness is more noticeable in high-dimensional data
  • Passing any fixed integer always grows the same tree, making the model reproducible
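A quick way to confirm that a fixed random_state reproduces the same tree is to fit twice and compare the learned split structure (read here from the tree_ attribute):

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# Two fits with the same random_state produce identical trees:
# every node splits on the same feature in the same order
a = DecisionTreeClassifier(random_state=420).fit(wine.data, wine.target)
b = DecisionTreeClassifier(random_state=420).fit(wine.data, wine.target)
print((a.tree_.feature == b.tree_.feature).all())
```

With random_state=None, repeated fits may pick different (equally good) splits, so scores and plots can change from run to run.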

6. splitter

  • Input best: at each split, the decision tree prefers the more important features (by impurity decrease)
  • Input random: splits are chosen more randomly; the tree tends to grow deeper and larger because it absorbs more less-relevant information
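The two options can be contrasted directly by comparing tree depth. Whether "random" actually grows deeper varies by dataset and seed, so treat this sketch as illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# splitter="best" is the default; splitter="random" picks split thresholds randomly
best = DecisionTreeClassifier(splitter="best", random_state=420).fit(wine.data, wine.target)
rand = DecisionTreeClassifier(splitter="random", random_state=420).fit(wine.data, wine.target)

print("best depth:", best.get_depth())
print("random depth:", rand.get_depth())
```

A random splitter adds extra randomness on top of random_state, which can help against overfitting on noisy data at the cost of a bushier tree.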

7. Common Problem Troubleshooting

Symptom                                           Root Cause                                             Fix
ModuleNotFoundError: No module named 'graphviz'   python-graphviz package not installed                  pip install graphviz
ExecutableNotFound                                Python package installed, but system Graphviz missing  Install system Graphviz and add it to PATH
ValueError: Length of feature_names ...           feature_names count does not match X's columns         Make the name list match the feature count
Training score 1.0, test score much lower         Unconstrained tree growth (overfitting)                Set max_depth / min_samples_leaf etc.