Big Data 202 - sklearn Decision Tree Practice

Version Matrix

Component          Version
scikit-learn       1.8.0
python-graphviz    0.21
System Graphviz    Latest

1. Parameter CRITERION

The criterion parameter determines how node impurity is calculated; sklearn provides two options:

  • "entropy": use information entropy
  • "gini": use Gini impurity (the default)

Difference between Gini Impurity and Information Entropy

Information entropy formula: H(X) = -Σ p(x) log₂ p(x)

  • Involves a logarithm, so it is computationally more expensive than Gini impurity
  • Penalizes impurity more strongly

Gini impurity formula: Gini = 1 - Σ p(x)²

  • Simpler and more direct; no logarithm needed
  • Faster to compute
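The two formulas can be compared numerically. A minimal sketch (the helper names entropy and gini are mine, not part of sklearn):

```python
import numpy as np

def entropy(p):
    """Information entropy H = -sum(p * log2(p)), skipping zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def gini(p):
    """Gini impurity = 1 - sum(p^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

# A maximally impure binary node: entropy = 1.0, Gini = 0.5
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))

# A nearly pure node: both metrics shrink toward 0,
# but entropy stays larger, i.e. it punishes impurity harder
print(entropy([0.9, 0.1]), gini([0.9, 0.1]))
```

For any non-pure distribution, entropy gives a larger value than Gini impurity, which is what "stronger punishment for impurity" means in practice.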

Practical Application Suggestions

  • For structured (tabular) data, try Gini impurity first
  • In resource-constrained real-time systems, Gini impurity is the better choice
  • When data quality is good and dimensionality is moderate (e.g. UCI datasets), compare both metrics and keep the better one
  • In ensemble learning (e.g. random forests), the difference between the two metrics is usually diluted
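The suggestion to "compare both metrics" can be done in a few lines; a sketch on the wine dataset used throughout this practice (the split parameters mirror the modeling code below):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)

# Fit one tree per criterion and compare test accuracy
scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=420)
    clf.fit(Xtrain, Ytrain)
    scores[criterion] = clf.score(Xtest, Ytest)
    print(criterion, scores[criterion])
```

On a small, clean dataset like wine the two scores are usually close, which is exactly the point of the last suggestion above.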

2. Basic Modeling Code

# Import required algorithm libraries and modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load data
wine = load_wine()

# Split data into training and test sets (70/30)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)

# Build the model and report test-set accuracy
clf = tree.DecisionTreeClassifier(criterion="gini")
clf = clf.fit(Xtrain, Ytrain)
clf.score(Xtest, Ytest)
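After fitting, it is worth inspecting which features the tree actually relied on. A sketch using the fitted model's feature_importances_ attribute (the importances always sum to 1; features with 0 were never used in a split):

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)
clf = DecisionTreeClassifier(criterion="gini", random_state=420).fit(Xtrain, Ytrain)

# Pair each feature name with its importance and show the strongest few
importances = pd.Series(clf.feature_importances_, index=wine.feature_names)
print(importances.sort_values(ascending=False).head())
```

This is a quick sanity check before drawing the full tree: if one feature carries nearly all the importance, the visualization will be dominated by it.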

3. Draw Decision Tree (Graphviz Visualization)

import graphviz
from sklearn import tree

feature_name = ['Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids',
                'Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue',
                'OD280/OD315 of diluted wines','Proline']

dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=feature_name,
    class_names=["Gin","Sherry","Vermouth"],
    filled=True,
    rounded=True
)
graph = graphviz.Source(dot_data)
graph
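If the system Graphviz binary is not available, sklearn's own export_text prints the same tree as plain text with no Graphviz dependency at all. A small sketch (max_depth=2 is chosen here just to keep the printout short):

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
clf = DecisionTreeClassifier(criterion="gini", random_state=420, max_depth=2)
clf.fit(wine.data, wine.target)

# export_text needs no Graphviz installation; it prints an indented split tree
tree_text = export_text(clf, feature_names=list(wine.feature_names))
print(tree_text)
```

This is also a convenient fallback for the Graphviz-related errors listed in the troubleshooting section.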

export_graphviz Parameter Description

Parameter       Description
feature_names   Name of each feature (attribute)
class_names     Name of each target class
label           Whether to show impurity information on nodes; default 'all'
filled          Whether to color each node by its majority class
out_file        Output .dot file name; None returns the dot source as a string
rounded         Whether to draw node boxes with rounded corners; default False

4. Prevent Overfitting

Causes of Overfitting

Without restrictions, a decision tree keeps growing until:

  1. Every leaf node's Gini impurity or information entropy reaches its minimum (usually 0)
  2. No features remain for further splitting
  3. Each leaf node contains samples of only a single class

Specific manifestations:

  • May reach 100% accuracy on training set
  • Significantly worse on test set

Root causes:

  1. Sample bias problem: Training data cannot fully represent overall data distribution
  2. Noise sensitivity problem: Decision tree captures random noise in training data

Solutions

  1. Pre-pruning:
     • Set a maximum depth (max_depth)
     • Set a minimum number of samples per leaf (min_samples_leaf)
  2. Post-pruning:
     • Use cost-complexity pruning (CCP, via ccp_alpha)
     • Prune based on validation-set performance
  3. Ensemble methods:
     • Random forest
     • Gradient-boosted trees
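The pre-pruning idea can be demonstrated directly: an unrestricted tree memorizes the training set, while a constrained one trades a little training accuracy for better generalization. A sketch on the wine split used above (the specific values max_depth=3 and min_samples_leaf=5 are illustrative, not recommendations):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420
)

# Unrestricted tree: grows until every leaf is pure, so training accuracy hits 1.0
full = DecisionTreeClassifier(random_state=420).fit(Xtrain, Ytrain)

# Pre-pruned tree: depth and leaf-size limits restrain growth
pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=420
).fit(Xtrain, Ytrain)

print("full   train/test:", full.score(Xtrain, Ytrain), full.score(Xtest, Ytest))
print("pruned train/test:", pruned.score(Xtrain, Ytrain), pruned.score(Xtest, Ytest))
```

In practice, sweep these parameters (or ccp_alpha for post-pruning) and keep the setting with the best test or validation score.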

5. random_state

  • Controls the randomness the tree uses when selecting features at each split
  • Default is None; the randomness is more noticeable in high-dimensional data
  • Passing any fixed integer always grows the same tree, making the model reproducible
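A quick way to confirm that a fixed random_state reproduces the same tree is to fit twice and compare the learned split structure (read here from the tree_ attribute):

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# Two fits with the same random_state produce identical trees:
# every node splits on the same feature in the same order
a = DecisionTreeClassifier(random_state=420).fit(wine.data, wine.target)
b = DecisionTreeClassifier(random_state=420).fit(wine.data, wine.target)
print((a.tree_.feature == b.tree_.feature).all())
```

With random_state=None, repeated fits may pick different (equally good) splits, so scores and plots can change from run to run.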

6. splitter

  • Input best: at each split, the decision tree prefers the more important features (by impurity decrease)
  • Input random: splits are chosen more randomly; the tree tends to grow deeper and larger because it absorbs more less-relevant information
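The two options can be contrasted directly by comparing tree depth. Whether "random" actually grows deeper varies by dataset and seed, so treat this sketch as illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# splitter="best" is the default; splitter="random" picks split thresholds randomly
best = DecisionTreeClassifier(splitter="best", random_state=420).fit(wine.data, wine.target)
rand = DecisionTreeClassifier(splitter="random", random_state=420).fit(wine.data, wine.target)

print("best depth:", best.get_depth())
print("random depth:", rand.get_depth())
```

A random splitter adds extra randomness on top of random_state, which can help against overfitting on noisy data at the cost of a bushier tree.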

7. Common Problem Troubleshooting

Symptom                                           Root Cause                                             Fix
ModuleNotFoundError: No module named 'graphviz'   python-graphviz package not installed                  pip install graphviz
ExecutableNotFound                                Python package installed, but system Graphviz missing  Install system Graphviz and add it to PATH
ValueError: Length of feature_names ...           feature_names count does not match X's columns         Make the name list match the feature count
Training score 1.0, test score much lower         Unconstrained tree growth (overfitting)                Set max_depth / min_samples_leaf etc.