Big Data 203 - sklearn Decision Tree Pruning Parameters
TL;DR
- Scenario: DecisionTreeClassifier overfits, the tree grows too large (memory spikes), and class imbalance calls for controlled pruning and weighting
- Conclusion: start with max_depth + min_samples_leaf as the baseline; from 0.19 onward use min_impurity_decrease in place of min_impurity_split
- Output: learning-curve tuning flow (score comparison) + version matrix + quick error reference
Version Matrix
| Parameter | Version/Description |
|---|---|
| max_depth | 0.17 / 1.8 (docs). With the default None the tree grows until leaves are pure or min_samples_split blocks further splits; in practice this is the first parameter to use to truncate the tree, cutting both overfitting and memory use |
| min_samples_leaf | 1.8 (docs). Minimum samples a leaf must hold, suppressing leaves that rest on very few samples; the docs suggest starting around 5, and both int and float (fraction) are accepted |
| min_samples_split | 0.17 / 1.8 (docs). Minimum samples a node must hold before it may split; also accepts a float (fraction), which adapts to datasets of different sizes |
| max_features | 0.17 (docs). Size of the feature subset evaluated at each split; note the search may effectively inspect more than max_features features to find a valid split (older docs state this explicitly) |
| min_impurity_decrease | 0.19+ / 1.8 (docs). Introduced in 0.19: a node splits only if the weighted impurity decrease is at least this threshold; replaces min_impurity_split |
| min_impurity_split | Deprecated in 0.19 (docs); removed in later versions, where passing it raises TypeError "unexpected keyword" (migrate to min_impurity_decrease) |
| class_weight | 1.8 (docs). Accepts a dict or "balanced"; changes each class's effective weight under imbalance, which in turn changes what weight-based pruning thresholds actually mean |
Pruning Parameters
max_depth
Limiting the tree's maximum depth is one of the most commonly used pruning parameters; it controls how far the tree grows and guards against overfitting. Concretely, you set a depth threshold, and once the tree reaches it, no further splits are made.
This parameter is especially effective with high-dimensional features and few samples, because each additional level in principle demands exponentially more data:
- at depth 2, at least 4 samples may be needed to support every split
- at depth 3, the demand rises to 8
- at depth 4, to 16
- and so on: each extra level doubles the sample requirement. Limiting depth therefore keeps the model from memorizing noise when data is scarce.
In practice this parameter is especially useful for:
- a single decision tree, to prevent overfitting
- ensemble methods (Random Forest, GBDT), to bound base-learner complexity
- high-dimensional sparse data, to preserve generalization
Suggested workflow:
- start with max_depth=3 as the baseline
- watch performance on a validation set
- if the model underfits, raise the depth step by step (4, then 5)
- re-evaluate after every change
- depths of 3-8 are usually reasonable; beyond 10, overfitting is common
Note: the best depth depends on the dataset, so confirm it with cross-validation, and combine it with other pruning parameters (such as min_samples_split) for better results.
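A minimal sketch of the baseline-then-loosen workflow above, using a synthetic high-dimensional, small-sample dataset from make_classification (the dataset and all values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: many features, few samples -- the regime where depth limits help
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, random_state=0)

# Unpruned baseline: grows until leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(Xtrain, Ytrain)
# Depth-limited tree: start from max_depth=3 and loosen only if it underfits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xtrain, Ytrain)

print("full   train/test:", full.score(Xtrain, Ytrain), full.score(Xtest, Ytest))
print("depth3 train/test:", shallow.score(Xtrain, Ytrain), shallow.score(Xtest, Ytest))
print("depths:", full.get_depth(), shallow.get_depth())
```

The unpruned tree will reach a perfect training score while the depth-3 tree trades a little training accuracy for a much smaller model; compare the two test scores before deciding whether to raise the depth.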
min_samples_leaf
min_samples_leaf is a key hyperparameter that constrains how the tree branches. Specifically:
1. Branch constraint:
- While growing the tree, a candidate split is only considered if each resulting child node would contain at least min_samples_leaf samples
- When a candidate violates this, one of two things happens:
- that split is rejected outright, or
- the algorithm picks a different split threshold so that both children satisfy the minimum
- Example: with min_samples_leaf=10, any split that would leave a child with fewer than 10 samples is blocked
2. Synergy with other parameters:
- usually combined with max_depth
- in regression trees the pair works particularly well:
- it smooths the predictions
- it keeps the tree from growing deep enough to overfit
- example: in house-price prediction, the combination reduces the volatility of predicted values
3. Setting suggestions:
- start from 5
- for uneven sample distributions:
- pass a float to express a fraction (e.g., 0.1 means 10% of samples)
- this adapts automatically to datasets of different sizes
- practical scenarios:
- in a credit-risk model, a value of 50 can avoid overfitting to small subpopulations
- in an e-commerce recommender, 0.05 can preserve coverage of long-tail items
4. Effect of the value:
- too small (e.g., 1):
- risks overfitting
- produces many leaves holding only a handful of samples
- too large:
- risks underfitting
- limits the model's ability to capture fine structure
- the sweet spot:
- every leaf is representative enough
- prediction variance stays under control
- in a medical-diagnosis model, an appropriate value balances sensitivity and specificity
5. Tuning tips:
- plot a learning curve of validation-set performance against the value
- usually tuned together with min_samples_split
- large datasets (>100k samples) may need noticeably larger values
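The int-vs-float behavior above can be verified directly. This sketch fits three trees on a synthetic dataset and uses the apply interface to measure each tree's smallest leaf (all names and values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default min_samples_leaf=1 allows single-sample leaves
loose = DecisionTreeClassifier(random_state=0).fit(X, y)
# min_samples_leaf=5: every leaf must keep at least 5 training samples
tight = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)
# A float is read as a fraction of n_samples: 0.05 -> ceil(0.05 * 500) = 25
frac = DecisionTreeClassifier(min_samples_leaf=0.05, random_state=0).fit(X, y)

def smallest_leaf(clf, X):
    """Size of the smallest leaf, measured on the training data."""
    counts = np.bincount(clf.apply(X))  # apply() maps each sample to a leaf index
    return counts[counts > 0].min()

print(smallest_leaf(loose, X), smallest_leaf(tight, X), smallest_leaf(frac, X))
print(loose.get_n_leaves(), tight.get_n_leaves(), frac.get_n_leaves())
```

The leaf-size floor holds by construction: the tight tree never produces a leaf under 5 training samples, and the fractional tree never under 25.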
min_samples_split
A node may be split only if it contains at least min_samples_split training samples; otherwise it becomes a leaf.
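A quick way to check this rule, on a synthetic dataset (illustrative values): every internal node of the fitted tree must hold at least min_samples_split samples, which the tree_ structure exposes directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = DecisionTreeClassifier(min_samples_split=40, random_state=0).fit(X, y)

t = clf.tree_
# Internal nodes are those with a left child (-1 marks a leaf);
# each internal node was split, so it must hold >= 40 samples
internal = t.children_left != -1
print(t.n_node_samples[internal].min())
```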
max_features
max_features is an important hyperparameter that controls how many features are considered when splitting each node. Specifically:
1. Mechanism:
- at every node split, the algorithm draws a random subset of at most max_features features to evaluate
- features outside the subset are ignored for that split
- the default for both DecisionTreeClassifier and DecisionTreeRegressor is None (all features); sqrt(n_features) is the convention for classification in random forests
2. Difference from max_depth:
- max_depth curbs overfitting by limiting the tree's vertical depth
- max_features curbs it by shrinking the feature search space
- both are pre-pruning strategies, just along different dimensions
3. Usage notes:
- worth setting when the feature dimension is high (e.g., >50)
- too low a value can underfit; fractions of 0.3-0.8 are common
- when feature importances vary widely, important features may be missed
4. Alternatives:
- PCA (Principal Component Analysis): keeps the directions of largest variance via a linear transform
- ICA (Independent Component Analysis): finds statistically independent representations
- feature selection:
- statistical tests (e.g., chi-square)
- model-based methods (e.g., L1 regularization)
- Recursive Feature Elimination (RFE)
5. Practical examples:
- text data, where word-vector dimensions can reach the thousands
- gene-expression analysis, where features far outnumber samples
- run a feature-importance analysis first, then decide whether the parameter is needed
Best practice: train a baseline on the full feature set, analyze feature importances, and only then choose between max_features and other dimensionality-reduction methods.
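The "baseline first, then restrict" practice might look like this. The dataset is synthetic and max_features="sqrt" is one illustrative choice (the decision tree default is None, i.e., all features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=60, n_informative=6,
                           random_state=0)

# Baseline on the full feature set: default max_features=None evaluates all features
base = DecisionTreeClassifier(random_state=0).fit(X, y)
top = np.argsort(base.feature_importances_)[::-1][:5]
print("top features by importance:", top)

# Restricted tree: each split samples at most sqrt(60) ~= 7 candidate features
sub = DecisionTreeClassifier(max_features="sqrt", random_state=0).fit(X, y)
print("leaves (full vs restricted):", base.get_n_leaves(), sub.get_n_leaves())
```

If the importance mass concentrates on a few features, restricting max_features risks skipping them at some splits, which is exactly the instability the usage notes warn about.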
min_impurity_decrease
Caps the required impurity decrease: a split happens only if it reduces the (weighted) impurity by at least the set value. Introduced in 0.19; before that, min_impurity_split served a similar role.
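A small check of the threshold's effect, on synthetic data (illustrative values): with the same random_state, the thresholded tree is a pruned version of the unconstrained one, so it can only end up with fewer, or equally many, leaves.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# No threshold: splits continue as long as any impurity decrease is possible
free = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
# Keep only splits whose weighted impurity decrease is at least 0.01
pruned = DecisionTreeClassifier(criterion="entropy",
                                min_impurity_decrease=0.01,
                                random_state=0).fit(X, y)
print("leaves:", free.get_n_leaves(), "->", pruned.get_n_leaves())
```

Raising the threshold gradually from 0.0 while watching a learning curve (as the error table below suggests) locates the point where pruning starts hurting accuracy.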
Confirm Optimal Pruning Parameters
How do we decide what value to give each parameter? A hyperparameter learning curve answers this; we keep using the already-trained decision tree clf.
A hyperparameter learning curve plots hyperparameter values on the x-axis against a model metric on the y-axis, measuring how performance shifts as the value changes. For our tree, the metric is score.
test = []
for i in range(10):
    clf = tree.DecisionTreeClassifier(max_depth=i + 1,
                                      criterion="entropy",
                                      random_state=30,
                                      splitter="random")
    clf = clf.fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)  # test-set score at this max_depth
    test.append(score)
plt.plot(range(1, 11), test, color="red", label="Learning Curve")
plt.ylabel("score")
plt.xlabel("max_depth")
plt.legend()
plt.show()
Final Summary
Thinking:
- Will pruning parameters necessarily improve test-set performance? There is no absolute answer in tuning; everything depends on the data itself
- Either way, the default values let the tree grow unchecked; on some datasets such trees become very large and consume huge amounts of memory
Attributes are the properties of a trained model that you can inspect. For decision trees the most important is feature_importances_, which gives each feature's importance to the model. Many sklearn interfaces look alike: we have already used fit and score, which nearly every estimator provides. Beyond those two, the tree's most common interfaces are apply and predict.
- apply takes the test set and returns the index of the leaf each sample lands in
- predict takes the test set and returns the predicted label for each sample, in a directly readable form
One more thing: every interface that takes Xtrain or Xtest requires the feature matrix to be at least two-dimensional; sklearn rejects one-dimensional arrays as feature input. If your data truly has a single feature, use reshape(-1,1) to add the dimension.
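A minimal illustration of the reshape(-1, 1) fix, with a hypothetical one-feature dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x = np.array([1.0, 2.0, 3.0, 4.0])   # one feature, shape (4,) -- 1-D
y = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0)
# clf.fit(x, y) would raise: "Expected 2D array, got 1D array instead"
X = x.reshape(-1, 1)                  # shape (4, 1): 4 samples, 1 feature
clf.fit(X, y)
print(clf.predict(np.array([[1.5], [3.5]])))
```

Note that prediction inputs must be 2-D as well, hence the nested brackets in the predict call.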
Sample Imbalance Problem
For classification, one problem is inescapable: class imbalance. Imbalance means that in a dataset one label naturally dominates, yet the class we need to capture is rare. Suppose we want to separate potential criminals from ordinary people: the positives may be only 2% of the data, the other 98% ordinary people, but our goal is precisely to catch that 2%. Such a label distribution causes several problems.
First, classifiers naturally lean toward the majority class: it is easier to get right, and the minority class gets sacrificed. The majority label simply carries more information, so the algorithm relies on what it learned from the majority when judging. A model built this way fails at capturing the minority class.
Second, the usual evaluation metric loses meaning. Even a model that does nothing and labels everyone a non-criminal scores very high accuracy. Accuracy therefore cannot measure progress toward the real goal of identifying criminals.
So we need the algorithm to recognize that the labels are imbalanced, either by penalizing mistakes on the minority class or by altering the samples themselves, steering the model toward capturing the minority.
Over- and undersampling can achieve this; the best-known oversampling method is SMOTE, which synthesizes new minority samples by recombining minority-class features. But sampling changes the total sample count, and for decision trees sample count strongly affects training speed, so we would rather not inflate it. Instead we take another route: switch to evaluation metrics that focus on the minority class, and optimize the model against those.
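The "accuracy becomes meaningless" point is easy to demonstrate on synthetic 98/2 data (values are illustrative): a do-nothing model that predicts the majority class for everyone gets high accuracy but zero recall on the minority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# 98% / 2% class split, mirroring the "potential criminals" example
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02],
                           flip_y=0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# A "model" that predicts the majority class for everyone
majority = np.zeros_like(yte)
print("accuracy:", accuracy_score(yte, majority))   # high, but useless
print("recall:  ", recall_score(yte, majority))     # 0: no minority sample caught
```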
class_weight
Decision trees offer two knobs for class balance: the class_weight parameter and the sample_weight argument of the fit interface. class_weight defaults to None, which assumes the dataset's labels are balanced, i.e., treats the class ratio as 1:1. When the samples are imbalanced, you can pass:
{
"label value 1": weight1,
"label value 2": weight2
}
This dictionary input (or the "balanced" mode) tells the algorithm the classes are imbalanced. Once weights are in play, a node's "sample count" is no longer a simple record count but a weighted total, so pruning should be paired with min_weight_fraction_leaf, the weight-based pruning parameter.
Note also that weight-based pruning criteria such as min_weight_fraction_leaf are less biased toward the dominant class than criteria unaware of sample weights, such as min_samples_leaf. When samples are weighted, weight-based pre-pruning makes it easier to shape the tree so that every leaf holds at least a small fraction of the total sample weight.
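A sketch of class_weight paired with min_weight_fraction_leaf, on synthetic imbalanced data (the specific values, including the 0.01 fraction, are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           flip_y=0, class_sep=0.8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Unweighted tree: tends to favor the majority class
plain = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
# "balanced" rescales class weights by n_samples / (n_classes * bincount(y));
# a dict like {0: 1, 1: 10} would set explicit ratios instead.
# min_weight_fraction_leaf prunes by weight, matching the reweighted samples.
weighted = DecisionTreeClassifier(class_weight="balanced",
                                  min_weight_fraction_leaf=0.01,
                                  random_state=0).fit(Xtr, ytr)

print("minority recall, plain:   ", recall_score(yte, plain.predict(Xte)))
print("minority recall, weighted:", recall_score(yte, weighted.predict(Xte)))
```

Whether the weighted tree actually wins depends on the data; the point is to compare the two on a minority-focused metric (recall here) rather than accuracy.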
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| High training score, low test score | Tree too deep / leaves too fine-grained: classic overfitting | Limit max_depth; raise min_samples_leaf (start around 5); raise min_samples_split too if needed |
| Model uses lots of memory, slow to train/infer | Default parameters let the tree grow fully unpruned | Control complexity via max_depth / min_samples_leaf / min_samples_split; limit max_leaf_nodes if needed |
| Passing min_impurity_split raises TypeError | Parameter deprecated and later removed | Use min_impurity_decrease; record the sklearn version so code and environment stay consistent |
| min_impurity_decrease has no effect / tree barely splits | Threshold too large, suppressing all splits | Start from 0.0 and raise gradually; find the knee with a learning curve; don't crank the threshold up alone |
| Tuning curve jitters, results not reproducible | random_state not fixed, or splitter="random" adds randomness | Fix random_state; compare the variance of splitter="best" vs "random" |
| Under imbalance, accuracy is high but the minority class is missed | Wrong metric; model biased toward the majority | Use class_weight="balanced" or a ratio dict; evaluate with recall, PR-AUC, and other minority-focused metrics |
| "Expected 2D array, got 1D array" | X does not meet sklearn's 2-D input convention | reshape(-1,1) for a single feature; make sure Xtrain/Xtest are two-dimensional |
| Setting max_features makes results worse or unstable | Feature subset too small (underfitting); or the search inspects extra features to find a valid split | Raise max_features or revert to the default; in high-dimensional sparse settings control complexity with max_depth / min_samples_leaf first, then try max_features |