Big Data 203 - sklearn Decision Tree Pruning Parameters
TL;DR
- Scenario: DecisionTreeClassifier overfits, the tree grows too large (memory spikes), and class imbalance calls for controlled pruning and weighting
- Conclusion: start with max_depth + min_samples_leaf as the baseline; from 0.19 onward use min_impurity_decrease in place of min_impurity_split
- Output: learning-curve tuning flow (score comparison) + version matrix + quick error reference
Version Matrix
| Parameter | Version/Description |
|---|---|
| max_depth | 0.17 / 1.8 (docs). With the default None the tree grows until leaves are pure or min_samples_split blocks further splits; in practice this is the first parameter to use to truncate the tree, cutting both overfitting and memory use |
| min_samples_leaf | 1.8 (docs). Minimum samples a leaf must hold, suppressing leaves that rest on very few samples; the docs suggest starting around 5, and both int and float (fraction) are accepted |
| min_samples_split | 0.17 / 1.8 (docs). Minimum samples a node must hold before it may split; also accepts a float (fraction), which adapts to datasets of different sizes |
| max_features | 0.17 (docs). Size of the feature subset evaluated at each split; note the search may effectively inspect more than max_features features to find a valid split (older docs state this explicitly) |
| min_impurity_decrease | 0.19+ / 1.8 (docs). Introduced in 0.19: a node splits only if the weighted impurity decrease is at least this threshold; replaces min_impurity_split |
| min_impurity_split | Deprecated in 0.19 (docs); removed in later versions, where passing it raises TypeError "unexpected keyword" (migrate to min_impurity_decrease) |
| class_weight | 1.8 (docs). Accepts a dict or "balanced"; changes each class's effective weight under imbalance, which in turn changes what weight-based pruning thresholds actually mean |
Pruning Parameters
max_depth
Limiting the tree's maximum depth is one of the most commonly used pruning parameters; it controls how far the tree grows and guards against overfitting. Concretely, you set a depth threshold, and once the tree reaches it, no further splits are made.
This parameter is especially effective with high-dimensional features and few samples, because each additional level in principle demands exponentially more data:
- at depth 2, at least 4 samples may be needed to support every split
- at depth 3, the demand rises to 8
- at depth 4, to 16
- and so on: each extra level doubles the sample requirement. Limiting depth therefore keeps the model from memorizing noise when data is scarce.
In practice this parameter is especially useful for:
- a single decision tree, to prevent overfitting
- ensemble methods (Random Forest, GBDT), to bound base-learner complexity
- high-dimensional sparse data, to preserve generalization
Suggested workflow:
- start with max_depth=3 as the baseline
- watch performance on a validation set
- if the model underfits, raise the depth step by step (4, then 5)
- re-evaluate after every change
- depths of 3-8 are usually reasonable; beyond 10, overfitting is common
Note: the best depth depends on the dataset, so confirm it with cross-validation, and combine it with other pruning parameters (such as min_samples_split) for better results.
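A minimal sketch of the baseline-then-loosen workflow above, using a synthetic high-dimensional, small-sample dataset from make_classification (the dataset and all values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: many features, few samples -- the regime where depth limits help
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, random_state=0)

# Unpruned baseline: grows until leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(Xtrain, Ytrain)
# Depth-limited tree: start from max_depth=3 and loosen only if it underfits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xtrain, Ytrain)

print("full   train/test:", full.score(Xtrain, Ytrain), full.score(Xtest, Ytest))
print("depth3 train/test:", shallow.score(Xtrain, Ytrain), shallow.score(Xtest, Ytest))
print("depths:", full.get_depth(), shallow.get_depth())
```

The unpruned tree will reach a perfect training score while the depth-3 tree trades a little training accuracy for a much smaller model; compare the two test scores before deciding whether to raise the depth.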
min_samples_leaf
min_samples_leaf is a key hyperparameter that constrains how the tree branches. Specifically:
1. Branch constraint:
- While growing the tree, a candidate split is only considered if each resulting child node would contain at least min_samples_leaf samples
- When a candidate violates this, one of two things happens:
- that split is rejected outright, or
- the algorithm picks a different split threshold so that both children satisfy the minimum
- Example: with min_samples_leaf=10, any split that would leave a child with fewer than 10 samples is blocked
2. Synergy with other parameters:
- usually combined with max_depth
- in regression trees the pair works particularly well:
- it smooths the predictions
- it keeps the tree from growing deep enough to overfit
- example: in house-price prediction, the combination reduces the volatility of predicted values
3. Setting suggestions:
- start from 5
- for uneven sample distributions:
- pass a float to express a fraction (e.g., 0.1 means 10% of samples)
- this adapts automatically to datasets of different sizes
- practical scenarios:
- in a credit-risk model, a value of 50 can avoid overfitting to small subpopulations
- in an e-commerce recommender, 0.05 can preserve coverage of long-tail items
4. Effect of the value:
- too small (e.g., 1):
- risks overfitting
- produces many leaves holding only a handful of samples
- too large:
- risks underfitting
- limits the model's ability to capture fine structure
- the sweet spot:
- every leaf is representative enough
- prediction variance stays under control
- in a medical-diagnosis model, an appropriate value balances sensitivity and specificity
5. Tuning tips:
- plot a learning curve of validation-set performance against the value
- usually tuned together with min_samples_split
- large datasets (>100k samples) may need noticeably larger values
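The int-vs-float behavior above can be verified directly. This sketch fits three trees on a synthetic dataset and uses the apply interface to measure each tree's smallest leaf (all names and values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default min_samples_leaf=1 allows single-sample leaves
loose = DecisionTreeClassifier(random_state=0).fit(X, y)
# min_samples_leaf=5: every leaf must keep at least 5 training samples
tight = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)
# A float is read as a fraction of n_samples: 0.05 -> ceil(0.05 * 500) = 25
frac = DecisionTreeClassifier(min_samples_leaf=0.05, random_state=0).fit(X, y)

def smallest_leaf(clf, X):
    """Size of the smallest leaf, measured on the training data."""
    counts = np.bincount(clf.apply(X))  # apply() maps each sample to a leaf index
    return counts[counts > 0].min()

print(smallest_leaf(loose, X), smallest_leaf(tight, X), smallest_leaf(frac, X))
print(loose.get_n_leaves(), tight.get_n_leaves(), frac.get_n_leaves())
```

The leaf-size floor holds by construction: the tight tree never produces a leaf under 5 training samples, and the fractional tree never under 25.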
min_samples_split
A node may be split only if it contains at least min_samples_split training samples; otherwise it becomes a leaf.
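A quick way to check this rule, on a synthetic dataset (illustrative values): every internal node of the fitted tree must hold at least min_samples_split samples, which the tree_ structure exposes directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = DecisionTreeClassifier(min_samples_split=40, random_state=0).fit(X, y)

t = clf.tree_
# Internal nodes are those with a left child (-1 marks a leaf);
# each internal node was split, so it must hold >= 40 samples
internal = t.children_left != -1
print(t.n_node_samples[internal].min())
```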
max_features
max_features is an important hyperparameter that controls how many features are considered when splitting each node. Specifically:
1. Mechanism:
- at every node split, the algorithm draws a random subset of at most max_features features to evaluate
- features outside the subset are ignored for that split
- the default for both DecisionTreeClassifier and DecisionTreeRegressor is None (all features); sqrt(n_features) is the convention for classification in random forests
2. Difference from max_depth:
- max_depth curbs overfitting by limiting the tree's vertical depth
- max_features curbs it by shrinking the feature search space
- both are pre-pruning strategies, just along different dimensions
3. Usage notes:
- worth setting when the feature dimension is high (e.g., >50)
- too low a value can underfit; fractions of 0.3-0.8 are common
- when feature importances vary widely, important features may be missed
4. Alternatives:
- PCA (Principal Component Analysis): keeps the directions of largest variance via a linear transform
- ICA (Independent Component Analysis): finds statistically independent representations
- feature selection:
- statistical tests (e.g., chi-square)
- model-based methods (e.g., L1 regularization)
- Recursive Feature Elimination (RFE)
5. Practical examples:
- text data, where word-vector dimensions can reach the thousands
- gene-expression analysis, where features far outnumber samples
- run a feature-importance analysis first, then decide whether the parameter is needed
Best practice: train a baseline on the full feature set, analyze feature importances, and only then choose between max_features and other dimensionality-reduction methods.
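The "baseline first, then restrict" practice might look like this. The dataset is synthetic and max_features="sqrt" is one illustrative choice (the decision tree default is None, i.e., all features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=60, n_informative=6,
                           random_state=0)

# Baseline on the full feature set: default max_features=None evaluates all features
base = DecisionTreeClassifier(random_state=0).fit(X, y)
top = np.argsort(base.feature_importances_)[::-1][:5]
print("top features by importance:", top)

# Restricted tree: each split samples at most sqrt(60) ~= 7 candidate features
sub = DecisionTreeClassifier(max_features="sqrt", random_state=0).fit(X, y)
print("leaves (full vs restricted):", base.get_n_leaves(), sub.get_n_leaves())
```

If the importance mass concentrates on a few features, restricting max_features risks skipping them at some splits, which is exactly the instability the usage notes warn about.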
min_impurity_decrease
Caps the required impurity decrease: a split happens only if it reduces the (weighted) impurity by at least the set value. Introduced in 0.19; before that, min_impurity_split served a similar role.
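A small check of the threshold's effect, on synthetic data (illustrative values): with the same random_state, the thresholded tree is a pruned version of the unconstrained one, so it can only end up with fewer, or equally many, leaves.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# No threshold: splits continue as long as any impurity decrease is possible
free = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
# Keep only splits whose weighted impurity decrease is at least 0.01
pruned = DecisionTreeClassifier(criterion="entropy",
                                min_impurity_decrease=0.01,
                                random_state=0).fit(X, y)
print("leaves:", free.get_n_leaves(), "->", pruned.get_n_leaves())
```

Raising the threshold gradually from 0.0 while watching a learning curve (as the error table below suggests) locates the point where pruning starts hurting accuracy.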
Confirm Optimal Pruning Parameters
How do we decide what value to give each parameter? A hyperparameter learning curve answers this; we keep using the already-trained decision tree clf.
A hyperparameter learning curve plots hyperparameter values on the x-axis against a model metric on the y-axis, measuring how performance shifts as the value changes. For our tree, the metric is score.
test = []
for i in range(10):
    clf = tree.DecisionTreeClassifier(max_depth=i + 1,
                                      criterion="entropy",
                                      random_state=30,
                                      splitter="random")
    clf = clf.fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)  # test-set score at this max_depth
    test.append(score)
plt.plot(range(1, 11), test, color="red", label="Learning Curve")
plt.ylabel("score")
plt.xlabel("max_depth")
plt.legend()
plt.show()
Final Summary
Thinking:
- Will pruning parameters necessarily improve test-set performance? There is no absolute answer in tuning; everything depends on the data itself
- Either way, the default values let the tree grow unchecked; on some datasets such trees become very large and consume huge amounts of memory
Attributes are the properties of a trained model that you can inspect. For decision trees the most important is feature_importances_, which gives each feature's importance to the model. Many sklearn interfaces look alike: we have already used fit and score, which nearly every estimator provides. Beyond those two, the tree's most common interfaces are apply and predict.
- apply takes the test set and returns the index of the leaf each sample lands in
- predict takes the test set and returns the predicted label for each sample, in a directly readable form
One more thing: every interface that takes Xtrain or Xtest requires the feature matrix to be at least two-dimensional; sklearn rejects one-dimensional arrays as feature input. If your data truly has a single feature, use reshape(-1,1) to add the dimension.
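A minimal illustration of the reshape(-1, 1) fix, with a hypothetical one-feature dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x = np.array([1.0, 2.0, 3.0, 4.0])   # one feature, shape (4,) -- 1-D
y = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0)
# clf.fit(x, y) would raise: "Expected 2D array, got 1D array instead"
X = x.reshape(-1, 1)                  # shape (4, 1): 4 samples, 1 feature
clf.fit(X, y)
print(clf.predict(np.array([[1.5], [3.5]])))
```

Note that prediction inputs must be 2-D as well, hence the nested brackets in the predict call.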
Sample Imbalance Problem
For classification, one problem is inescapable: class imbalance. Imbalance means that in a dataset one label naturally dominates, yet the class we need to capture is rare. Suppose we want to separate potential criminals from ordinary people: the positives may be only 2% of the data, the other 98% ordinary people, but our goal is precisely to catch that 2%. Such a label distribution causes several problems.
First, classifiers naturally lean toward the majority class: it is easier to get right, and the minority class gets sacrificed. The majority label simply carries more information, so the algorithm relies on what it learned from the majority when judging. A model built this way fails at capturing the minority class.
Second, the usual evaluation metric loses meaning. Even a model that does nothing and labels everyone a non-criminal scores very high accuracy. Accuracy therefore cannot measure progress toward the real goal of identifying criminals.
So we need the algorithm to recognize that the labels are imbalanced, either by penalizing mistakes on the minority class or by altering the samples themselves, steering the model toward capturing the minority.
Over- and undersampling can achieve this; the best-known oversampling method is SMOTE, which synthesizes new minority samples by recombining minority-class features. But sampling changes the total sample count, and for decision trees sample count strongly affects training speed, so we would rather not inflate it. Instead we take another route: switch to evaluation metrics that focus on the minority class, and optimize the model against those.
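The "accuracy becomes meaningless" point is easy to demonstrate on synthetic 98/2 data (values are illustrative): a do-nothing model that predicts the majority class for everyone gets high accuracy but zero recall on the minority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# 98% / 2% class split, mirroring the "potential criminals" example
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02],
                           flip_y=0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# A "model" that predicts the majority class for everyone
majority = np.zeros_like(yte)
print("accuracy:", accuracy_score(yte, majority))   # high, but useless
print("recall:  ", recall_score(yte, majority))     # 0: no minority sample caught
```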
class_weight
Decision trees offer two knobs for class balance: the class_weight parameter and the sample_weight argument of the fit interface. class_weight defaults to None, which assumes the dataset's labels are balanced, i.e., treats the class ratio as 1:1. When the samples are imbalanced, you can pass:
{
"label value 1": weight1,
"label value 2": weight2
}
This dictionary input (or the "balanced" mode) tells the algorithm the classes are imbalanced. Once weights are in play, a node's "sample count" is no longer a simple record count but a weighted total, so pruning should be paired with min_weight_fraction_leaf, the weight-based pruning parameter.
Note also that weight-based pruning criteria such as min_weight_fraction_leaf are less biased toward the dominant class than criteria unaware of sample weights, such as min_samples_leaf. When samples are weighted, weight-based pre-pruning makes it easier to shape the tree so that every leaf holds at least a small fraction of the total sample weight.
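A sketch of class_weight paired with min_weight_fraction_leaf, on synthetic imbalanced data (the specific values, including the 0.01 fraction, are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           flip_y=0, class_sep=0.8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Unweighted tree: tends to favor the majority class
plain = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
# "balanced" rescales class weights by n_samples / (n_classes * bincount(y));
# a dict like {0: 1, 1: 10} would set explicit ratios instead.
# min_weight_fraction_leaf prunes by weight, matching the reweighted samples.
weighted = DecisionTreeClassifier(class_weight="balanced",
                                  min_weight_fraction_leaf=0.01,
                                  random_state=0).fit(Xtr, ytr)

print("minority recall, plain:   ", recall_score(yte, plain.predict(Xte)))
print("minority recall, weighted:", recall_score(yte, weighted.predict(Xte)))
```

Whether the weighted tree actually wins depends on the data; the point is to compare the two on a minority-focused metric (recall here) rather than accuracy.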
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| High training score, low test score | Tree too deep / leaves too fine-grained: classic overfitting | Limit max_depth; raise min_samples_leaf (start around 5); raise min_samples_split too if needed |
| Model uses lots of memory, slow to train/infer | Default parameters let the tree grow fully unpruned | Control complexity via max_depth / min_samples_leaf / min_samples_split; limit max_leaf_nodes if needed |
| Passing min_impurity_split raises TypeError | Parameter deprecated and later removed | Use min_impurity_decrease; record the sklearn version so code and environment stay consistent |
| min_impurity_decrease has no effect / tree barely splits | Threshold too large, suppressing all splits | Start from 0.0 and raise gradually; find the knee with a learning curve; don't crank the threshold up alone |
| Tuning curve jitters, results not reproducible | random_state not fixed, or splitter="random" adds randomness | Fix random_state; compare the variance of splitter="best" vs "random" |
| Under imbalance, accuracy is high but the minority class is missed | Wrong metric; model biased toward the majority | Use class_weight="balanced" or a ratio dict; evaluate with recall, PR-AUC, and other minority-focused metrics |
| "Expected 2D array, got 1D array" | X does not meet sklearn's 2-D input convention | reshape(-1,1) for a single feature; make sure Xtrain/Xtest are two-dimensional |
| Setting max_features makes results worse or unstable | Feature subset too small (underfitting); or the search inspects extra features to find a valid split | Raise max_features or revert to the default; in high-dimensional sparse settings control complexity with max_depth / min_samples_leaf first, then try max_features |