Gradient Boosting
Gradient boosting is a refinement of the boosting tree algorithm, so before explaining the gradient boosting tree, let's first look at the boosting tree itself.
An example to build intuition: suppose someone is 30 years old. We first fit with 20 and find the loss is 10 years; we then fit the remaining loss with 6, leaving a gap of 4 years; in the third round we fit the remaining gap with 3, leaving a gap of only 1 year. If the iteration budget is not exhausted, we can keep iterating, and each round's fitting error shrinks; the model's final output is the sum of all the fitted values.
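The age example above can be sketched in a few lines (the per-round values 20, 6, 3 are taken directly from the example; this is purely illustrative, not a learning algorithm):

```python
# Each round fits the remaining residual; the model's prediction
# is the running sum of the fitted values.
target = 30.0
fits = [20.0, 6.0, 3.0]  # the value chosen in each round to fit the residual

prediction = 0.0
for round_num, fit in enumerate(fits, start=1):
    prediction += fit
    residual = target - prediction
    print(f"round {round_num}: prediction={prediction}, residual={residual}")
# The residual shrinks 10 -> 4 -> 1, and the final output is 20 + 6 + 3 = 29.
```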
Boosting Tree Algorithm:
Initialize f₀(x) = 0
For m = 1, 2, …, M, calculate the residuals rₘᵢ = yᵢ − fₘ₋₁(xᵢ), i = 1, 2, …, N
Fit a regression tree to the residuals rₘᵢ to obtain hₘ(x)
Update fₘ(x) = fₘ₋₁(x) + hₘ(x)
Obtain the final boosting tree for the regression problem
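The steps above can be sketched as a toy implementation using sklearn regression trees as the weak learners (a sketch for illustration, not a library API; the 1-D sine dataset and all parameters here are made-up assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Initialize f0(x) = 0
prediction = np.zeros_like(y)
trees = []

# For m = 1..M: fit a shallow tree to the residual, then add it in
for m in range(50):
    residual = y - prediction                                   # r_mi = y_i - f_{m-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # h_m(x)
    prediction += tree.predict(X)                               # f_m = f_{m-1} + h_m
    trees.append(tree)

mse = np.mean((y - prediction) ** 2)
print(f"training MSE after {len(trees)} rounds: {mse:.4f}")
```

Each tree alone is a weak fit, but because every round targets what the previous sum still gets wrong, the training error drops steadily.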
What is the residual in the pseudocode above?
In the boosting tree algorithm:
Assume the strong learner obtained from the previous iteration is fₜ₋₁(x),
and the loss function is L(y, fₜ₋₁(x)).
This iteration's goal is to find a weak learner hₜ(x)
that minimizes this round's loss: L(y, fₜ(x)) = L(y, fₜ₋₁(x) + hₜ(x))
When the squared loss function is used, minimizing this loss amounts to fitting r = y − fₜ₋₁(x), the residual of the current model on the data. So the boosting tree simply needs to fit the current model's residual at each step.
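The squared-loss case can be written out explicitly (a short derivation using the symbols above):

```latex
L\bigl(y,\, f_{t-1}(x) + h_t(x)\bigr)
  = \bigl(y - f_{t-1}(x) - h_t(x)\bigr)^2
  = \bigl(r - h_t(x)\bigr)^2,
  \qquad r = y - f_{t-1}(x)
```

so minimizing this round's loss is exactly a least-squares fit of hₜ(x) to the residual r.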
When the loss function is squared loss or exponential loss, each optimization step in the boosting tree is straightforward, but for general loss functions each step is often not easy. To address this, Friedman proposed the gradient boosting tree algorithm, a steepest-descent approximation whose key idea is to use the negative gradient of the loss function as an approximation to the residual in the boosting tree.
What does the negative gradient look like?
The negative gradient of the loss function for the i-th sample in round t is:
rₜᵢ = −[∂L(yᵢ, f(xᵢ)) / ∂f(xᵢ)], evaluated at f(x) = fₜ₋₁(x)
Different loss functions yield different negative gradients; with squared loss, the negative gradient is exactly the residual. So for regression problems we fit the residual. What about classification? For both binary and multi-class classification, the loss function is log loss.
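A small sketch comparing the two negative gradients just mentioned: squared loss yields the plain residual, while binary log loss (on raw scores) yields y − p, a residual on the probability scale. The function names and sample values are made up for illustration:

```python
import numpy as np

def neg_grad_squared(y, f):
    # L = 0.5 * (y - f)^2  ->  negative gradient -dL/df = y - f (the residual)
    return y - f

def neg_grad_logloss(y, f):
    # f is the raw score (log-odds); L = -[y*log(p) + (1-y)*log(1-p)]
    # with p = sigmoid(f), the negative gradient is y - p
    p = 1.0 / (1.0 + np.exp(-f))
    return y - p

# Regression: matches the age example (target 30, current model predicts 20)
y_reg, f_reg = np.array([30.0]), np.array([20.0])
print(neg_grad_squared(y_reg, f_reg))   # [10.]

# Classification: at raw score 0, p = 0.5 for both samples
y_cls, f_cls = np.array([1.0, 0.0]), np.array([0.0, 0.0])
print(neg_grad_logloss(y_cls, f_cls))   # [ 0.5 -0.5]
```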
GBDT Algorithm Principle
Combining these two parts, Decision Tree and Gradient Boosting, gives GBDT.
GBDT (Gradient Boosting Decision Tree) is an ensemble learning algorithm built from a stagewise additive model optimized by gradient descent.
- Boosting: Train multiple weak learners (usually shallow decision trees) in sequence, where each new model specifically compensates for the errors of the previous ones.
- Gradient: Formalize "how to correct errors" as gradient descent on the loss function; each step moves a small step in the negative-gradient direction in function space.
- Decision Tree: The most commonly used weak learner is the CART regression tree (which can handle regression, or classification with proper encoding).
Initialize the weak learner (for squared loss, the constant that minimizes the loss, i.e. the mean of the targets)
For m = 1, 2, …, M:
For each sample i = 1, 2, …, N, calculate the negative gradient (i.e., the residual)
Use the residuals from the previous step as the new target values, and fit the next tree to the data (xᵢ, rᵢₘ), i = 1, 2, …, N, obtaining a new regression tree fₘ(x) whose leaf-node regions are Rⱼₘ, j = 1, 2, …, J, where J is the number of leaf nodes of the m-th regression tree.
For each leaf region j = 1, 2, …, J, calculate the best-fit value
Update the strong learner
Obtain the final learner
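The steps above can be sketched for a loss where the negative gradient and the best-fit leaf value genuinely differ; here absolute loss is used, so the trees are fit to the sign of the residual and each leaf's best-fit value is the median of the raw residuals in that leaf (a toy sketch; the dataset, depth, learning rate, and round count are all made-up assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=300)

# Initialize: the constant minimizing absolute loss is the median of y
prediction = np.full_like(y, np.median(y))
learning_rate = 0.1

for m in range(100):
    # Negative gradient of absolute loss: sign(y - f)
    neg_grad = np.sign(y - prediction)
    tree = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)
    # Best-fit leaf values: per-leaf median of the raw residuals,
    # not the values the tree itself predicts
    leaf_ids = tree.apply(X)
    leaf_value = {leaf: np.median((y - prediction)[leaf_ids == leaf])
                  for leaf in np.unique(leaf_ids)}
    # Update the strong learner, damped by the learning rate
    prediction += learning_rate * np.array([leaf_value[l] for l in leaf_ids])

mae = np.mean(np.abs(y - prediction))
print(f"training MAE: {mae:.3f}")
```

The tree only decides the partition; the leaf outputs are re-optimized against the true loss, which is exactly the "calculate best fit value" step.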
Why It Works Well
- Arbitrary differentiable loss: handles regression, binary classification, multi-class classification, ranking, and anomaly detection (with custom losses) in one unified framework
- High-order nonlinearity: trees naturally capture feature interactions, so no explicit feature crossing is needed
- Scale invariance: features of different magnitudes need no standardization
- Interpretability: the tree structure supports feature-importance scores and SHAP explanations
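The interpretability point can be illustrated with sklearn's gradient boosting implementation and its impurity-based feature importances (a sketch; the synthetic dataset and all parameters are made-up assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: with shuffle=False, only the first 3 of 10
# features are informative; the rest are noise
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                   random_state=0).fit(X, y)

# Importances are derived from how much each feature's splits
# reduce impurity, summed over all trees in the ensemble
for i, imp in enumerate(model.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

The informative features should dominate the importance scores, while the noise features stay near zero.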
Common Efficient Implementations
- XGBoost: second-order Taylor approximation of the loss, sparsity-aware split finding, block-based and approximate histogram split finding, L1/L2 regularization
- LightGBM: GOSS (gradient-based one-side sampling) for rows + EFB (exclusive feature bundling) for features, leaf-wise tree growth, high speed and low memory use
- CatBoost: ordered target statistics for categorical features, built-in symmetric (oblivious) trees, text feature support
All of them add substantial engineering and algorithmic acceleration on top of the classic GBDT framework, but the core idea (stacking shallow trees fit to negative gradients) is unchanged.
When to Use / When Not to Use
- Suitable: structured tabular data, complex nonlinear relationships among features, and cases where you want to avoid heavy feature engineering
- Not suitable: high-dimensional sparse text, images, speech, and other scenarios where deep learning dominates; when data volume is huge but features are simple, linear models are faster
Code Example
```python
from xgboost import XGBClassifier

# Assumes X_train, y_train, X_val, y_val are already defined.
# Note: since xgboost 1.6, eval_metric and early_stopping_rounds belong in
# the constructor; passing early_stopping_rounds to fit() is removed in 2.0.
model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,            # row sampling per tree
    colsample_bytree=0.8,     # feature sampling per tree
    objective='binary:logistic',
    eval_metric='auc',
    early_stopping_rounds=50,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```