Gradient Boosting
Gradient boosting is a refinement of the boosting tree algorithm, so before explaining the gradient boosting tree, let's first look at the boosting tree itself.
An example to build intuition: suppose someone is 30 years old. We first fit with 20 and find the loss is 10 years; we then fit the remaining loss with 6, leaving a gap of 4 years; in the third round we fit the remaining gap with 3, leaving a gap of only 1 year. If the iteration budget is not exhausted, we can keep iterating, and each round's fitting error shrinks; the model's final output is the sum of all the fitted values.
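The age example above can be sketched in a few lines (the per-round values 20, 6, 3 are taken directly from the example; this is purely illustrative, not a learning algorithm):

```python
# Each round fits the remaining residual; the model's prediction
# is the running sum of the fitted values.
target = 30.0
fits = [20.0, 6.0, 3.0]  # the value chosen in each round to fit the residual

prediction = 0.0
for round_num, fit in enumerate(fits, start=1):
    prediction += fit
    residual = target - prediction
    print(f"round {round_num}: prediction={prediction}, residual={residual}")
# The residual shrinks 10 -> 4 -> 1, and the final output is 20 + 6 + 3 = 29.
```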
Boosting Tree Algorithm:
Initialize f₀(x) = 0
For m = 1, 2, …, M, calculate the residuals rₘᵢ = yᵢ − fₘ₋₁(xᵢ), i = 1, 2, …, N
Fit a regression tree to the residuals rₘᵢ to obtain hₘ(x)
Update fₘ(x) = fₘ₋₁(x) + hₘ(x)
Obtain the final boosting tree for the regression problem
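The steps above can be sketched as a toy implementation using sklearn regression trees as the weak learners (a sketch for illustration, not a library API; the 1-D sine dataset and all parameters here are made-up assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Initialize f0(x) = 0
prediction = np.zeros_like(y)
trees = []

# For m = 1..M: fit a shallow tree to the residual, then add it in
for m in range(50):
    residual = y - prediction                                   # r_mi = y_i - f_{m-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # h_m(x)
    prediction += tree.predict(X)                               # f_m = f_{m-1} + h_m
    trees.append(tree)

mse = np.mean((y - prediction) ** 2)
print(f"training MSE after {len(trees)} rounds: {mse:.4f}")
```

Each tree alone is a weak fit, but because every round targets what the previous sum still gets wrong, the training error drops steadily.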
What is the residual in the pseudocode above?
In the boosting tree algorithm:
Assume the strong learner obtained from the previous iteration is fₜ₋₁(x),
and the loss function is L(y, fₜ₋₁(x)).
This iteration's goal is to find a weak learner hₜ(x)
that minimizes this round's loss: L(y, fₜ(x)) = L(y, fₜ₋₁(x) + hₜ(x))
When the squared loss function is used, minimizing this loss amounts to fitting r = y − fₜ₋₁(x), the residual of the current model on the data. So the boosting tree simply needs to fit the current model's residual at each step.
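The squared-loss case can be written out explicitly (a short derivation using the symbols above):

```latex
L\bigl(y,\, f_{t-1}(x) + h_t(x)\bigr)
  = \bigl(y - f_{t-1}(x) - h_t(x)\bigr)^2
  = \bigl(r - h_t(x)\bigr)^2,
  \qquad r = y - f_{t-1}(x)
```

so minimizing this round's loss is exactly a least-squares fit of hₜ(x) to the residual r.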
When the loss function is squared loss or exponential loss, each optimization step in the boosting tree is straightforward, but for general loss functions each step is often not easy. To address this, Friedman proposed the gradient boosting tree algorithm, a steepest-descent approximation whose key idea is to use the negative gradient of the loss function as an approximation to the residual in the boosting tree.
What does the negative gradient look like?
The negative gradient of the loss function for the i-th sample in round t is:
rₜᵢ = −[∂L(yᵢ, f(xᵢ)) / ∂f(xᵢ)], evaluated at f(x) = fₜ₋₁(x)
Different loss functions yield different negative gradients; with squared loss, the negative gradient is exactly the residual. So for regression problems we fit the residual. What about classification? For both binary and multi-class classification, the loss function is log loss.
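A small sketch comparing the two negative gradients just mentioned: squared loss yields the plain residual, while binary log loss (on raw scores) yields y − p, a residual on the probability scale. The function names and sample values are made up for illustration:

```python
import numpy as np

def neg_grad_squared(y, f):
    # L = 0.5 * (y - f)^2  ->  negative gradient -dL/df = y - f (the residual)
    return y - f

def neg_grad_logloss(y, f):
    # f is the raw score (log-odds); L = -[y*log(p) + (1-y)*log(1-p)]
    # with p = sigmoid(f), the negative gradient is y - p
    p = 1.0 / (1.0 + np.exp(-f))
    return y - p

# Regression: matches the age example (target 30, current model predicts 20)
y_reg, f_reg = np.array([30.0]), np.array([20.0])
print(neg_grad_squared(y_reg, f_reg))   # [10.]

# Classification: at raw score 0, p = 0.5 for both samples
y_cls, f_cls = np.array([1.0, 0.0]), np.array([0.0, 0.0])
print(neg_grad_logloss(y_cls, f_cls))   # [ 0.5 -0.5]
```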
GBDT Algorithm Principle
Combining these two parts, Decision Tree and Gradient Boosting, gives GBDT.
GBDT (Gradient Boosting Decision Tree) is an ensemble learning algorithm built from a stagewise additive model optimized by gradient descent.
- Boosting: Train multiple weak learners (usually shallow decision trees) in sequence, where each new model specifically compensates for the errors of the previous ones.
- Gradient: Formalize "how to correct errors" as gradient descent on the loss function; each step moves a small step in the negative-gradient direction in function space.
- Decision Tree: The most commonly used weak learner is the CART regression tree (which can handle regression, or classification with proper encoding).
Initialize the weak learner (for squared loss, the constant that minimizes the loss, i.e. the mean of the targets)
For m = 1, 2, …, M:
For each sample i = 1, 2, …, N, calculate the negative gradient (i.e., the residual)
Use the residuals from the previous step as the new target values, and fit the next tree to the data (xᵢ, rᵢₘ), i = 1, 2, …, N, obtaining a new regression tree fₘ(x) whose leaf-node regions are Rⱼₘ, j = 1, 2, …, J, where J is the number of leaf nodes of the m-th regression tree.
For each leaf region j = 1, 2, …, J, calculate the best-fit value
Update the strong learner
Obtain the final learner
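The steps above can be sketched for a loss where the negative gradient and the best-fit leaf value genuinely differ; here absolute loss is used, so the trees are fit to the sign of the residual and each leaf's best-fit value is the median of the raw residuals in that leaf (a toy sketch; the dataset, depth, learning rate, and round count are all made-up assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=300)

# Initialize: the constant minimizing absolute loss is the median of y
prediction = np.full_like(y, np.median(y))
learning_rate = 0.1

for m in range(100):
    # Negative gradient of absolute loss: sign(y - f)
    neg_grad = np.sign(y - prediction)
    tree = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)
    # Best-fit leaf values: per-leaf median of the raw residuals,
    # not the values the tree itself predicts
    leaf_ids = tree.apply(X)
    leaf_value = {leaf: np.median((y - prediction)[leaf_ids == leaf])
                  for leaf in np.unique(leaf_ids)}
    # Update the strong learner, damped by the learning rate
    prediction += learning_rate * np.array([leaf_value[l] for l in leaf_ids])

mae = np.mean(np.abs(y - prediction))
print(f"training MAE: {mae:.3f}")
```

The tree only decides the partition; the leaf outputs are re-optimized against the true loss, which is exactly the "calculate best fit value" step.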
Why It Works Well
- Arbitrary differentiable loss: handles regression, binary classification, multi-class classification, ranking, and anomaly detection (with custom losses) in one unified framework
- High-order nonlinearity: trees naturally capture feature interactions, so no explicit feature crossing is needed
- Scale invariance: features of different magnitudes need no standardization
- Interpretability: the tree structure supports feature-importance scores and SHAP explanations
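The interpretability point can be illustrated with sklearn's gradient boosting implementation and its impurity-based feature importances (a sketch; the synthetic dataset and all parameters are made-up assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: with shuffle=False, only the first 3 of 10
# features are informative; the rest are noise
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                   random_state=0).fit(X, y)

# Importances are derived from how much each feature's splits
# reduce impurity, summed over all trees in the ensemble
for i, imp in enumerate(model.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

The informative features should dominate the importance scores, while the noise features stay near zero.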
Common Efficient Implementations
- XGBoost: second-order Taylor approximation of the loss, sparsity-aware split finding, block-based and approximate histogram split finding, L1/L2 regularization
- LightGBM: GOSS (gradient-based one-side sampling) for rows + EFB (exclusive feature bundling) for features, leaf-wise tree growth, high speed and low memory use
- CatBoost: ordered target statistics for categorical features, built-in symmetric (oblivious) trees, text feature support
All of them add substantial engineering and algorithmic acceleration on top of the classic GBDT framework, but the core idea (stacking shallow trees fit to negative gradients) is unchanged.
When to Use / When Not to Use
- Suitable: structured tabular data, complex nonlinear relationships among features, and cases where you want to avoid heavy feature engineering
- Not suitable: high-dimensional sparse text, images, speech, and other scenarios where deep learning dominates; when data volume is huge but features are simple, linear models are faster
Code Example
```python
from xgboost import XGBClassifier

# Assumes X_train, y_train, X_val, y_val are already defined.
# Note: since xgboost 1.6, eval_metric and early_stopping_rounds belong in
# the constructor; passing early_stopping_rounds to fit() is removed in 2.0.
model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,            # row sampling per tree
    colsample_bytree=0.8,     # feature sampling per tree
    objective='binary:logistic',
    eval_metric='auc',
    early_stopping_rounds=50,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```