Ensemble Learning

Instead of relying on a single weak model to “solve everything”, ensemble learning constructs a group of complementary base learners and combines them by voting or weighting, using “collective wisdom” to improve generalization, stability, and robustness.

Basic Definition

Ensemble learning builds several models to solve a single prediction problem. It works by generating multiple classifiers/models, each of which learns and predicts independently; these predictions are then combined into one final prediction, which is better than that of any single classifier.
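As a minimal sketch of this idea (the class labels are illustrative), independent predictions can be combined by majority vote:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine independent classifier predictions by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

# Three base classifiers predict the class of one sample
preds = ["square", "circle", "square"]
print(majority_vote(preds))  # square
```

Even if each individual vote is only slightly better than chance, the combined vote tends to be more reliable.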

Classification of Ensemble Learning

  • Task 1: How to optimize the training data - mainly used to address underfitting (the Boosting family)
  • Task 2: How to improve generalization performance - mainly used to address overfitting (the Bagging family)

As long as each individual classifier performs better than random guessing, the ensemble result is generally better than that of any single classifier.

Bagging

Ensemble Principle

Goal: Classify circles and squares

Collect different datasets: generate different training datasets by sampling with replacement (bootstrap sampling)

Train classifiers: train an independent classifier on each dataset

Equal voting: combine the classifiers’ predictions by equal-weight majority vote to get the final result
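The three steps above can be sketched directly, assuming scikit-learn and NumPy are available (dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)

# Step 1 + 2: draw bootstrap samples and train an independent tree on each
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))  # sampling with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3: equal voting - majority over the trees' predictions
votes = np.stack([t.predict(X) for t in trees])
final = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (final == y).mean())
```

Each tree sees a slightly different dataset, so their errors are partly decorrelated and the majority vote smooths them out.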

Random Forest

In machine learning, a random forest is a classifier made up of multiple decision trees; its output class is the mode of the classes output by the individual trees.

Random Forest = Bagging + Decision Tree

For example, if you train 5 trees, where 4 trees output True and 1 tree outputs False, then the final voting result is True.

Key steps in random forest construction process:

  • Randomly draw samples with replacement to form each tree’s training set, repeating N times (the same sample may appear more than once)
  • For each tree, randomly select m of the M total features (m << M) and build a decision tree on them
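Both steps are built into scikit-learn's `RandomForestClassifier`; a minimal sketch (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample;
# max_features="sqrt" selects roughly sqrt(M) features per split (m << M)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```

The feature subsampling (`max_features`) is what distinguishes a random forest from plain bagging of decision trees.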

Boosting

Basic Concept

Accumulate from weak to strong through learning: each newly added weak learner focuses on the examples the current ensemble gets wrong, so the overall ability improves step by step.

Representative algorithms: AdaBoost, GBDT, XGBoost, LightGBM
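AdaBoost, the classic representative, is available in scikit-learn; a minimal sketch (dataset and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# By default each weak learner is a depth-1 decision tree ("stump");
# 50 of them are trained sequentially, each reweighting the data
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_tr, y_tr)
print("test accuracy:", ada.score(X_te, y_te))
```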

Spark MLlib’s GBT is essentially a pure native Java/Scala GBDT implementation; its functionality and speed still lag behind modern competition-grade frameworks (XGBoost / LightGBM / CatBoost). Industry commonly uses two routes to integrate these advanced algorithms into a Spark Pipeline:

  • XGBoost4J-Spark: embeds an XGBoost worker in each Spark executor and natively supports distributed CPU/GPU training; its API is compatible with the ML Pipeline and works with VectorAssembler and ParamGridBuilder.
  • LightGBM on Spark (microsoft/synapseml): based on LightGBM’s gradient histograms and leaf-wise tree growth; fast training, with native categorical-feature handling and distributed training support.

Implementation Process

  1. Train first learner
  2. Adjust data distribution
  3. Train second learner
  4. Adjust data distribution again
  5. Continue the cycle, improving the ensemble step by step
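The "adjust data distribution" step (as in AdaBoost) can be sketched with NumPy; the update rule shown and the `alpha` value are illustrative:

```python
import numpy as np

def reweight(weights, correct, alpha):
    """AdaBoost-style distribution update: up-weight the samples the
    current weak learner got wrong, then renormalize to a distribution."""
    w = weights * np.exp(np.where(correct, -alpha, alpha))
    return w / w.sum()

w = np.full(5, 0.2)  # start from a uniform distribution over 5 samples
correct = np.array([True, True, False, True, False])
w = reweight(w, correct, alpha=0.5)
print(w)  # misclassified samples (indices 2 and 4) now carry more weight
```

The next weak learner is then trained on this reweighted distribution, which is what makes it focus on the previous learner's mistakes.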