Ensemble Learning
Rather than relying on a single weak model to "solve everything", ensemble learning constructs a set of complementary base learners and combines them by voting or weighting, using "collective wisdom" to improve generalization, stability, and robustness.
Basic Definition
Ensemble learning builds several models to solve a single prediction problem. It works by generating multiple classifiers/models that each learn and predict independently; these predictions are then combined into a single final prediction, which is usually better than that of any individual classifier.
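The combining step can be illustrated with a minimal majority-vote sketch (the function name `majority_vote` and the toy predictions are hypothetical, not from the source):

```python
import numpy as np

def majority_vote(predictions):
    """Combine per-model predictions (rows = models) by majority vote."""
    predictions = np.asarray(predictions)
    # For each sample (column), pick the most frequent class label.
    return np.array([np.bincount(col).argmax() for col in predictions.T])

# Three hypothetical classifiers predicting labels for four samples:
preds = [
    [1, 0, 1, 1],   # model A
    [1, 1, 0, 1],   # model B
    [0, 0, 1, 1],   # model C
]
print(majority_vote(preds))  # -> [1 0 1 1]
```

Even though each model makes a mistake on some sample, the combined vote recovers the majority opinion on every one.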
Classification of Ensemble Learning
- Task 1: How to optimize the training data - mainly used to solve the underfitting problem (the Boosting family)
- Task 2: How to improve generalization performance - mainly used to solve the overfitting problem (the Bagging family)
As long as each individual classifier performs reasonably well (better than random guessing) and the classifiers are sufficiently diverse, the ensemble generally outperforms any single classifier.
Bagging
Ensemble Principle
Goal: Classify circles and squares
Collect different datasets: generate different training sets by sampling with replacement (bootstrap sampling)
Train classifiers: train an independent classifier on each dataset
Equal voting: combine the classifiers' outputs by majority vote to get the final result
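The three bagging steps above can be sketched as follows. This is a minimal illustration assuming scikit-learn is available; the synthetic dataset and the choice of 25 trees are arbitrary, not from the source:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Sample with replacement to build each training set,
# 2) train an independent tree on each bootstrap sample.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # bootstrap indices
    trees.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# 3) Equal voting: majority vote over all trees (labels are 0/1 here).
votes = np.array([t.predict(X_te) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", (y_pred == y_te).mean())
```

In practice, `sklearn.ensemble.BaggingClassifier` wraps these same three steps behind one estimator.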
Random Forest
In machine learning, a random forest is a classifier consisting of multiple decision trees; its output class is the mode of the classes output by the individual trees.
Random Forest = Bagging + Decision Tree
For example, if you train 5 trees and 4 of them output True while 1 outputs False, the final voting result is True.
Key steps in random forest construction process:
- Randomly select samples with replacement, repeating N times, to build each tree's training set (so it may contain duplicate samples)
- Randomly select m of the M features (m << M) and build the decision tree on them
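Both sources of randomness above map directly onto scikit-learn's `RandomForestClassifier`, as in this minimal sketch (the Iris dataset and parameter values are illustrative choices, not from the source):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the data;
# max_features="sqrt" limits the features considered at each split (m << M).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```

Predictions are made by majority vote across the trees, exactly as in plain bagging.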
Boosting
Basic Concept
Learn from weak to strong by accumulation: each time a new weak learner is added, the overall ability of the ensemble improves.
Representative algorithms: AdaBoost, GBDT, XGBoost, LightGBM
Spark MLlib's GBT is essentially a native Java/Scala GBDT implementation; its functionality and speed still lag behind modern competition-grade frameworks (XGBoost / LightGBM / CatBoost). Industry commonly uses two routes to integrate these advanced algorithms into a Spark Pipeline:
- XGBoost4J-Spark: embeds an XGBoost worker in each Spark executor and natively supports distributed CPU/GPU training; its API is compatible with the ML Pipeline and works with VectorAssembler and ParamGridBuilder.
- LightGBM on Spark (microsoft/synapseml): based on LightGBM's gradient histograms and leaf-wise growth, it trains quickly and supports native categorical-feature handling and distributed training.
Implementation Process
- Train the first learner
- Adjust the data distribution (up-weight the samples it misclassified)
- Train the second learner on the adjusted distribution
- Adjust the data distribution again
- Repeat the cycle, continuously improving the ensemble