Bagging vs Boosting
Data Sampling
- Bagging: draws independent bootstrap samples (sampling with replacement) from the training data for each learner
- Boosting: adjusts the importance (weights) of training samples based on the previous round's learning results
Voting
- Bagging: all learners vote with equal weight
- Boosting: learners vote with weighted importance
Learning Order
- Bagging: learning is parallel, each learner has no dependency on others
- Boosting: learning is sequential, with a defined order
Primary Purpose
- Bagging: mainly used to improve generalization performance and solve overfitting
- Boosting: mainly used to improve training accuracy and solve underfitting
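The four contrasts above can be sketched in a few lines of plain Python. The stump base learner, the toy data, and the learning rate of 0.5 are all illustrative assumptions, not part of any particular library:

```python
import random

random.seed(0)

# Toy 1-D regression data (assumed purely for illustration).
X = list(range(10))
Y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]

def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    for s in sorted(set(xs))[1:]:          # candidate thresholds
        left  = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - c1) ** 2 for y in left)
               + sum((y - c2) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, s, c1, c2)
    if best is None:                       # degenerate sample: constant model
        c = sum(ys) / len(ys)
        return lambda x: c
    _, s, c1, c2 = best
    return lambda x: c1 if x < s else c2

# Bagging: each learner sees an independent bootstrap sample, so training is
# parallelizable; predictions are averaged with equal weight.
bag = []
for _ in range(20):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    bag.append(fit_stump([X[i] for i in idx], [Y[i] for i in idx]))
bag_pred = lambda x: sum(f(x) for f in bag) / len(bag)

# Boosting (gradient-boosting flavor): learners are trained sequentially, each
# on the residuals left by the ensemble so far, and combined with weights
# (here a constant learning rate of 0.5, an arbitrary choice).
boost, resid = [], Y[:]
for _ in range(20):
    f = fit_stump(X, resid)
    boost.append(f)
    resid = [r - 0.5 * f(x) for x, r in zip(X, resid)]
boost_pred = lambda x: sum(0.5 * f(x) for f in boost)
```

Note how the bagging loop never touches the targets between rounds, while the boosting loop rewrites `resid` each round, which is exactly the sequential dependency described above.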
GBDT
Basic Introduction
GBDT stands for Gradient Boosting Decision Tree. It is widely regarded as one of the best-performing traditional (non-deep-learning) machine learning algorithms.
Decision Tree
Regardless of whether the problem is regression, binary classification, or multi-class classification, GBDT always uses CART regression trees.
For regression tree algorithms, the most important task is finding the best split point. The candidate split points include all possible values of all features.
In classification trees, the best split point is determined by entropy or the Gini index, both of which measure purity. In regression trees, however, sample labels are continuous values, so entropy-based metrics are no longer appropriate. Instead, squared error is used, which effectively evaluates the goodness of fit.
Regression Decision Tree
Both regression and classification decision trees face two questions:
- How to select the split point?
- How to determine the output value of a leaf node?
A regression tree corresponds to a partition of the input space (feature space) and output values on each partition unit. In classification trees, information gain and information gain rate from information theory are used to select the best split point.
In regression trees, a heuristic approach is used: for a dataset with n features, every value of every feature is tried as a candidate split point, and the one minimizing the squared error is chosen.
Assume the input space is divided into M units R1, R2, …, RM. The output value for each region is cm = avg(yi | xi ∈ Rm), i.e., the average of all y values in that region.
Example:
Suppose we want to perform regression on the ages of the residents of a building, dividing the units into 3 regions R1, R2, R3. Then R1's output is the average age of the residents in the first column of units, R2's output is the average age in the second column, and R3's output is the average age of the eight residents in the third and fourth columns.
Algorithm Flow
Input: Training dataset D
Output: Regression tree f(x)
In the input space of the training dataset, recursively divide each region into two sub-regions and determine the output value for each sub-region, constructing a binary decision tree:
- Select the optimal split feature j and split point s: traverse each feature j and, for that fixed feature, scan the candidate split points s; select the pair (j, s) that minimizes the total squared error Σ (yi − c1)² over R1(j, s) plus Σ (yi − c2)² over R2(j, s).
- Use the selected (j, s) to partition the region into two sub-regions and set each sub-region's output to the average of its y values.
- Repeat the first two steps on each sub-region until the stopping condition is met.
- Divide the input space into M regions R1, R2, …, RM and generate the decision tree.
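The steps above can be sketched as a short recursive implementation. This is a minimal illustration assuming plain Python lists of feature rows; the function names (`best_split`, `build_tree`, `predict`) and the stopping parameters are made up for this sketch:

```python
def squared_error(ys):
    """Impurity of a region: sum of squared deviations from its mean."""
    if not ys:
        return 0.0
    c = sum(ys) / len(ys)
    return sum((y - c) ** 2 for y in ys)

def best_split(X, y):
    """Step 1: scan every feature j and every candidate threshold s and
    return the (loss, j, s) minimizing the two-region squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for a, b in zip(values, values[1:]):
            s = (a + b) / 2                      # midpoint between adjacent values
            left  = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            loss = squared_error(left) + squared_error(right)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best                                  # None if no split is possible

def build_tree(X, y, max_depth=3, min_size=2):
    """Steps 2-4: partition recursively; a leaf outputs the mean y of its region."""
    split = best_split(X, y) if max_depth > 0 and len(y) >= min_size else None
    if split is None:
        return sum(y) / len(y)                   # leaf value c_m
    _, j, s = split
    L = [i for i, row in enumerate(X) if row[j] <= s]
    R = [i for i, row in enumerate(X) if row[j] > s]
    return (j, s,
            build_tree([X[i] for i in L], [y[i] for i in L], max_depth - 1, min_size),
            build_tree([X[i] for i in R], [y[i] for i in R], max_depth - 1, min_size))

def predict(tree, row):
    """Follow splits until a leaf (a plain float) is reached."""
    while isinstance(tree, tuple):
        j, s, left, right = tree
        tree = left if row[j] <= s else right
    return tree
```

The `max_depth` and `min_size` guards play the role of the stopping condition mentioned in step 3.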
Test Case
A concrete example to deepen understanding of regression decision trees.
Training Data
The training data consists of a single feature x and a continuous target y:
x: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
y: 5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05
Calculation Process
Select the optimal split feature j and split point s:
- Determine the first question, the optimal split feature: this dataset has only one feature, so the optimal split feature is naturally x.
- Determine the second question, the optimal split point: consider the 9 candidate split points [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]. The loss function is the squared loss Loss(y, f(x)) = (f(x) − y)². For each candidate s, partition the data into R1 and R2 and compute each region's output cm = avg(yi | xi ∈ Rm).
Calculate output values for sub-regions:
For example, taking s = 1.5: R1 = {1}, R2 = {2,3,4,5,6,7,8,9,10}. The output values for these two regions are:
- c1 = 5.56
- c2 = 7.50
Similarly, output values for other split points can be obtained.
Calculate the loss function value for each split point to find the optimal one:
Substituting c1 and c2 into the squared loss gives the total error for a split at s: m(s) = Σ (yi − c1)² over R1 plus Σ (yi − c2)² over R2.
For s = 1.5, m(s) = 0 + 15.72 = 15.72; the remaining split points are computed in the same way.
m(s) is minimized at s = 6.5, where m(s) = 1.93. Therefore, the first split is [j = x, s = 6.5].
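The scan over the 9 split points can be reproduced directly. The y values below are the toy data assumed throughout this example, consistent with the averages quoted above (c1 = 5.56 and c2 = 7.50 at s = 1.5):

```python
x = list(range(1, 11))
# Assumed toy targets for the worked example.
y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]

def m(s):
    """Total squared error when splitting at threshold s:
    sum of squared deviations from each region's mean."""
    r1 = [yi for xi, yi in zip(x, y) if xi < s]
    r2 = [yi for xi, yi in zip(x, y) if xi >= s]
    c1, c2 = sum(r1) / len(r1), sum(r2) / len(r2)
    return (sum((yi - c1) ** 2 for yi in r1)
            + sum((yi - c2) ** 2 for yi in r2))

splits = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
losses = {s: round(m(s), 2) for s in splits}
best = min(losses, key=losses.get)   # the split point with the smallest m(s)
```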
Partition regions using (j, s) and determine output values:
- Two regions: R1 = {1,2,3,4,5,6}, R2 = {7,8,9,10}
- Output values: c1 = 6.24, c2 = 8.91
Continue splitting R1 with split points [1.5, 2.5, 3.5, 4.5, 5.5]:
At s = 3.5, m(s) is minimized.
Generate the regression tree:
Assuming we stop after generating 3 regions, the final regression tree is the piecewise-constant function f(x) = 5.72 for x < 3.5, f(x) = 6.75 for 3.5 ≤ x < 6.5, and f(x) = 8.91 for x ≥ 6.5, where each output is the average of y over its region.
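The finished 3-region tree can be written out as a plain prediction function, with each leaf value computed as the regional average. The data is the same assumed toy dataset used in the worked example:

```python
x = list(range(1, 11))
y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]  # assumed toy data

avg = lambda v: sum(v) / len(v)
c1 = avg([yi for xi, yi in zip(x, y) if xi < 3.5])           # region R1: x in {1, 2, 3}
c2 = avg([yi for xi, yi in zip(x, y) if 3.5 <= xi < 6.5])    # region R2: x in {4, 5, 6}
c3 = avg([yi for xi, yi in zip(x, y) if xi >= 6.5])          # region R3: x in {7, ..., 10}

def f(x_new):
    """Final 3-region regression tree: splits at 3.5 and 6.5;
    each leaf outputs the average y of its region."""
    if x_new < 3.5:
        return c1
    if x_new < 6.5:
        return c2
    return c3
```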