Linear Regression Scenarios
- House price prediction
- Sales forecasting
- Loan amount prediction
Linear Regression Definition
Linear Regression is a statistical method that uses a regression equation (function) to model the relationship between one or more independent variables (features) and a dependent variable (target).
Characteristics: when there is only one independent variable, it is called simple (univariate) regression; when there are multiple independent variables, it is called multiple (multivariate) regression.
The relationship between features and target comes in two main forms: linear relationships and non-linear relationships.
Simple Linear Relationship (One Variable)
Example of simple linear relationship.
Multiple Linear Relationship (Multiple Variables)
Example of multiple linear relationship.
Non-linear Relationships
Example of non-linear relationships.
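As a minimal sketch of a simple linear relationship, the code below fits a straight line to made-up house-price data (the areas, the slope 0.8, and the intercept 10 are all hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical data: house area (m^2) and price, assuming price = 0.8 * area + 10
area = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
price = 0.8 * area + 10.0

# Fit a degree-1 polynomial, i.e. simple linear regression: price ≈ k * area + b
k, b = np.polyfit(area, price, deg=1)
print(round(k, 3), round(b, 3))
```

Because the sample data is exactly linear, the fit recovers the assumed slope and intercept.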
Loss Function and Optimization
Suppose there’s a study score example where the real data follows this relationship:
Real relationship: Final Score = 0.5 × Regular Score + 0.3 × Final Exam Score
Now we hypothesize a relationship:
Predicted Score = 0.45 × Regular Score + 0.2 × Final Exam Score
As you can see, there is an error between the actual result and our prediction.
Given that this error exists, how do we measure it?
Loss Function
Total loss function:
J(w) = ∑ᵢ (yᵢ - h(xᵢ))²
- yᵢ is the true value of the i-th training sample
- h(xᵢ) is the predicted value computed from the i-th sample's features
- The sum runs over all training samples
- This criterion is also known as least squares
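The score example above can be plugged straight into this loss. A small sketch, where the coefficients come from the text but the four students' scores are made-up sample data:

```python
# Hypothetical student scores for illustration
regular = [90, 80, 70, 60]        # regular scores
final_exam = [85, 75, 95, 65]     # final exam scores

# Real rule and hypothesized rule, both taken from the text
y_true = [0.5 * r + 0.3 * f for r, f in zip(regular, final_exam)]
y_pred = [0.45 * r + 0.2 * f for r, f in zip(regular, final_exam)]

# Total squared loss: sum of (y_i - h(x_i))^2 over all samples
total_loss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
print(round(total_loss, 2))
```

A better hypothesis (coefficients closer to 0.5 and 0.3) would drive this total loss toward zero.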
Optimization Algorithms
Goal: find the weights W in the model that minimize the loss, i.e., the W value at which the loss function reaches its minimum.
Two commonly used optimization algorithms for linear regression:
- Analytical Solution (Normal Equation)
- Gradient Descent
Normal Equation:
w = (XᵀX)⁻¹ Xᵀ y
Understanding: X is the feature matrix and y is the target vector; the optimal weights are obtained directly in a single computation.
Disadvantage: when there are many features, inverting XᵀX (roughly O(n³)) becomes too slow, and if XᵀX is not invertible there is no direct solution.
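A minimal sketch of the normal equation in NumPy, using hypothetical data generated from the score rule in the text (no noise and no intercept term, for simplicity):

```python
import numpy as np

# Hypothetical training data: columns are regular score and final exam score
rng = np.random.default_rng(0)
X = rng.uniform(50, 100, size=(20, 2))
y = X @ np.array([0.5, 0.3])          # real rule from the text

# Normal equation: solve (X^T X) w = X^T y
# (np.linalg.solve is numerically safer than forming the explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w, 3))
```

Since the targets were generated exactly from the rule, the solve recovers the true weights 0.5 and 0.3.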
Gradient Descent (GD):
The basic idea of gradient descent can be compared to walking down a mountain. A person is stuck on a mountainside and needs to reach the lowest point (the valley), but thick fog makes visibility very low, so the full path down cannot be seen. Instead, they must use the local terrain: at each position, find the steepest downhill direction, take a step that way, and repeat.
Gradient is an important concept in calculus:
- In single-variable functions, the gradient is actually the derivative, representing the slope of the tangent line at a specific point
- In multi-variable functions, the gradient is a vector with direction - the gradient’s direction specifies the fastest upward direction at a given point
- In calculus, taking partial derivatives of parameters in a multi-variable function and writing them as a vector gives the gradient
The gradient descent update rule is: θ ← θ - α · ∇J(θ)
The meaning of α: in gradient descent, α is called the learning rate or step size; it controls how far each update step moves.
Why the gradient is multiplied by a negative sign:
The negative sign before the gradient means moving in the direction opposite to the gradient. As mentioned earlier, the gradient points in the direction of fastest increase at a point, so to move downhill we must step in the negative gradient direction, hence the minus sign.
Gradient Descent for Single Variable Functions
- Suppose there's a single-variable function J(θ) = θ², whose gradient (derivative) is J′(θ) = 2θ
- Initialize at θ0 = 1
- Learning rate α = 0.4
- Each update is θ ← θ - 0.4 · 2θ = 0.2θ, giving θ1 = 0.2, θ2 = 0.04, θ3 = 0.008, θ4 = 0.0016
After four iterations, we have essentially reached the function's minimum at θ = 0.
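The iterations above can be sketched directly in code, using exactly the setup from the example (J(θ) = θ², θ0 = 1, α = 0.4):

```python
# Gradient descent on J(θ) = θ^2; the gradient is J'(θ) = 2θ
theta = 1.0
alpha = 0.4
history = [theta]
for _ in range(4):
    grad = 2 * theta
    theta = theta - alpha * grad   # θ ← θ - α·∇J(θ), i.e. θ shrinks to 0.2θ
    history.append(theta)
print(history)
```

Each step multiplies θ by 0.2, so the sequence 1, 0.2, 0.04, 0.008, 0.0016 rapidly approaches the minimum at 0.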
Gradient Descent for Multi-Variable Functions
- Objective function: J(θ) = θ1² + θ2²
- Starting point: θ0 = (1, 3)
- Learning rate: α = 0.1
- Gradient of the function: ∇J(θ) = ⟨2θ1, 2θ2⟩
After multiple iterations, gradient descent will approach the function’s minimum point (0, 0).
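The same loop works for the multi-variable case; a sketch with the setup from the example (start (1, 3), α = 0.1):

```python
# Gradient descent on J(θ) = θ1^2 + θ2^2, with ∇J(θ) = (2θ1, 2θ2)
theta = [1.0, 3.0]
alpha = 0.1
for _ in range(100):
    grad = [2 * theta[0], 2 * theta[1]]
    theta = [t - alpha * g for t, g in zip(theta, grad)]
print(theta)
```

Each step multiplies both coordinates by (1 - 2α) = 0.8, so after enough iterations θ is numerically indistinguishable from the minimum (0, 0).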
Comparison: Gradient Descent vs Normal Equation
| Feature | Gradient Descent | Normal Equation |
|---|---|---|
| Advantages | Scales to many features and large datasets; widely applicable | Direct closed-form solution; no learning rate or iteration needed |
| Disadvantages | Requires choosing a learning rate and iterating | Slow with many features (matrix inversion is roughly O(n³)); fails if XᵀX is not invertible |
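To make the comparison concrete, the sketch below fits the same hypothetical dataset with both methods (targets generated from the score rule in the text; features scaled to [0, 1] so a fixed learning rate works for gradient descent):

```python
import numpy as np

# Hypothetical data: y = 0.5*x1 + 0.3*x2, no noise, no intercept
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 2))
y = X @ np.array([0.5, 0.3])
m = len(y)

# Normal equation: one direct solve, no hyperparameters
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared loss: needs a learning rate and iterations
w_gd = np.zeros(2)
alpha = 0.5
for _ in range(2000):
    grad = (2 / m) * X.T @ (X @ w_gd - y)   # ∇ of (1/m)·Σ(x_i·w - y_i)^2
    w_gd = w_gd - alpha * grad
print(np.round(w_ne, 3), np.round(w_gd, 3))
```

Both reach the same weights here; the practical difference shows up when the feature count grows (the solve becomes expensive) or when the learning rate is poorly chosen (gradient descent diverges or crawls).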