Big Data 208 - Ridge Regression and Lasso Regression

TL;DR

  • Scenario: Solve multicollinearity and overfitting problems
  • Conclusion: Ridge Regression suitable for highly correlated features, Lasso Regression suitable for feature selection
  • Output: Application scenarios and regularization parameter λ selection guidance

Version Matrix

Version | Status | Description
Ridge Regression (L2) | Verified | Suitable for highly correlated feature scenarios; keeps all features but compresses their coefficients
Lasso Regression (L1) | Verified | Suitable for high-dimensional data and feature selection; can compress some coefficients to zero
Elastic Net | Verified | Combines the advantages of L1 and L2 regularization; suitable for mixed scenarios

Ridge Regression and Lasso

Ridge Regression and Lasso (Least Absolute Shrinkage and Selection Operator) are two classic regularization methods for linear regression, used to solve the overfitting and multicollinearity problems that arise in machine learning. The core idea of both methods is to constrain the model parameters by adding different regularization terms on top of Ordinary Least Squares (OLS).

1. Ridge Regression (L2 Regularization)

  • Adds an L2-norm penalty: the sum of squared regression coefficients (λ∑β²) is added to the loss function
  • Characteristics: all features are retained, but their coefficients are shrunk toward zero
  • Applicable scenarios: when features are highly correlated (e.g., gene expression data)
  • Mathematical expression: min(∑(y_i - ŷ_i)² + λ∑β_j²)

2. Lasso Regression (L1 Regularization)

  • Adds an L1-norm penalty: the sum of absolute regression coefficients (λ∑|β|) is added to the loss function
  • Characteristics: performs feature selection; some coefficients are shrunk exactly to zero
  • Applicable scenarios: high-dimensional data where feature selection is needed (e.g., text classification)
  • Mathematical expression: min(∑(y_i - ŷ_i)² + λ∑|β_j|)
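The contrast between the two penalties can be seen directly in code. The sketch below is illustrative: the synthetic data, feature count, and alpha values (sklearn's name for λ) are assumptions, not values from the text.

```python
# Minimal sketch: Ridge (L2) keeps all coefficients nonzero,
# Lasso (L1) drives some of them exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first 3 features actually drive the target; the rest are noise.
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: zeroes out weak coefficients

print("Ridge nonzero coefs:", np.sum(ridge.coef_ != 0))  # all 10 stay nonzero
print("Lasso nonzero coefs:", np.sum(lasso.coef_ != 0))  # typically only a few survive
```

With these settings, Lasso usually recovers roughly the three truly informative features, while Ridge merely shrinks the noise coefficients without removing them.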

Typical Application Scenarios

  • House price prediction: When house features (area, room count, location, etc.) have multicollinearity
  • Gene data analysis: Processing thousands of gene expression data
  • Financial risk control: Screening key risk factors from hundreds of financial indicators

Differences Between Two Methods

  1. Ridge Regression shrinks all coefficients but does not set any of them to zero, so it is suitable when all features should be kept
  2. Lasso drives some coefficients to exactly zero, achieving feature selection
  3. Elastic Net combines the advantages of both by using L1 and L2 regularization simultaneously

Selection Suggestions

  • When prediction performance matters more than interpretability, choose Ridge Regression
  • When feature selection and model interpretability are needed, choose Lasso
  • In practice, cross-validation is usually needed to determine the optimal regularization parameter λ
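Selecting λ by cross-validation can be done with sklearn's built-in CV estimators. A minimal sketch, where the data and the candidate alpha grid (sklearn's name for λ) are illustrative assumptions:

```python
# Minimal sketch: choosing the regularization strength by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.3, size=150)

alphas = np.logspace(-3, 2, 30)  # candidate λ values on a log grid
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("best Ridge alpha:", ridge_cv.alpha_)
print("best Lasso alpha:", lasso_cv.alpha_)
```

As the text notes later, Lasso often benefits from a finer grid near small alphas, since the exact threshold at which coefficients drop to zero matters for feature selection.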

Ridge Regression Principle

Ridge Regression adds an L2 regularization term to ordinary linear regression, making the model more robust. Specifically, the Ridge loss function is the ordinary least-squares loss plus an L2 regularization term: the sum of squared regression coefficients. Minimizing this loss constrains the size of the regression coefficients, preventing overly large coefficients from causing the model to overfit.

Basic Principle

Ridge Regression is an improvement that addresses a limitation of ordinary linear regression. Its purpose is to solve the problem of XᵀX being non-invertible, which in practice also helps overcome multicollinearity in the dataset. The approach is very simple: add a perturbation term to the coefficient formula, so that a matrix that previously had no (generalized) inverse becomes invertible, making the problem stable and solvable.

Ridge Regression adds a regularization term to the multivariate linear regression loss function: the L2 norm of the coefficient vector w (i.e., the sum of squared coefficients) multiplied by the regularization coefficient λ. The complete Ridge loss function is:

Ridge Regression formula:

min(∑(y_i - ŷ_i)² + λ∑w_j²)

Solving this optimization problem yields:

w = (XᵀX + λI)^(-1) Xᵀy

This simple step is actually quite clever. Adding a full-rank diagonal matrix to the coefficient calculation plays two roles:

  1. It makes (XᵀX + λI) full rank, reducing the impact of collinearity among the dataset's feature columns
  2. It is also equivalent to penalizing the explanatory power of all feature columns; the larger λ is, the stronger the penalty

As long as (XᵀX + λI) has an inverse, we can solve for w. The necessary and sufficient condition for a matrix to be invertible is that its determinant is nonzero. If the original feature matrix has collinearity, the square matrix XᵀX is not full rank, so it is singular.

In that case XᵀX has no inverse and least squares cannot be used directly. After adding λI, the situation is very different: XᵀX + λI is singular only if

  • λ equals 0, or
  • -λ is an eigenvalue of XᵀX (impossible for λ > 0, since XᵀX is positive semi-definite and has no negative eigenvalues)

Otherwise XᵀX + λI is always full rank. In sklearn we are free to choose the value of λ, so we can simply make it nonzero and avoid the first situation.

That is to say, the matrix inverse always exists. With this guarantee, w can be written as:

w = (XᵀX + λI)^(-1) Xᵀy

The regularization coefficient λ thus shields the solution from exact correlations among features: as long as λ > 0, least squares can always be applied. For matrices with highly correlated columns, increasing λ makes the determinant of XᵀX + λI larger and the norm of its inverse smaller, thereby controlling the size of the parameter vector w. The larger λ is, the less the solution is affected by collinearity.
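The closed-form solution above can be checked numerically. A minimal sketch (the random data and λ value are illustrative; fit_intercept=False is set so sklearn's objective matches the formula exactly):

```python
# Minimal sketch: verify the closed-form ridge solution w = (XᵀX + λI)⁻¹ Xᵀy
# against sklearn's Ridge estimator.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
lam = 1.0

# Adding λI guarantees XᵀX + λI is invertible, even if X has collinear columns.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_closed, w_sklearn))  # True
```

Note that `np.linalg.solve` is used rather than explicitly forming the inverse, which is the standard numerically stable way to evaluate this formula.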


Lasso Regression Principle

Lasso Regression (Least Absolute Shrinkage and Selection Operator) and Ridge Regression are both regularization variants of linear regression, but they have significant differences in regularization methods:

1. Regularization Term Type

  • Lasso Regression uses L1 regularization, i.e., penalty on sum of coefficient absolute values: λΣ|β_j|
  • Ridge Regression uses L2 regularization, i.e., penalty on sum of coefficient squares: λΣβ_j²

2. Feature Selection Capability

  • Lasso’s L1 penalty causes some coefficients to shrink completely to zero, achieving automatic feature selection
  • Ridge Regression’s L2 penalty makes all coefficients smaller, but never exactly zero

3. Mathematical Characteristic Differences

  • Geometrically, Lasso’s constraint region is a diamond (L1 ball), and solutions often land on its vertices, where some coefficients are exactly zero
  • Ridge Regression’s constraint region is a circle (L2 ball), so solutions shrink coefficients without zeroing them

4. Applicable Scenario Comparison

  • When the number of features far exceeds the number of samples (p ≫ n), Lasso usually performs better
  • When highly correlated features exist, Ridge Regression is often more stable

5. Practical Application Examples

  • In gene expression data analysis (usually tens of thousands of gene features), Lasso can effectively identify few key genes
  • In economic time series prediction (highly correlated features), Ridge Regression provides more robust predictions

6. Parameter Selection

Both methods need cross-validation to determine optimal regularization parameter λ, but Lasso’s λ usually needs finer grid search to capture key threshold for feature selection.

Note: In practice, Elastic Net (combining L1 and L2 regularization) is often used as compromise, having advantages of both methods.
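The Elastic Net compromise is available directly in sklearn. A minimal sketch, where the data, alpha, and l1_ratio values are illustrative assumptions:

```python
# Minimal sketch: Elastic Net mixes the L1 and L2 penalties.
# In sklearn's parameterization the penalty is:
#   alpha * (l1_ratio * Σ|w| + 0.5 * (1 - l1_ratio) * Σw²)
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 12))
y = X[:, 0] * 2.0 + X[:, 1] * 2.0 + rng.normal(scale=0.5, size=200)

# l1_ratio=0.5 weights the L1 and L2 terms equally; l1_ratio=1 recovers
# pure Lasso, l1_ratio=0 approaches pure Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("nonzero coefficients:", np.sum(enet.coef_ != 0))
```

Because the L1 component is present, Elastic Net can still zero out weak coefficients, while the L2 component stabilizes the solution when features are correlated.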


Solution Methods

Main methods to solve collinearity:

  1. Method 1: Before modeling, test the correlation between features. If multicollinearity exists, consider applying SVD or PCA to the dataset. Both apply an orthogonal transformation, so the resulting feature columns have no correlation. However, this changes the structure of the dataset, and the transformed feature columns are no longer interpretable.

  2. Method 2: Use stepwise regression to select the independent variables with the strongest explanatory power for the dependent variable, and add a penalty factor to correlated independent variables to weaken their explanatory power. This does not completely eliminate multicollinearity, but it bypasses the sensitivity of least squares to collinearity when building a linear regression model.

  3. Method 3: Modify the original algorithm: abandon the requirement that the linear equation's parameters be estimated without bias, so that the model tolerates multicollinearity in the feature columns while still keeping the SSE as small as possible.
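Method 1 can be sketched in a few lines. The nearly collinear synthetic columns below are an illustrative assumption:

```python
# Minimal sketch of Method 1: PCA's orthogonal transform removes the
# correlation between feature columns before fitting ordinary least squares.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x1 = rng.normal(size=(100, 1))
# Second column is almost an exact multiple of the first: strong collinearity.
X = np.hstack([x1, x1 * 2 + rng.normal(scale=0.01, size=(100, 1))])
y = X[:, 0] + rng.normal(scale=0.1, size=100)

X_pca = PCA(n_components=2).fit_transform(X)
corr = np.corrcoef(X_pca, rowvar=False)
print("off-diagonal correlation:", corr[0, 1])  # essentially 0

# OLS on the decorrelated columns is now stable, but the transformed
# features no longer carry the original columns' interpretation.
model = LinearRegression().fit(X_pca, y)
```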

Generally, it is preferable to solve the problem with a single algorithm rather than a combination of several, so the latter two methods are the main candidates. Stepwise regression is covered in the last part of the linear regression series; the third solution is exactly the Ridge Regression and Lasso algorithms discussed in detail in this article.

Ridge Regression and Lasso Regression differ in the regularization term they add, which affects model selection and application. Ridge Regression is better suited to preventing overfitting caused by overly large coefficients, while Lasso has the advantage in feature selection and therefore usually performs better on high-dimensional data. In practice, cross-validation can be used to select the most suitable regularization parameter and to choose between Ridge and Lasso, or the two can be combined (Elastic Net) for a better model.


Error Quick Reference

Symptom | Root Cause | Fix
Cannot solve model | Feature matrix collinearity makes XᵀX non-invertible | Use Ridge Regression, or increase λ
Overfitting | Model complexity too high: too many features without effective selection | Use Lasso for feature selection
Insufficient feature selection | Ridge Regression cannot compress coefficients to zero | Use Lasso or Elastic Net for automatic feature selection