Big Data 207 - How to Handle Multicollinearity

TL;DR

  • Scenario: When solving linear regression with the least squares method, multicollinearity undermines model stability and makes the coefficient estimates unreliable
  • Conclusion: Regularization techniques such as Ridge Regression and Lasso Regression address the problem
  • Output: Introducing a regularization term avoids the non-invertible matrix problem and stabilizes the regression results

Version Matrix

  • Linear Regression (scikit-learn 0.24+): training and evaluation of the standard linear regression model
  • Multicollinearity Handling (scikit-learn 0.24+): collinearity addressed via Ridge Regression and Lasso Regression
  • MSE Calculation (scikit-learn 0.24+): mean_squared_error computes the model error
  • R² Calculation (scikit-learn 0.24+): r2_score evaluates model fit

Code Implementation

Use the scikit-learn library to implement the linear regression algorithm and compute the corresponding evaluation metrics. The calculations below draw on concepts introduced earlier.

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
# `data` is the dataset loaded earlier: every column except the last is a feature,
# and the last column is the target.
reg.fit(data.iloc[:, :-1].values, data.iloc[:, -1].values)
reg.coef_       # view the fitted equation coefficients
reg.intercept_  # view the intercept

Comparing with the manually calculated ws, the results are highly consistent. Next, compute the model's MSE and coefficient of determination:

from sklearn.metrics import mean_squared_error, r2_score

y = data.iloc[:, -1].values
yhat = reg.predict(data.iloc[:, :-1].values)
mean_squared_error(y, yhat)  # mean squared error
r2_score(y, yhat)            # R², the coefficient of determination
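As a sanity check on the claim that scikit-learn matches the manual normal-equation solution, here is a minimal self-contained sketch. The synthetic data below stands in for the `data` used above; the variable names are illustrative, not from the original dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for `data`: 3 features plus a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 3.0 + rng.normal(scale=0.1, size=100)

reg = LinearRegression().fit(X, y)

# Manual least squares via the normal equation: w = (X^T X)^{-1} X^T y,
# with a column of ones appended so the last entry of w is the intercept.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
w = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

print(np.allclose(w[:-1], reg.coef_))      # coefficients agree
print(np.allclose(w[-1], reg.intercept_))  # intercept agrees
```

Both checks print True: on a well-conditioned design matrix, the closed-form solution and scikit-learn's solver coincide to numerical precision.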

Multicollinearity

Although the least squares method can quickly find the global optimum when solving linear regression, it comes with strict usage conditions: the design matrix X must satisfy certain requirements to guarantee a stable, unique solution.

Specifically, applying the least squares method requires these key conditions:

  1. The matrix X^T X must be full rank (i.e., its determinant is nonzero)
  2. The matrix X^T X must be invertible, or at least have a generalized inverse
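Both conditions can be checked directly with NumPy. The matrix below is a hypothetical design matrix whose third column duplicates the first:

```python
import numpy as np

# Hypothetical design matrix: the third column duplicates the first.
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 2.0],
              [3.0, 4.0, 3.0],
              [4.0, 3.0, 4.0]])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2, not 3: X^T X is not full rank
print(np.linalg.det(XtX))          # numerically zero, so no inverse exists
```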

In practice, several situations commonly cause the least squares method to fail:

  • When the sample size n is less than the feature count p (n < p), X^T X is necessarily singular
  • When the columns of X are exactly or approximately linearly dependent (the multicollinearity problem)
  • When the data contains highly correlated feature variables
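The first case (n < p) is easy to demonstrate; the shapes below are purely illustrative:

```python
import numpy as np

# Fewer samples than features: n = 3 rows, p = 5 columns.
rng = np.random.default_rng(42)
X = rng.normal(size=(3, 5))

XtX = X.T @ X  # 5 x 5, but its rank can be at most 3
print(np.linalg.matrix_rank(XtX))  # 3 < 5: singular, so OLS has no unique solution
```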

When multicollinearity is present, least squares solving faces these challenges:

  1. Parameter estimates become very unstable; small perturbations of the data can cause large changes in the estimates
  2. The variance of the parameter estimates inflates abnormally
  3. The solution may not be unique; infinitely many solutions can exist
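The instability in the first point can be sketched with two nearly identical synthetic features (an assumed toy setup, not real data): refitting after a tiny perturbation of the targets shifts the coefficients by far more than the perturbation itself.

```python
import numpy as np

# Two nearly identical features: a textbook near-collinear design.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)  # x2 differs from x1 only by tiny noise
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.01, size=n)

# Fit, then refit after perturbing the targets by noise of the same tiny scale.
w1, *_ = np.linalg.lstsq(X, y, rcond=None)
w2, *_ = np.linalg.lstsq(X, y + rng.normal(scale=0.01, size=n), rcond=None)

print(np.linalg.cond(X))      # condition number is enormous
print(np.abs(w1 - w2).max())  # coefficient shift far exceeds the 0.01 noise scale
```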

For example, in economic data analysis, using several highly correlated indicators such as GDP, household income, and consumption expenditure as predictors easily triggers the problems above. Traditional OLS (Ordinary Least Squares) estimation then cannot give reliable results.

To address these problems, statisticians have developed various improved methods:

  • Ridge Regression
  • Lasso Regression
  • Principal Component Regression (PCR)
  • Partial Least Squares Regression (PLSR)

By introducing regularization terms or applying dimensionality reduction, these methods effectively handle non-invertible matrices and multicollinearity, yielding more stable and reliable regression results.
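As a rough illustration of how regularization tames collinear coefficients (synthetic data; the alpha values are arbitrary choices, not recommendations), compare plain OLS against Ridge and Lasso on two nearly duplicate features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)    # x2 nearly duplicates x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=n)  # true combined effect is 3

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.01).fit(X, y)

print(ols.coef_)    # typically huge, opposite-signed values
print(ridge.coef_)  # both small, summing to about 3
print(lasso.coef_)  # tends to zero out one of the redundant features
```

OLS splits the shared effect into two wildly opposed coefficients, while Ridge spreads it evenly and Lasso concentrates it, both keeping the estimates bounded.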

Note that because the columns of a dataset are all measurements of the same underlying phenomenon, multicollinearity is hard to avoid during data collection; some degree of collinearity is the norm for many datasets. In the more extreme case where the dataset has more columns than rows, the least squares method cannot solve at all. Seeking optimized alternatives to plain linear regression is therefore necessary.

First, let's understand multicollinearity. In section 2 we derived the least squares solution for multivariate linear regression: taking the derivative of the loss function yielded the formula and procedure for solving the coefficient vector w.

The final step requires left-multiplying by the inverse of X^T X, and a necessary condition for that inverse to exist is that the features have no exact multicollinearity.

Necessary and Sufficient Condition for the Inverse Matrix to Exist

First, understand the significance and impact of the inverse matrix's existence. When does a matrix have an inverse? By the invertible matrix theorem, matrix A is invertible if and only if |A| != 0.

The necessary and sufficient condition for the inverse to exist is therefore a nonzero determinant. For linear regression this means |X^T X| cannot be 0, which is one of the core conditions for solving linear regression by least squares.

Necessary and Sufficient Condition for Determinant Not Zero

What conditions must hold for the determinant to be nonzero? Here, review some basics from linear algebra. Suppose the feature matrix X has shape (m, n); then X^T has shape (n, m), and the product X^T X is an (n, n) matrix.

Any square matrix has a determinant; take a 3x3 determinant as an example. Its expansion already has six terms, and real feature matrices are far higher dimensional, so computing the determinant by direct expansion becomes extremely difficult. In linear algebra, elimination lets us reduce any determinant to a triangular ("trapezoidal") determinant.

In the triangular form, all nonzero entries are gathered on one side of the diagonal (usually above it). Although the individual entries change (e.g., from x11 to a11), the elementary operation used here, adding a multiple of one row to another, does not change the determinant's value. For a triangular determinant, the calculation is much easier.

It is easy to see why: because everything below the diagonal is 0, the determinant of the triangular matrix is simply the product of its diagonal elements. If any diagonal element is 0, the whole determinant is 0; as long as no diagonal element is 0, the determinant is nonzero. This brings in an important concept: the full rank matrix.
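The diagonal-product rule is easy to verify numerically on a small upper triangular example (the entries below are arbitrary):

```python
import numpy as np

# An upper triangular ("trapezoidal") matrix: all entries below the diagonal are 0.
U = np.array([[2.0, 5.0, 1.0],
              [0.0, 3.0, 4.0],
              [0.0, 0.0, 6.0]])

print(np.linalg.det(U))     # 36.0: equals 2 * 3 * 6
print(np.prod(np.diag(U)))  # 36.0: the product of the diagonal

# Zeroing any single diagonal entry collapses the whole determinant to 0.
U[1, 1] = 0.0
print(np.linalg.det(U))     # 0.0
```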

Necessary and Sufficient Condition for Matrix Full Rank

For a matrix to be full rank, its triangular form must have no 0 on the diagonal. What kind of matrix satisfies this?

We can apply elementary row and column operations, such as swapping rows or columns, or multiplying a row or column by a constant and adding it to another, to reduce a matrix to triangular form.

Matrix A is clearly not full rank: its triangular form contains an all-zero row, so its determinant is 0. Matrices B and C have no all-zero rows, so they are full rank.

The difference between Matrix A and Matrix B is that A contains two rows in an exact linear relationship ((1, 1, 2) and (2, 2, 4)), while B and C contain no such pair of rows.

In Matrix B, although no diagonal element is 0, one element (0.02) is very close to 0, while Matrix C's diagonal has no element particularly close to 0.

The relationship between the first and third rows of Matrix A is called an exact correlation: the rows are completely dependent, and one row can reduce the other to all zeros. Under exact correlation, Matrix A's determinant is 0, so its inverse does not exist. In least squares, if such an exact correlation exists in the X^T X matrix, the inverse does not exist, least squares cannot be used at all, and linear regression yields no result.

The relationship between the first and third rows of Matrix B is different: they are very close to an exact correlation but not completely dependent, so one row cannot reduce the other to all zeros. This is called a high correlation. Under high correlation the determinant is not 0 but very close to 0, so Matrix B's inverse exists, but its entries blow up as the determinant approaches 0. Least squares can still be used, yet the resulting inverse is enormous, directly distorting the solution for the parameter vector w.

The parameter vector w obtained this way has inflated entries, which biases the model or renders it unusable. Exact correlation and high correlation are together called multicollinearity. Under multicollinearity, the model either cannot be built or cannot be used.
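The contrast between exact and high correlation can be reproduced with matrices like the ones described above (the exact entries here are illustrative reconstructions, not the original figures):

```python
import numpy as np

# Exact correlation: the third row is exactly twice the first, as in Matrix A.
A = np.array([[1.0, 1.0, 2.0],
              [0.0, 1.0, 3.0],
              [2.0, 2.0, 4.0]])
print(np.linalg.det(A))  # 0: np.linalg.inv(A) would raise LinAlgError

# High correlation: break the exact dependence by just 0.02, as in Matrix B.
B = A.copy()
B[2, 2] = 4.02
print(np.linalg.det(B))                # 0.02: close to 0, but not 0
print(np.abs(np.linalg.inv(B)).max())  # the inverse exists, with huge entries
```

Dividing the adjugate by a determinant of 0.02 inflates the inverse's entries by a factor of 50, which is exactly how near-singularity feeds into an oversized parameter vector w.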

Conversely, the rows of Matrix C are mutually independent; its triangular form looks perfectly normal, and no diagonal element is particularly close to 0, so its determinant is neither 0 nor near 0. The parameter vector w derived from Matrix C therefore carries little bias, and the fit is relatively ideal.

From the whole process above we can see that for a matrix to be full rank, its column vectors must be free of multicollinearity; this is exactly the requirement the linear regression algorithm places on the feature matrix.

Error Quick Reference

  • Unstable parameter estimates. Root cause: multicollinearity in the data makes the design matrix X^T X non-invertible. Fix: check for linear correlation between feature variables; stabilize the solution with Ridge Regression or Lasso Regression.
  • Inflated model variance. Root cause: X^T X is nearly singular due to collinearity. Fix: check whether the determinant is close to zero; introduce a regularization term (e.g., Ridge Regression) to reduce parameter variance.
  • Large prediction error. Root cause: the solution is non-unique because of multicollinearity. Fix: check whether feature variables in the input dataset are highly correlated; use Principal Component Regression (PCR) or Partial Least Squares Regression (PLSR).
  • Non-unique solution. Root cause: X^T X is singular and cannot be inverted. Fix: check whether the feature count exceeds the sample count (n < p); reduce the number of features or apply regularization.