Big Data 206 - Hand-Written Multivariate Linear Regression with NumPy Matrix Multiplication

TL;DR

  • Scenario: implement linear regression through matrix operations using pandas + NumPy (no sklearn) and evaluate the quality of the fit
  • Conclusion: the normal equation w = (X^T X)^{-1} X^T y yields the weights directly; a singular X^T X must be handled; R² gives an intuitive measure of fit quality
  • Output: a runnable regression solver + an SSE function + an R² function + a visualization of the fitted line

Multivariate Linear Regression Implementation

Using Matrix Multiplication to Write Regression Algorithm

Writing a multivariate linear regression function is not complex. It mainly involves a large amount of matrix computation, which we carry out with NumPy's matrix data format.

First execute standard imports:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Now write the linear regression function. As before, assume the input dataset is a DataFrame whose last column holds the label values:

def standRegres(dataSet):
    xMat = np.mat(dataSet.iloc[:,:-1].values) # Extract features as a matrix
    yMat = np.mat(dataSet.iloc[:,-1].values).T # Extract labels as a column vector
    xTx = xMat.T*xMat
    if np.linalg.det(xTx) == 0: # X^T X is singular, so it has no inverse
        print('This matrix is singular, cannot do inverse')
        return
    ws = xTx.I * (xMat.T*yMat) # Normal equation: w = (X^T X)^{-1} X^T y
    return ws

We need to check whether X^T X is full rank; if it is not, the matrix cannot be inverted. With the function defined, we can test how it runs. Here we generate random numbers that follow a linear rule and fit them with the multivariate linear regression equation. Note that when solving the regression equation with matrix methods, a column of all 1s must be added to represent the intercept w0.

rng = np.random.RandomState(1) # Set random seed
x = 5*rng.rand(100) # 100 random numbers in [0,5)
y = 2*x-5+rng.randn(100) # True rule label values
X = pd.DataFrame(x)
Y = pd.DataFrame(y)
ex = pd.DataFrame(np.ones([100,1])) # Add column of all 1s to represent intercept
data = pd.concat([ex,X,Y],axis=1)

The data meets the basic modeling requirements, so now run the function:

ws = standRegres(data)
ws

The returned result contains the weight for each feature column. Since the first column is all 1s, the first component of the result is the intercept. We can then visualize the model fit:

yhat = data.iloc[:,:-1].values*ws # Predicted label values
plt.plot(data.iloc[:,1],data.iloc[:,2],'o') # Original data points
plt.plot(data.iloc[:,1],yhat) # Fitting line
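As a cross-check not in the original walkthrough, the same weights can be obtained without an explicit inverse via np.linalg.lstsq, which solves the least-squares problem by SVD and therefore also copes with a rank-deficient X^T X. This is a sketch; the function name standRegresLstsq is ours:

```python
import numpy as np
import pandas as pd

def standRegresLstsq(dataSet):
    # Assumed helper (name is ours): same interface as standRegres, but
    # solves min ||Xw - y||^2 via SVD instead of inverting X^T X, so it
    # also works when X^T X is rank-deficient.
    X = dataSet.iloc[:, :-1].values.astype(float)  # features (incl. the column of 1s)
    y = dataSet.iloc[:, -1].values.astype(float)   # labels
    ws, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    return ws.reshape(-1, 1)  # column vector, matching standRegres's output shape
```

On the synthetic data above it recovers weights close to the true intercept -5 and slope 2.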

Algorithm Evaluation Metrics

The residual sum of squares SSE is defined as:

SSE = Σ_{i=1}^{m} (y_i - ŷ_i)^2

where m is the number of records in the dataset, y_i is the actual label value, and ŷ_i is the predicted value. It can be computed with the following expressions:

y = data.iloc[:,-1].values
yhat = np.asarray(yhat).flatten() # Convert the matrix to a 1-D array before subtracting
SSE = np.power(yhat-y,2).sum()
print(SSE)

This can also be written as a complete function; for reusability, its input parameters are the dataset and the regression method:

def sseCal(dataSet,regres):
    y = dataSet.iloc[:,-1].values # Actual labels
    ws = regres(dataSet) # Solve for the weights with the supplied method
    yhat = dataSet.iloc[:,:-1].values*ws # Predicted labels
    yhat = np.asarray(yhat).flatten() # Flatten to a 1-D array for element-wise subtraction
    SSE = np.power(yhat-y,2).sum()
    return SSE

Test it:

sseCal(data,standRegres)

Also, in regression, to remove the influence of dataset size on the residual sum of squares, the mean squared error MSE is often computed:

MSE = SSE / m = (1/m) Σ_{i=1}^{m} (y_i - ŷ_i)^2

where m is the number of samples; the root mean squared error RMSE is simply the square root of MSE, i.e. RMSE = √MSE. Beyond these, the most commonly used metric is the coefficient of determination R². Computing it relies on the concepts of the between-group sum of squares and the sum of squared deviations introduced earlier.
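The SSE, MSE, and RMSE formulas can be checked on a small hand-computable example (the values below are illustrative, not from the article's dataset):

```python
import numpy as np

# Toy actual and predicted labels standing in for y and yhat
y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.1, 1.9, 3.2, 3.8])

m = y.shape[0]                       # number of samples
SSE = np.power(yhat - y, 2).sum()    # residual sum of squares: 0.01+0.01+0.04+0.04 = 0.10
MSE = SSE / m                        # mean squared error, independent of sample count
RMSE = np.sqrt(MSE)                  # root mean squared error, same units as y
```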

In regression analysis, SSR (sum of squares of the regression) plays a role similar to the between-group sum of squares in clustering: it is the sum of squared differences between the predicted values and the label mean:

SSR = Σ_{i=1}^{m} (ŷ_i - ȳ)^2

SST (total sum of squares) is the sum of squared differences between the actual values and the mean:

SST = Σ_{i=1}^{m} (y_i - ȳ)^2

By analogy with the clustering analysis introduced earlier, this can be viewed as clustering around a single point. Much like the silhouette coefficient, the coefficient of determination combines a "within-group error" and a "between-group error" into a single metric.
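For ordinary least squares with an intercept, the three sums of squares satisfy the identity SST = SSR + SSE exactly. A quick numerical check on the article's synthetic data, using np.polyfit as a stand-in solver:

```python
import numpy as np

rng = np.random.RandomState(1)
x = 5 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)

slope, intercept = np.polyfit(x, y, 1)   # OLS fit with intercept
yhat = slope * x + intercept

SSE = np.sum((y - yhat) ** 2)            # residual sum of squares
SSR = np.sum((yhat - y.mean()) ** 2)     # regression sum of squares
SST = np.sum((y - y.mean()) ** 2)        # total sum of squares

# The decomposition holds exactly for OLS with an intercept term
assert np.isclose(SST, SSR + SSE)
```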

The coefficient of determination R² measures how well the regression line fits the observed data:

R² = SSR / SST = 1 - SSE / SST

  • If all observed points fall on the line, SSE = 0 and R² = 1: the fit is perfect
  • If the variation of y has nothing to do with x, x is completely unhelpful in explaining y's variation, and R² = 0

It follows that R² ranges over [0, 1]:

  • The closer R² is to 1, the larger the share of the total sum of squares accounted for by the regression sum of squares, the closer the regression line lies to the observed points, and the more of y's variation is explained by changes in x; the fit is better.
  • Conversely, the closer R² is to 0, the worse the fit of the regression line.

Next, compute the coefficient of determination for the fit above:

sse = sseCal(data,standRegres)
y = data.iloc[:,-1].values
sst = np.power(y-y.mean(),2).sum()
1-sse/sst

The result is about 0.91, so the final fit is quite good. We can also wrap the coefficient-of-determination computation in a function, again leaving the regression method as a pluggable parameter.

def rSquare(dataSet,regres):
    sse = sseCal(dataSet,regres)
    y = dataSet.iloc[:,-1].values
    sst = np.power(y-y.mean(),2).sum()
    return 1-sse/sst

Then test:

rSquare(data,standRegres)
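As a sanity check not in the original text: for simple one-feature linear regression, R² equals the squared Pearson correlation between x and y. A sketch using np.polyfit in place of the hand-written solver:

```python
import numpy as np

rng = np.random.RandomState(1)
x = 5 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)

slope, intercept = np.polyfit(x, y, 1)
yhat = slope * x + intercept

r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]   # Pearson correlation between x and y

# For one feature, R^2 coincides with the squared correlation
assert np.isclose(r2, r ** 2)
```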

Error Quick Reference

  • Symptom: an immediate SyntaxError (stray Chinese text "矩阵,无法求解逆矩阵" — "the matrix is singular, cannot invert" — inside the regression function)
    Root cause: non-comment Chinese text mixed into the body of standRegres() after print(...)
    Fix: delete that line or turn it into a comment (# ...)

  • Symptom: IndentationError / NameError (the R² function will not run)
    Root cause: the body of rSquare is not indented
    Fix: indent the sse = ... / y = ... / sst = ... / return ... lines under def rSquare(dataSet,regres):

  • Symptom: prints "This matrix is singular, cannot do inverse" and returns None
    Root cause: X^T X is singular (collinear features, too few samples, duplicated features, or a mishandled intercept column)
    Fix: in the np.linalg.det(xTx) == 0 branch use np.linalg.pinv(xTx), or solve directly with np.linalg.lstsq(X, y); alternatively remove collinear features or regularize

  • Symptom: ValueError: shapes ... not aligned
    Root cause: the dimensions of X and ws do not match (inconsistent column count, or the label column in the wrong position)
    Fix: ensure that in yhat = data.iloc[:,:-1].values*ws the last column is the label, that the feature column count matches the row count of ws, and that the intercept column is added consistently

  • Symptom: weights obviously unreasonable (intercept/coefficients far from expectation)
    Root cause: intercept column missing, features not standardized, or scrambled column order
    Fix: after building ex/X/Y and calling concat, fix the column order as [1, x1, x2, ..., y]; standardize features if needed, using the same processing flow throughout

  • Symptom: TypeError/UFuncTypeError (matrix multiplication fails)
    Root cause: the DataFrame contains strings, categorical values, or NaN
    Fix: before training, call astype(float) on np.mat(dataSet.iloc[:,:-1].values), handle missing values (drop rows or fill), and encode categorical features first

  • Symptom: R² is negative or far below expectation
    Root cause: the model underfits, the data is noisy, features are irrelevant, or the train/evaluate columns are misaligned
    Fix: verify that y and yhat correspond one-to-one in 1 - sse/sst; check that the label is the last column; add informative features or switch to a more suitable model

  • Symptom: SSE is very large but the plot looks fine
    Root cause: SSE is sensitive to sample size and scale
    Fix: alongside SSE = ...sum(), also report MSE/RMSE; compare metrics on the same scale; rescale y if needed
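The pseudoinverse fix suggested above can be sketched as follows (the name standRegresPinv is ours, not from the original code). np.linalg.pinv returns the minimum-norm least-squares solution, so the function keeps working even when X^T X is singular, e.g. with duplicated feature columns:

```python
import numpy as np
import pandas as pd

def standRegresPinv(dataSet):
    # Variant of standRegres that never fails on a singular X^T X:
    # the Moore-Penrose pseudoinverse yields the minimum-norm solution.
    xMat = dataSet.iloc[:, :-1].values.astype(float)                 # features
    yMat = dataSet.iloc[:, -1].values.astype(float).reshape(-1, 1)   # labels
    ws = np.linalg.pinv(xMat.T @ xMat) @ xMat.T @ yMat
    return ws
```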