Big Data 206 - Hand-Written Multivariate Linear Regression with NumPy Matrix Multiplication
TL;DR
- Scenario: implement linear regression from scratch with pandas + NumPy matrix operations (no sklearn) and evaluate the fit
- Conclusion: the normal equation w=(X^TX)^{-1}X^Ty yields the weights directly; singular matrices need special handling; R² gives an intuitive measure of fit quality
- Output: a runnable regression solver + an SSE function + an R² function + a plot of the fitted line
Multivariate Linear Regression Implementation
Using Matrix Multiplication to Write Regression Algorithm
Writing a multivariate linear regression function is not complicated; the work is mostly matrix arithmetic, which we carry out with NumPy's matrix data format.
First execute standard imports:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
Write the linear regression function. As before, assume the input dataset is a DataFrame whose last column holds the label values:

```python
def standRegres(dataSet):
    xMat = np.mat(dataSet.iloc[:, :-1].values)   # Feature matrix X
    yMat = np.mat(dataSet.iloc[:, -1].values).T  # Label column vector y
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0:
        print('This matrix is singular, cannot do inverse')
        return
    ws = xTx.I * (xMat.T * yMat)  # Normal equation: (X^T X)^{-1} X^T y
    return ws
```
We check whether X^TX is full rank because a rank-deficient matrix cannot be inverted. With the function defined, we can test it. Here we generate random numbers that follow a linear rule and fit them with the multivariate linear regression equation. Note that when solving the regression equation via matrix decomposition, a column of all 1s must be appended to represent the intercept w_0.
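When X^TX is singular, the inverse-based solver above has no answer to give. As an alternative sketch (not part of the original code), `np.linalg.lstsq` solves the same least-squares problem via SVD and also copes with rank-deficient design matrices; here it is compared against the normal equation on the same synthetic data used below:

```python
import numpy as np

# Sketch: solving the least-squares problem without explicitly inverting
# X^T X. The design matrix X already includes the column of ones.
rng = np.random.RandomState(1)
x = 5 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)
X = np.column_stack([np.ones(100), x])   # design matrix [1, x]

# Normal equation: only valid when X^T X is invertible
ws_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# SVD-based solver: no explicit inverse required
ws_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(ws_normal)   # intercept then slope, close to -5 and 2
print(ws_lstsq)    # agrees with the normal-equation solution
```

Both solvers return the same weights on full-rank data; the SVD route simply degrades gracefully when columns are collinear.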
```python
rng = np.random.RandomState(1)        # Fix the random seed
x = 5 * rng.rand(100)                 # 100 random numbers in [0, 5)
y = 2 * x - 5 + rng.randn(100)        # Labels: true rule y = 2x - 5 plus noise
X = pd.DataFrame(x)
Y = pd.DataFrame(y)
ex = pd.DataFrame(np.ones([100, 1]))  # Column of ones representing the intercept
data = pd.concat([ex, X, Y], axis=1)
```
The data now meets the basic modeling requirements, so we run the function:

```python
ws = standRegres(data)
ws
```
The result holds the weight for each feature column. Because the first column is all ones, the first component of the result is the intercept. We can then visualize the model fit:

```python
yhat = data.iloc[:, :-1].values * ws             # Predicted label values
plt.plot(data.iloc[:, 1], data.iloc[:, 2], 'o')  # Original data points
plt.plot(data.iloc[:, 1], yhat)                  # Fitted line
```
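As a sanity check on the recovered weights (not part of the original walkthrough), the same data can be fit with `np.polyfit`, which solves the identical least-squares problem for a degree-1 polynomial:

```python
import numpy as np

# Regenerate the same data (seed 1) and fit a straight line with np.polyfit.
rng = np.random.RandomState(1)
x = 5 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)

slope, intercept = np.polyfit(x, y, 1)  # coefficients, highest degree first
print(slope, intercept)                 # close to the true rule y = 2x - 5
```

Agreement between the two routes confirms the intercept column and matrix arithmetic are wired up correctly.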
Algorithm Evaluation Metrics
The residual sum of squares (SSE) is calculated as:

SSE = Σ_{i=1}^{m} (y_i − ŷ_i)²

where m is the number of records in the dataset, y_i is the actual label value, and ŷ_i is the predicted value. It can be computed with the following expression:

```python
y = data.iloc[:, -1].values
yhat = np.array(yhat).flatten()    # Convert the prediction column to a 1-D array
SSE = np.power(yhat - y, 2).sum()
print(SSE)
```
We can also wrap this in a function; for reusability, the inputs are the dataset and the regression method:

```python
def sseCal(dataSet, regres):
    y = dataSet.iloc[:, -1].values
    ws = regres(dataSet)                      # Solve for the weights
    yhat = dataSet.iloc[:, :-1].values * ws   # Predicted labels
    yhat = np.array(yhat).flatten()
    SSE = np.power(yhat - y, 2).sum()
    return SSE
```
Test run:

```python
sseCal(data, standRegres)
```
Also, to remove the effect of dataset size on the residual sum of squares, regression work often reports the mean squared error (MSE):

MSE = SSE / m = (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)²

where m is the number of samples; the root mean squared error (RMSE) is the square root of MSE. Most commonly used, however, is the coefficient of determination R², whose calculation relies on the between-group sum of squares and the sum of squared deviations introduced earlier.
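A minimal sketch of these size-normalized metrics, assuming `y` and `yhat` are plain 1-D arrays (`np.linalg.lstsq` stands in for `standRegres` here to keep the block self-contained):

```python
import numpy as np

# MSE divides SSE by the sample count m; RMSE is its square root,
# which puts the error back on the same scale as y.
rng = np.random.RandomState(1)
x = 5 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)
X = np.column_stack([np.ones(100), x])
yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

sse = np.power(yhat - y, 2).sum()
mse = sse / len(y)       # MSE = SSE / m
rmse = np.sqrt(mse)      # RMSE = sqrt(MSE)
print(mse, rmse)         # rmse should sit near the noise level (std 1)
```

Because the synthetic noise has standard deviation 1, the RMSE landing near 1 is a quick plausibility check on the fit.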
In regression analysis, SSR (sum of squares of the regression) plays a role similar to the between-group sum of squares in clustering: it is the sum of squared differences between the predicted values and the label mean:

SSR = Σ_{i=1}^{m} (ŷ_i − ȳ)²

And SST (total sum of squares) is the sum of squared differences between the actual values and the mean:

SST = Σ_{i=1}^{m} (y_i − ȳ)²
By analogy with the clustering analysis introduced earlier, regression can be viewed as grouping points around a line. Much like the silhouette coefficient, the coefficient of determination combines a “within-group error” and a “between-group error” into a single metric.

The coefficient of determination R² measures how well the regression line fits the observed data:

R² = 1 − SSE / SST

- If all observed points fall on the line, SSE = 0 and R² = 1: the fit is perfect.
- If the variation in y has nothing to do with x, then x is completely unhelpful in explaining y's variation, and R² = 0.

So the value range of R² is [0, 1]:

- The closer R² is to 1, the larger the share of the total sum of squares accounted for by the regression, the closer the regression line lies to the observed points, and the more of y's variation (also called deviation) is explained by changes in x: the fit is better.
- Conversely, the closer R² is to 0, the worse the fit.
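For the single-feature case there is a handy cross-check on R²: with an intercept in the model, it equals the squared Pearson correlation between x and y. A small sketch (again with `np.linalg.lstsq` standing in for `standRegres` so the block runs on its own):

```python
import numpy as np

# R^2 = 1 - SSE/SST; for simple linear regression with an intercept,
# this equals the squared correlation coefficient between x and y.
rng = np.random.RandomState(1)
x = 5 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)
X = np.column_stack([np.ones(100), x])
yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

sse = np.power(y - yhat, 2).sum()
sst = np.power(y - y.mean(), 2).sum()
r2 = 1 - sse / sst
r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)   # the two values agree
```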
Next, compute the coefficient of determination for the fit above:

```python
sse = sseCal(data, standRegres)
y = data.iloc[:, -1].values
sst = np.power(y - y.mean(), 2).sum()
1 - sse / sst
```
The result is about 0.91, so the fit is quite good. We can also encapsulate the R² computation in a function, again leaving the regression method as a pluggable parameter:
```python
def rSquare(dataSet, regres):
    sse = sseCal(dataSet, regres)            # Residual sum of squares
    y = dataSet.iloc[:, -1].values
    sst = np.power(y - y.mean(), 2).sum()    # Total sum of squares
    return 1 - sse / sst
```

Then test:

```python
rSquare(data, standRegres)
```
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| SyntaxError raised immediately (stray Chinese text "矩阵,无法求解逆矩阵", i.e. "matrix, cannot invert", in the regression function) | Non-comment Chinese text mixed into the body of standRegres() after print(...) | Delete that line or turn it into a comment # ... |
| IndentationError / NameError (the R² function won't run) | The rSquare function body is not indented | Indent sse=.../y=.../sst=.../return ... under def rSquare(dataSet,regres): |
| Prints "This matrix is singular, cannot do inverse" and returns None | X^TX is singular (collinear features / too few samples / duplicate features / mishandled intercept column) | In the np.linalg.det(xTx)==0 branch, fall back to np.linalg.pinv(xTx) or use np.linalg.lstsq(X, y) directly; or remove collinear features / regularize |
| ValueError: shapes ... not aligned | X and ws dimensions do not match (inconsistent column count, label column in the wrong position) | Ensure the label is the last column when computing yhat = data.iloc[:,:-1].values*ws; the feature column count must match the row count of ws; be consistent about whether the intercept column was added |
| Weights obviously unreasonable (intercept/coefficients far from expectation) | Intercept column missing, features not standardized, or columns out of order | After building data from ex/X/Y with concat, verify the column order [1, x1, x2, ..., y]; standardize features if needed and keep the processing pipeline consistent |
| TypeError/UFuncTypeError (matrix multiplication fails) | The DataFrame contains strings/categoricals/NaN | Before training, apply astype(float) to np.mat(dataSet.iloc[:,:-1].values), handle missing values (drop rows or fill), and encode categorical features first |
| R² negative or far below expectation | Model underfits, data is noisy, features are irrelevant; or train/evaluate columns are misaligned | Check the 1 - sse/sst computation and verify y and yhat line up one-to-one; confirm the label is the last column; add informative features or use a more suitable model |
| SSE very large but the plot looks fine | SSE is sensitive to sample size and scale | Alongside SSE, also report MSE/RMSE; compare metrics on the same scale; rescale y if needed |
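To make the singular-matrix row concrete, here is a small sketch (the duplicated column is an artificial example, not from the original text): copying a feature column makes X^TX rank-deficient, so the inverse does not exist, yet `np.linalg.pinv` still yields a usable fit:

```python
import numpy as np

# A duplicated feature column makes X^T X rank-deficient, so the
# inverse does not exist; the pseudo-inverse still minimizes SSE.
rng = np.random.RandomState(1)
x = 5 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)
X = np.column_stack([np.ones(100), x, x])   # x appears twice -> collinear

xTx = X.T @ X
print(np.linalg.matrix_rank(xTx))           # 2: not full rank (3)

ws = np.linalg.pinv(xTx) @ X.T @ y          # minimum-norm least-squares fit
yhat = X @ ws
print(np.power(yhat - y, 2).sum())          # SSE matches the full-rank model
```

Note that a floating-point determinant of a singular matrix may come out tiny but nonzero, so a rank check (as above) is a more robust diagnostic than comparing `np.linalg.det` to exactly 0.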