Big Data 205 - Linear Regression: A Machine Learning Perspective

Overview

Before formally discussing regression algorithms, we need to take a closer look at regression problems within supervised learning. Although regression belongs to the supervised learning category, regression problems are in practice considerably more complex than classification problems.

First, compare the outputs. A classification model ultimately outputs a discrete variable. A discrete variable carries relatively little information and does not support algebraic operations, so its evaluation system is comparatively simple; the most commonly used tools are the confusion matrix and the ROC curve.

A regression problem, by contrast, outputs a continuous variable. A continuous variable not only supports algebraic operations but also allows more refined analysis, with the hope of uncovering more of the underlying mechanisms at work. In other words, a regression model describes the objective laws of a phenomenon more comprehensively and completely, and can therefore yield finer-grained conclusions. As a result, regression models tend to be more complex, the data required for modeling must carry more information, and more issues may arise during the modeling process.

Linear Regression and Machine Learning

Linear regression is the most commonly used algorithm for solving regression-type problems. Its ideas and basic principles grew out of multivariate statistical analysis, yet it remains a remarkably effective model in data mining and machine learning. On the one hand, the machine learning ideas embedded in linear regression are worth studying in their own right; on the other, many powerful non-linear models have over time been built on top of linear regression.

Therefore, when studying machine learning algorithms, we still need to study linear regression, a statistical-analysis algorithm, systematically and in depth. Note, however, that linear regression can be understood from many angles, and here we do not need to understand, master, and apply it from the statistical perspective. In many cases, viewing linear regression through the lens of machine learning is the better way to understand it, and that is the angle we take in this section.

Machine Learning Representation of Linear Regression

Core Logic

Every machine learning algorithm rests on a fundamental piece of core logic. To understand linear regression with machine learning thinking, we first need to uncover that core logic. Fortunately, although linear regression originates in statistical analysis, its core logic aligns closely with that of machine learning algorithms.

Suppose an objective phenomenon is described by n attributes, X = (x1, x2, ..., xn), where each xi records the value of the phenomenon on one dimension for a single observation.

When we build a machine learning model to capture the objective laws governing the phenomenon, we essentially want to describe its final outcome comprehensively using these dimensional attributes. The simplest way to combine the attributes is a weighted sum, which is exactly the form of the linear regression equation:

ŷ = w0 + w1x1 + w2x2 + ... + wnxn

The entries of w are collectively called the model parameters: w0 is the intercept and w1 through wn are the regression coefficients (sometimes written as β or θ). In essence this is the same as the y = ax + b we have known since elementary school. Here y is the target variable, also called the label, and xi1 through xin are the different features of sample i.

If we have m samples, the regression result for each sample can be written:

ŷi = w0 + w1xi1 + w2xi2 + ... + wnxin,  i = 1, 2, ..., m

Here y is the column vector containing the regression results for the m samples (its structure is (m, 1); because it has only one column, it is written as a column and called a column vector). Note that we usually use bold lowercase letters for vectors and bold uppercase letters for matrices.

This equation can be written in matrix form, where w is a column vector of structure (n+1, 1) and X is the feature matrix of structure (m, n+1), whose first column is all ones so that the intercept is absorbed into the product:

ŷ = Xw
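The matrix form can be checked directly with NumPy. Below is a minimal sketch with made-up numbers: the all-ones bias column folds the intercept w0 into a single matrix product, so the shapes (m, n+1) × (n+1, 1) → (m, 1) work out exactly as described above.

```python
import numpy as np

# Hypothetical data: m = 4 samples, n = 2 raw features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 3.0],
                  [4.0, 5.0]])
m = X_raw.shape[0]

# Prepend a column of ones so the intercept w0 is absorbed into the product:
# X has structure (m, n+1), w has structure (n+1, 1).
X = np.hstack([np.ones((m, 1)), X_raw])

w = np.array([[0.5],   # w0: intercept
              [1.0],   # w1
              [2.0]])  # w2

y_hat = X @ w          # structure (m, 1): one prediction per sample
print(y_hat.shape)     # (4, 1)
```

The first row, for example, evaluates to 0.5 + 1.0*1.0 + 2.0*2.0 = 5.5, the same result as writing out the weighted sum by hand.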

The task of linear regression is to construct a prediction function that maps the linear relationship between the input feature matrix X and the label vector y. This prediction function is written differently in different textbooks, e.g., f(x), y_w(x), or h(x), but whatever the notation, the prediction function is, in essence, the model we need to construct.

Optimization Target

For linear regression, the prediction function ŷ = Xw is our model, which in machine learning is also called the "decision function". Only w is unknown, so the core of linear regression is solving for the model's parameter vector w. But how do we solve for it? We need a concept called the loss function.

In earlier algorithm studies we distinguished two kinds of model performance: performance on the training set and performance on the test set. When modeling, we pursue optimal performance on the test set, so model evaluation metrics are usually used to measure test-set performance. Linear regression, however, must solve for the parameters w from the training data, and we want the trained model to fit the training data as well as possible, i.e., to push its prediction accuracy on the training set as close to 100% as we can.

We therefore use a loss function as the evaluation metric: it measures how much information is lost when fitting the training set with the coefficients w, and thereby assesses the quality of the parameters w. If, after modeling with a set of parameters, the model performs well on the training set, we say the loss during fitting is small, the loss function value is small, and this set of parameters is good. Conversely, if the model performs badly on the training set, the loss function is large, the model is undertrained and performs poorly, and this set of parameters is relatively poor.

In other words, when solving for the parameters w, we seek the minimum of the loss function, so that the model fits the training data as well as possible, i.e., its prediction accuracy approaches 100%.

For supervised learning algorithms, modeling is based on a labeled dataset, and regression problems make quantitative judgments about objective phenomena. Let yi be the label of row i, where yi is a continuous variable, and let xi be the vector of features in row i. The optimization direction of linear regression modeling is then to make the model's prediction ŷi as close as possible to the actual yi. For continuous variables, this closeness can be measured with SSE, the sum of squared errors (also called the residual sum of squares or the sum of squared deviations). The optimization target can therefore be expressed as:

min over w of SSE = (y1 − ŷ1)² + (y2 − ŷ2)² + ... + (ym − ŷm)² = Σ (yi − ŷi)²

Consider a simple example. Suppose w is the vector [1, 2], so the solved model is ŷ = x1 + 2x2.

The loss function value is then SSE = Σ (yi − (xi1 + 2xi2))², computed over all training samples.
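The SSE for this example model can be computed in a few lines. The three training samples below are hypothetical, chosen only to make the arithmetic easy to follow; the model ŷ = x1 + 2x2 (w = [1, 2], no intercept) comes from the example above.

```python
import numpy as np

# Hypothetical training data for the model y_hat = x1 + 2*x2 (w = [1, 2], no intercept).
X = np.array([[1.0, 1.0],
              [2.0, 0.5],
              [3.0, 2.0]])
y = np.array([3.5, 3.0, 6.0])

w = np.array([1.0, 2.0])
y_hat = X @ w                      # predictions: [3.0, 3.0, 7.0]
sse = np.sum((y - y_hat) ** 2)     # (0.5)^2 + 0^2 + (-1)^2 = 1.25
print(sse)
```

Each squared residual contributes to the total, so a single badly predicted sample (the third one here) can dominate the loss.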

Least Squares Method

The problem now becomes solving for the parameter vector w that minimizes SSE. This method of solving for parameters by minimizing the SSE between actual and predicted values is called the least squares method.

Let us review the solving process for univariate linear regression. Given a set of scatter points (xi, yi), we fit them with a line.

Assume the fitted line is ŷ = w0 + w1x. Our goal is to make the total residual of this fitted line as small as possible, i.e., to minimize the SSE.

This line is a "regression to the mean" centered on the means: it must pass through the point given by the mean of x and the mean of y, i.e.

ȳ = w0 + w1x̄

For each actual yi we can write:

yi = w0 + w1xi + εi

Here εi is the residual. Rearranging, the residual sum of squares SSE is:

SSE = Σ εi² = Σ (yi − w0 − w1xi)²

To find the minimum residual sum of squares, we use calculus: take partial derivatives and set them to zero to find the extremum. Solving for the w1 that minimizes the residual gives:

w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,  w0 = ȳ − w1x̄
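The closed-form solution above translates directly into code. The five data points below are hypothetical, chosen so the arithmetic is easy to verify by hand:

```python
import numpy as np

# Hypothetical 1-D data; the closed-form estimates follow the derivation above:
# w1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2),  w0 = y_bar - w1 * x_bar
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar            # the fitted line passes through (x_bar, y_bar)
print(w0, w1)                      # 0.15 1.95
```

Note how w0 is recovered from the mean-crossing property derived above, rather than from a second optimization.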

Solving Parameters for Multivariate Linear Regression

In the more general case, if the example above has two feature columns, the loss function is:

SSE = Σ (yi − w0 − w1xi1 − w2xi2)²

Expressed in matrix form, this becomes:

SSE = (y − Xw)ᵀ(y − Xw) = ‖y − Xw‖²

Matrix multiplication multiplies the corresponding unknown elements and adds them up, yielding the same result as the formula above. Taking the derivative with respect to w:

∂SSE/∂w = −2Xᵀ(y − Xw) = 2(XᵀXw − Xᵀy)
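A useful sanity check on a matrix-calculus derivation is to compare the analytic gradient 2(XᵀXw − Xᵀy) against a numerical finite-difference approximation. The random data below is purely illustrative (and the bias column is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # hypothetical feature matrix (bias column omitted for brevity)
y = rng.normal(size=10)
w = rng.normal(size=3)

def sse(w):
    r = y - X @ w
    return r @ r

# Analytic gradient from the derivation: dSSE/dw = 2 (X^T X w - X^T y)
grad = 2 * (X.T @ X @ w - X.T @ y)

# Central finite-difference approximation of each component
eps = 1e-6
num = np.array([(sse(w + eps * np.eye(3)[j]) - sse(w - eps * np.eye(3)[j])) / (2 * eps)
                for j in range(3)])
print(np.max(np.abs(grad - num)))   # tiny: the two gradients agree
```

Because SSE is quadratic in w, the central difference matches the analytic gradient up to floating-point error, which confirms the sign and the factor of 2 in the derivative.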

Setting this first derivative to zero yields:

XᵀXw − Xᵀy = 0,  i.e.  XᵀXw = Xᵀy

At this point, we want to keep w on the left side of the equation and move everything involving the feature matrix to the right side, so that we can solve for the optimal w:

w = (XᵀX)⁻¹Xᵀy

provided XᵀX is invertible.
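The normal equation can be solved numerically in one line. The data below is synthetic: we generate samples from known coefficients (here a hypothetical true_w) plus small noise and check that the solver recovers them. Solving the linear system with np.linalg.solve is preferred over forming the explicit inverse (XᵀX)⁻¹, which is less numerically stable.

```python
import numpy as np

# Synthetic data; solve the normal equation X^T X w = X^T y.
rng = np.random.default_rng(42)
m, n = 50, 3
X_raw = rng.normal(size=(m, n))
true_w = np.array([0.5, 1.0, -2.0, 3.0])          # intercept followed by 3 coefficients
X = np.hstack([np.ones((m, 1)), X_raw])           # bias column absorbs the intercept
y = X @ true_w + 0.01 * rng.normal(size=m)        # labels with small noise

# Prefer solve() over computing the explicit inverse of X^T X.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # close to true_w
```

With only mild noise, the recovered w matches true_w to a few decimal places, which is exactly what the closed-form solution promises.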

Error Quick Reference

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Dimension mismatch: Xw can't multiply / broadcast error | Intercept not merged into the features (missing all-ones column); inconsistent sample/feature dimension definitions | Clarify X ∈ R^{m×(n+1)}, w ∈ R^{(n+1)×1}, y ∈ R^{m×1}; add a bias column to X; unify notation and dimension conventions |
| Normal equation fails / not invertible: (XᵀX)⁻¹ does not exist | Multicollinearity, redundant features, or too few samples making XᵀX singular or ill-conditioned | Check rank and condition number; check for highly correlated features; check whether m is much smaller than n |
| Huge coefficients, abnormal signs, extreme sensitivity to noise | Large differences in feature scale; ill-conditioned matrix; outliers skewing the fit | Check feature scales and distributions; check the condition number and outliers |
| Good training fit but poor generalization | Pursuing minimum SSE leads to overfitting; too many features | Compare training and test error; use cross-validation |
| Loss value "very large/small" and hard to interpret | SSE grows with sample size and scale; not directly comparable with MSE/RMSE | Check whether you are using SSE rather than MSE/RMSE; account for sample-size changes |
| Derivation correct but implementation disagrees | Derivation assumptions inconsistent with implementation: centering, intercept handling, whether solved in matrix form in one step | Align: whether a bias column was added, whether centering/standardization was applied, whether the loss definition matches |
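Several of the diagnostics in the table can be run directly with NumPy. The sketch below manufactures a nearly collinear feature on purpose (hypothetical data) to show how the rank and condition number flag a problem before the normal equation silently produces unstable coefficients:

```python
import numpy as np

# Diagnostic sketch: detect an ill-conditioned X^T X before trusting the normal equation.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 1e-8 * rng.normal(size=100)     # deliberately almost identical to x1
X = np.column_stack([np.ones(100), x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(X))           # numerical rank of the feature matrix
print(np.linalg.cond(XtX))                # huge condition number flags multicollinearity

# lstsq solves via SVD and tolerates rank deficiency, a safer default than solve()
w, *_ = np.linalg.lstsq(X, x1 + 2 * x2, rcond=None)
```

When the condition number is astronomically large, as here, the individual coefficients on x1 and x2 are essentially arbitrary even though their combined prediction is fine, which is exactly the "huge coefficients, abnormal signs" symptom in the table.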