Big Data 209 - Deep Understanding of Logistic Regression

Overview

Classification is an important component of machine learning and data mining applications; it is often said that about 70% of problems in data science are classification problems. Many algorithms exist for classification: KNN classifies by distance calculation; Decision Tree builds an intuitive tree structure for classification. Here we discuss Logistic Regression, a common method for binary classification problems. LR mainly finds the optimal parameters that correctly classify the original data.

Basic Principles

Logistic Regression (LR) is actually a somewhat misleading name. Although the name contains “regression”, it is best suited to classification problems. The LR classifier fits a variety of classification tasks: positive/negative sentiment analysis (binary), user click-through prediction (binary), user default prediction (binary), spam detection (binary), disease prediction (binary), user level classification (multi-class), and other scenarios.

When the least-squares optimal solution is unavailable, or the conditions of plain linear regression are not met and the original algorithm must be corrected, practical applications introduce many variants of linear regression. These variants can be unified as:

y = g⁻¹(wᵀx + b)

Where g(·) is a differentiable function, this type of model is called a Generalized Linear Model, and g is called the link function. The now widely known Logistic Regression is one of many such generalized regression algorithms.

In Logistic Regression, the Logistic Function is used as g⁻¹(·). The Logistic Function has the form:

g(z) = 1 / (1 + e⁻ᶻ)

As can be seen, the Logistic Function is a Sigmoid function: Sigmoid functions are S-shaped, and the Logistic Function is their most important representative. It also plays a major role in perceptron theory.

Using the Logistic Function, z can be converted to a value in the (0, 1) interval. Additionally, g(z) has a convenient derivative property:

g′(z) = g(z)(1 − g(z))
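The derivative property above can be checked numerically with a minimal sketch (the probe point z and tolerance are illustrative choices):

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Check the identity g'(z) = g(z) * (1 - g(z)) against a central difference.
z, h = 0.5, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # numerical derivative
analytic = sigmoid(z) * (1 - sigmoid(z))               # closed-form derivative
print(abs(numeric - analytic) < 1e-9)                  # the two agree
```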

Substituting z = wᵀx + b into the Logistic Regression expression gives:

y = 1 / (1 + e^−(wᵀx + b))

Rearranging further gives:

ln(y / (1 − y)) = wᵀx + b

Since y + (1 − y) = 1, y and (1 − y) can be treated as a pair of positive/negative possibilities: y is the probability that sample x is positive, and (1 − y) is the probability that x is negative. Their ratio is:

y / (1 − y)

This ratio is called the odds, and it reflects the relative likelihood of sample x being positive. Taking the logarithm of the odds gives the “log odds” (logit):

ln(y / (1 − y))

Thus the formula above actually uses the prediction of a linear regression model to approximate the log odds of the true labels, which is why the corresponding model is called “Logistic Regression”. Note that although the name contains “regression”, it is essentially a classification learning method.
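As a small numeric illustration of odds and log odds (the probability value is made up):

```python
import math

p = 0.8                        # predicted probability that sample x is positive
odds = p / (1 - p)             # relative likelihood of positive vs negative
logit = math.log(odds)         # log odds; LR approximates this with w^T x + b
# The logistic function recovers the probability from the logit:
p_back = 1.0 / (1.0 + math.exp(-logit))
print(round(odds, 6), round(logit, 6), round(p_back, 6))
```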

This method has many advantages: it directly models the classification probability, so it not only predicts the category but also yields an approximate probability, which is useful for tasks that need probability-aided decisions. Meanwhile the objective function derived from the Logistic Function is a convex function that is differentiable to any order and has many useful mathematical properties, so many numerical optimization algorithms can be applied directly to find the optimal solution. Next we will use gradient descent to solve it; first, a brief introduction to gradient descent theory.

Gradient Descent

When solving for the model parameters of machine learning algorithms, i.e., unconstrained optimization problems, gradient descent is one of the most commonly used methods.

Gradient

In calculus, taking the partial derivative of a multivariate function with respect to each parameter and writing the partial derivatives as a vector yields the gradient. Geometrically, the gradient points in the direction of fastest increase of the function, so moving along the gradient vector makes it easier to find the function's maximum; conversely, the function decreases fastest in the direction opposite to the gradient vector, which makes it easier to find the minimum.

Gradient Descent / Gradient Ascent

In machine learning algorithms, when minimizing a loss function, gradient descent can be used to solve iteratively, obtaining the minimized loss function and the model parameter values. Conversely, if the maximum of a function is needed, gradient ascent is used. The two methods can be transformed into each other: minimizing a loss function f with gradient descent is equivalent to maximizing −f with gradient ascent.

Algorithm Detailed Explanation

First, an intuitive explanation of gradient descent: imagine standing somewhere on a mountain, not knowing the way down, and deciding to take one step at a time. At each position, solve for the gradient and take one step downhill along the negative gradient direction, the steepest descent; then solve for the gradient again at the new position and take another step along the steepest downhill direction.

Keep going down until you believe you have reached the foot of the mountain. You might not reach the foot, but instead stop at some local low point. From this explanation, gradient descent may not find the global optimum; it may settle for a local optimum. However, if the loss function is convex, gradient descent is guaranteed to reach the global optimum.
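The step-by-step descent above can be sketched on a simple convex function (a minimal sketch; the function, starting point, learning rate, and tolerance are illustrative):

```python
def grad_descent_1d(grad, x0, lr=0.1, tol=1e-8, max_iter=10000):
    """Take repeated steps along the negative gradient until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = lr * grad(x)     # step length = learning rate * gradient
        x -= step               # move in the negative gradient direction
        if abs(step) < tol:     # stop once progress is negligible
            break
    return x

# f(x) = (x - 3)^2 is convex, so descent reaches the global minimum x = 3.
x_min = grad_descent_1d(lambda x: 2 * (x - 3), x0=10.0)
print(round(x_min, 4))  # 3.0
```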

Before detailed gradient descent algorithm, first look at related concepts:

  • Step Size (Learning Rate): The step size determines how far to move along the negative gradient direction in each iteration. In the mountain analogy, the step size is the length of the step taken along the steepest downhill direction.
  • Hypothesis Function: In supervised learning, the function used to fit the input samples, written as ŷ; e.g., for m samples with a single feature, the fitting function can be ŷ = w0 + w1x.
  • Loss Function: To evaluate how well the model fits, a loss function is usually used to measure the degree of fit. Minimizing the loss function means the best fit, and the corresponding parameters are the optimal parameters. In linear regression, the loss function is usually the squared difference between the sample output and the hypothesis function. For m samples, the linear regression loss function is:

J(w) = ½ Σᵢ (ŷᵢ − yᵢ)²

The gradient descent algorithm is more concise when expressed with matrices, and its implementation logic is more straightforward; this requires some background in matrix analysis, especially matrix differentiation.
  • Precondition: Confirm the hypothesis function and loss function of the model to be optimized. For linear regression, the hypothesis is ŷ = w0 + w1x1 + … + wnxn; in matrix form:

ŷ = Xw

Where the hypothesis ŷ is an m×1 vector, w is an (n+1)×1 vector holding the n+1 model parameters, and X is an m×(n+1) matrix; m is the sample count and n+1 is the feature count (including the constant term). The loss function expression is:

J(w) = ½ (Xw − Y)ᵀ(Xw − Y)

Where Y is the vector of sample outputs, with dimension m×1.

  • Algorithm parameter initialization: The w vector can be initialized to default values or tuned values; the termination distance ε and the step size α are initialized (e.g., α = 1) and tuned during optimization.
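The matrix-form procedure above can be sketched as follows (a minimal NumPy sketch; the sample data, step size α, and termination distance ε are made-up illustration values):

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.05, eps=1e-8, max_iter=100000):
    """Minimize J(w) = 1/2 (Xw - Y)^T (Xw - Y) by gradient descent."""
    w = np.zeros(X.shape[1])              # initialize w to default values
    for _ in range(max_iter):
        grad = X.T @ (X @ w - Y)          # gradient of J with respect to w
        w_new = w - alpha * grad          # step along the negative gradient
        if np.linalg.norm(w_new - w) < eps:   # termination distance reached
            return w_new
        w = w_new
    return w

# Single feature: X has a column of ones (for w0) and one feature column (w1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])        # generated from y = 1 + 2x
w = gradient_descent(X, Y)
print(np.round(w, 4))                     # close to [1. 2.]
```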

Gradient Algorithm Tuning

When using gradient algorithm, what needs tuning?

  • Step size selection: In the algorithm description above, the step size is taken as 1, but the actual value depends on the data samples. Try several values, from large to small, run the algorithm, and observe the iteration effect. If the loss function keeps decreasing, the chosen value is effective; otherwise, decrease the step size.
  • Initial value selection: Different initial values may lead to different minimums, so gradient descent may only find a local minimum. To reduce this risk, run the algorithm multiple times with different initial values, observe the resulting loss function minimums, and select the initial value that minimizes the loss function.
  • Standardization: Because feature value ranges differ across samples, iteration may be slow. To reduce the impact of feature scale, features can be standardized: for each feature x, calculate its expectation mean(x) and standard deviation std(x), then transform:

x′ = (x − mean(x)) / std(x)

The new expectation is 0 and the new variance is 1; the feature becomes dimensionless, and convergence speed can increase greatly.
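The transform above can be sketched on one feature column (the feature values are made up):

```python
import numpy as np

# A made-up feature column whose scale differs greatly from other features.
x = np.array([120.0, 150.0, 80.0, 200.0, 95.0])

# Subtract the expectation and divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

print(x_std.mean(), x_std.var())  # mean is 0 and variance is 1 (up to rounding)
```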

Error Quick Reference

  • Symptom: Gradient descent cannot converge. Root cause: step size too large, causing gradient updates to overshoot the optimal solution. Fix: decrease the step size or use an adaptive step size (such as the Adam optimizer).
  • Symptom: Model parameter optimization is inaccurate. Root cause: unsuitable initial parameters, falling into a local optimum. Fix: try different initial parameters and run multiple times to verify the results.
  • Symptom: Convergence is still slow after standardization. Root cause: feature standardization not done properly. Fix: confirm the standardization method and ensure the expectation and variance are calculated correctly.