Basic Introduction
Logistic regression is a classification model in machine learning. Despite the word "regression" in its name, it is a classification algorithm, and thanks to its simplicity and efficiency it is widely used in practice.
Application Scenarios
- Ad click-through rate prediction (will a user click or not)
- Spam detection (spam or not)
- Disease diagnosis (diseased or not)
- Financial fraud detection (fraud or not)
- Fake account detection (fake or not)
The examples above share a common characteristic: each is a judgment between two categories. Logistic regression is a powerful tool for solving such binary classification problems.
Logistic Regression Principle
To master logistic regression, you must understand two points:
- What is the input value in logistic regression
- How to determine the output of logistic regression
Input Function
The input of logistic regression is the output of linear regression: z = w₁x₁ + w₂x₂ + ⋯ + wₙxₙ + b.
Activation Function
Sigmoid function: g(z) = 1 / (1 + e^(−z)), which squashes any real-valued input into the interval (0, 1).
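A minimal Scala sketch of these two pieces, independent of the Spark example later on; the weights, inputs, and bias here are made-up illustrative values:

```scala
object SigmoidDemo {
  // Linear part: z = w·x + b
  def linear(w: Array[Double], x: Array[Double], b: Double): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

  // Sigmoid squashes any real number into (0, 1)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    val w = Array(0.4, -0.2)
    val x = Array(1.0, 3.0)
    val z = linear(w, x, b = 0.1) // 0.4*1 - 0.2*3 + 0.1 = -0.1
    println(f"z = $z%.2f, sigmoid(z) = ${sigmoid(z)}%.4f")
  }
}
```

Note that sigmoid(0) = 0.5, which is exactly why 0.5 is the natural default threshold.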
Judgment Criteria
- Feed the linear regression result into the sigmoid function
- Output: a probability value in the interval (0, 1); the default threshold is 0.5
The final classification is determined by the probability of belonging to one of the two categories; that category is marked 1 by default, and the other is marked 0.
Output interpretation: suppose there are two categories A and B, and the probability value is the probability of belonging to category A (label 1). If a sample fed into logistic regression outputs 0.55, the probability exceeds the 0.5 threshold, so the prediction is category A (1). Conversely, if the output is 0.3, the prediction is category B (0).
The threshold can be changed. In the example above, raising the threshold to 0.6 would make the output 0.55 fall into category B instead.
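The threshold rule can be sketched in a few lines of Scala; the `classify` helper and the probability values below are illustrative, not part of any library API:

```scala
object ThresholdDemo {
  // Map the probability of belonging to category A (label 1) to a class label.
  def classify(prob: Double, threshold: Double = 0.5): Int =
    if (prob >= threshold) 1 else 0

  def main(args: Array[String]): Unit = {
    println(classify(0.55))                  // 1: category A under the default 0.5 threshold
    println(classify(0.30))                  // 0: category B
    println(classify(0.55, threshold = 0.6)) // 0: raising the threshold flips the decision
  }
}
```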
Loss and Optimization
The loss of logistic regression is the log-likelihood (cross-entropy) loss:

cost(hθ(x), y) = Σᵢ [ −yᵢ·log(hθ(xᵢ)) − (1 − yᵢ)·log(1 − hθ(xᵢ)) ]

Separated by category, for a single sample:

cost(hθ(x), y) = −log(hθ(x)) when y = 1
cost(hθ(x), y) = −log(1 − hθ(x)) when y = 0

where y is the true label and hθ(x) is the predicted probability.
How should the single-sample formula be understood? Look at the graph of the log function: on (0, 1], −log(t) falls from +∞ toward 0 as t approaches 1.
In all cases, we want the loss function value to be as small as possible. Case by case:
- When y = 1, the closer hθ(x) is to 1, the smaller −log(hθ(x)) becomes
- When y = 0, the closer hθ(x) is to 0, the smaller −log(1 − hθ(x)) becomes
- The complete loss function combines both cases into the single expression above
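The two cases can be checked numerically with a short Scala sketch of the single-sample loss; the probability values 0.9 and 0.1 are illustrative:

```scala
object LogLossDemo {
  // Log-likelihood (cross-entropy) loss for one sample:
  // cost = -[ y * log(h) + (1 - y) * log(1 - h) ]
  def logLoss(y: Double, h: Double): Double =
    -(y * math.log(h) + (1 - y) * math.log(1 - h))

  def main(args: Array[String]): Unit = {
    // y = 1: the closer h is to 1, the smaller the loss
    println(f"y=1, h=0.9 -> ${logLoss(1.0, 0.9)}%.4f")
    println(f"y=1, h=0.1 -> ${logLoss(1.0, 0.1)}%.4f")
    // y = 0: the closer h is to 0, the smaller the loss
    println(f"y=0, h=0.1 -> ${logLoss(0.0, 0.1)}%.4f")
  }
}
```

Because the two terms are gated by y and (1 − y), exactly one of them is active for any given sample, which is why the combined expression covers both cases.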
Optimization Logic
As with linear regression, a gradient descent optimizer is used to reduce the loss function value. Each step updates the weight parameters of the linear part, which raises the predicted probability for samples that truly belong to category 1 and lowers it for samples that belong to category 0.
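One update step can be sketched by hand. For the log loss, the gradient with respect to weight wⱼ is the average of (hθ(x) − y)·xⱼ over the batch. The tiny two-sample dataset and the learning rate below are illustrative only; real training would use MLlib as in the case test:

```scala
object GradientStepDemo {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // One batch gradient-descent update: w := w - lr * mean((h - y) * x)
  def step(w: Array[Double], xs: Array[Array[Double]],
           ys: Array[Double], lr: Double): Array[Double] = {
    val n = xs.length
    val grads = w.indices.map { j =>
      xs.zip(ys).map { case (x, y) =>
        val h = sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum)
        (h - y) * x(j)
      }.sum / n
    }
    w.zip(grads).map { case (wj, g) => wj - lr * g }
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(Array(1.0, 2.0), Array(1.0, -1.0)) // first column acts as a bias term
    val ys = Array(1.0, 0.0)
    var w = Array(0.0, 0.0)
    for (_ <- 1 to 200) w = step(w, xs, ys, lr = 0.5)
    // After training, the class-1 sample gets probability > 0.5, the class-0 sample < 0.5
    println(w.map(v => f"$v%.3f").mkString("w = [", ", ", "]"))
  }
}
```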
Case Test
Data Preparation
wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv -O pima.csv
Write Code
package icu.wzk.logic

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object LogicTest {
  def main(args: Array[String]): Unit = {
    // ① Run Demo in local mode; change master for production
    val conf = new SparkConf()
      .setAppName("LogisticRegression-RDD")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // ② Parse the CSV: first 8 columns are features, the 9th is the label
    val raw = sc.textFile("pima.csv")
    val points = raw.map { line =>
      val cols = line.split(",").map(_.toDouble)
      LabeledPoint(cols(8), Vectors.dense(cols.slice(0, 8)))
    }.cache()

    // ③ Train-test split
    val Array(train, test) = points.randomSplit(Array(0.8, 0.2), seed = 42)

    // ④ Train LR+SGD with 100 iterations
    val model = LogisticRegressionWithSGD.train(train, numIterations = 100)

    // ⑤ Predict + simple accuracy
    val predictAndLabel = test.map(p => (model.predict(p.features), p.label))
    val accuracy = predictAndLabel.filter { case (p, l) => p == l }.count().toDouble / test.count()
    predictAndLabel.foreach { case (p, l) => println(s"pred=$p\tlabel=$l") }
    println(f"accuracy = $accuracy%.4f")

    sc.stop()
  }
}
Code Explanation:
- Read each line of data from the local file pima.csv.
- Each line is comma-separated; its fields are converted to a Double array cols.
- Each line is assumed to have 9 values: the first 8 are features, and the 9th (cols(8)) is the label.
- A LabeledPoint object is constructed; this is Spark MLlib's format for a training sample (features plus label).
- .cache() keeps the parsed data in memory to speed up the iterative training that follows.
- The dataset is randomly split into 80% training and 20% testing.
- A fixed random seed (42) makes the split reproducible across runs.
- LogisticRegressionWithSGD (logistic regression trained with stochastic gradient descent) is fit on the training set.
- The iteration count is set to 100, so the model takes 100 gradient steps toward the optimum.
- Each test sample is predicted, yielding a tuple of (predicted value, actual label).
- A simple equality comparison determines whether each prediction is correct.
- Accuracy is the proportion of correct predictions among all test samples.
- Each prediction and its label are printed to the terminal for manual comparison.
- The final accuracy is printed with 4 decimal places.
Final accuracy is approximately: 76.62%