Introduction

Logistic regression is a classification model in machine learning: despite having “regression” in its name, it solves classification problems. Thanks to its simplicity and efficiency, it is widely used in practice.

Application Scenarios

  • Ad click-through rate prediction
  • Spam email identification
  • Disease diagnosis
  • Financial fraud detection
  • Fake account detection

Looking at the examples above, you can see a common characteristic: they all involve judgment between two categories. Logistic regression is the go-to tool for solving binary classification problems.

Logistic Regression Principles

To master logistic regression, two key points must be understood:

  • What is the input to logistic regression
  • How to interpret the output of logistic regression

Input Function

The input to logistic regression is the result of a linear regression, i.e., a weighted sum of the features: z = w1·x1 + w2·x2 + … + wn·xn + b.
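As a minimal sketch in plain Scala (no Spark), with made-up weights and bias for illustration — real values are learned during training:

```scala
object LinearInput {
  // Hypothetical weights and bias, for illustration only
  val weights = Array(0.4, -0.2, 0.1)
  val bias = 0.5

  // z = w · x + b: the linear-regression output that logistic regression consumes
  def linear(x: Array[Double]): Double =
    weights.zip(x).map { case (w, xi) => w * xi }.sum + bias

  def main(args: Array[String]): Unit = {
    // 0.4*1.0 - 0.2*2.0 + 0.1*3.0 + 0.5 ≈ 0.8
    println(linear(Array(1.0, 2.0, 3.0)))
  }
}
```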

Activation Function

The Sigmoid function maps any real number into the interval (0, 1):

  g(z) = 1 / (1 + e^(-z))

Decision criteria:

  • The linear regression result z is fed into the Sigmoid function
  • Output: a probability value in the interval (0, 1), compared against a default threshold of 0.5

Logistic regression’s final classification is determined by the probability value of belonging to a certain category. This category is labeled 1 by default, and the other category is labeled 0.

Interpreting output results: suppose there are two categories A and B, and the probability value represents the probability of belonging to category A (1). If logistic regression outputs 0.55 for a sample, this probability exceeds 0.5, so the prediction is category A (1). Conversely, if the output is 0.3, the prediction is category B (0).

The threshold for logistic regression can be changed. For example, if you set the threshold to 0.6, an output of 0.55 would be classified as category B.
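The sigmoid-plus-threshold decision rule described above can be sketched in a few lines of plain Scala (the input 0.2 is an arbitrary example value):

```scala
object SigmoidDemo {
  // Sigmoid squashes any real number into (0, 1)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Classify by comparing the probability against a threshold
  def classify(prob: Double, threshold: Double = 0.5): Int =
    if (prob >= threshold) 1 else 0

  def main(args: Array[String]): Unit = {
    val p = sigmoid(0.2)      // ≈ 0.55
    println(classify(p))      // 1 with the default 0.5 threshold
    println(classify(p, 0.6)) // 0 once the threshold is raised to 0.6
  }
}
```

This mirrors the example in the text: the same 0.55 probability flips from category A (1) to category B (0) when the threshold moves from 0.5 to 0.6.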

Loss and Optimization

The loss in logistic regression is called log-likelihood loss. The formula is:

  cost(hθ(x), y) = Σᵢ [ -yᵢ · log(hθ(xᵢ)) - (1 - yᵢ) · log(1 - hθ(xᵢ)) ]

Separated by category:

  • when y = 1: cost = -log(hθ(x))
  • when y = 0: cost = -log(1 - hθ(x))

Where y is the true label and hθ(x) is the predicted probability.

How should the individual terms be understood? The graph of the log function makes it clear: -log(h) is close to 0 when h is near 1, and grows without bound as h approaches 0.

In all cases, we want the loss function value to be as small as possible.

Discussing by case, to keep the loss small:

  • When y = 1, we want hθ(x) to be as close to 1 as possible, so that -log(hθ(x)) approaches 0
  • When y = 0, we want hθ(x) to be as close to 0 as possible, so that -log(1 - hθ(x)) approaches 0
  • Combining the two cases yields the complete loss function
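The two cases collapse into a single expression, sketched here in plain Scala (the probability values 0.9 and 0.1 are arbitrary examples):

```scala
object LogLoss {
  // Per-sample log-likelihood loss:
  //   y = 1 → -log(h)      (small when h is near 1)
  //   y = 0 → -log(1 - h)  (small when h is near 0)
  def loss(y: Double, h: Double): Double =
    -(y * math.log(h) + (1 - y) * math.log(1 - h))

  def main(args: Array[String]): Unit = {
    println(loss(1.0, 0.9)) // confident and correct  → small loss (≈ 0.105)
    println(loss(1.0, 0.1)) // confident but wrong    → large loss (≈ 2.303)
  }
}
```

Because y is either 0 or 1, exactly one of the two terms survives for each sample, which is why the combined formula and the per-category forms are the same thing.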

Optimization Logic

Logistic regression likewise uses the gradient descent optimization algorithm to reduce the loss value. Each update of the weight parameters increases the predicted probability for samples that truly belong to class 1 and decreases it for samples that truly belong to class 0.
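A minimal sketch of this update rule for a single training sample, in plain Scala (the sample, label, learning rate, and iteration count are all made up for illustration; Spark's implementation averages gradients over the whole RDD):

```scala
object GradientStep {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // One gradient-descent step on the log loss for one sample:
  //   w := w - lr * (h - y) * x
  // where h = sigmoid(w · x) is the current predicted probability
  def step(w: Array[Double], x: Array[Double], y: Double, lr: Double): Array[Double] = {
    val h = sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum)
    w.zip(x).map { case (wi, xi) => wi - lr * (h - y) * xi }
  }

  def main(args: Array[String]): Unit = {
    var w = Array(0.0, 0.0)
    val x = Array(1.0, 2.0) // one training sample with label 1
    for (_ <- 1 to 100) w = step(w, x, 1.0, lr = 0.1)
    val h = sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum)
    println(f"P(class 1) after training = $h%.4f") // climbs toward 1
  }
}
```

With label y = 1, repeated steps push the predicted probability up; with y = 0 the same rule pushes it down, which is exactly the behavior described above.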

Case Study

Data Preparation

wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv -O pima.csv

Code Implementation

package icu.wzk.logic

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}


object LogicTest {
  def main(args: Array[String]): Unit = {

    // ① Local mode for demo; change master for production
    val conf = new SparkConf()
      .setAppName("LogisticRegression-RDD")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    // ② Load the CSV and parse each line: first 8 columns are features, the 9th is the label
    val raw = sc.textFile("pima.csv")
    val points = raw.map { line =>
      val cols = line.split(",").map(_.toDouble)
      LabeledPoint(cols(8), Vectors.dense(cols.slice(0, 8)))
    }.cache()

    // ③ Train-test split
    val Array(train, test) = points.randomSplit(Array(0.8, 0.2), seed = 42)

    // ④ Train LR+SGD with 100 iterations
    val model = LogisticRegressionWithSGD.train(train, numIterations = 100)

    // ⑤ Predict + simple accuracy
    val predictAndLabel = test.map(p => (model.predict(p.features), p.label))
    val accuracy = predictAndLabel.filter { case (p, l) => p == l }.count().toDouble / test.count()

    predictAndLabel.foreach { case (p, l) => println(s"pred=$p\tlabel=$l") }
    println(f"accuracy = $accuracy%.4f")

    sc.stop()
  }
}

Code explanation:

  • Reads each line of data from the local file pima.csv.
  • Each line is comma-separated and converted to an array cols.
  • Assumes each line has 9 values: the first 8 are features, the 9th (cols(8)) is the label.
  • Constructs LabeledPoint objects, the format for training samples in Spark MLlib (containing features and label).
  • .cache() caches the data in memory to speed up subsequent training.
  • Splits the entire dataset randomly into 80% (training) + 20% (testing).
  • Uses a fixed random seed of 42 to ensure consistent results across runs.
  • Uses LogisticRegressionWithSGD (stochastic gradient descent-based logistic regression) to train on the training set.
  • Sets the number of iterations to 100; the model trains for 100 iteration steps to approach the optimal solution.
  • Predicts each sample in the test set, returning a (predicted value, actual label) tuple.
  • Uses simple equality comparison to determine if the prediction is correct.
  • Calculates the ratio of correct predictions to total test samples to get prediction accuracy.
  • Outputs each prediction result and corresponding label to the terminal for manual comparison.
  • Prints the final accuracy to 4 decimal places.

Final accuracy is approximately: 76.62%