Basic Introduction
Logistic regression is a classification model in machine learning. Despite the word "regression" in its name, it is a classification algorithm, and thanks to its simplicity and efficiency it is widely used in practice.
Application Scenarios
- Ad click-through rate prediction (will a user click or not)
- Spam detection (spam or not)
- Disease diagnosis (diseased or not)
- Financial fraud detection (fraud or not)
- Fake account detection (fake or not)
The examples above share a common characteristic: each is a judgment between two categories. Logistic regression is a powerful tool for solving such binary classification problems.
Logistic Regression Principle
To master logistic regression, you must understand two points:
- What is the input value in logistic regression
- How to determine the output of logistic regression
Input Function
The input of logistic regression is the output of linear regression: z = w₁x₁ + w₂x₂ + ⋯ + wₙxₙ + b.
Activation Function
Sigmoid function: g(z) = 1 / (1 + e^(−z)), which squashes any real-valued input into the interval (0, 1).
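A minimal Scala sketch of these two pieces, independent of the Spark example later on; the weights, inputs, and bias here are made-up illustrative values:

```scala
object SigmoidDemo {
  // Linear part: z = w·x + b
  def linear(w: Array[Double], x: Array[Double], b: Double): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

  // Sigmoid squashes any real number into (0, 1)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    val w = Array(0.4, -0.2)
    val x = Array(1.0, 3.0)
    val z = linear(w, x, b = 0.1) // 0.4*1 - 0.2*3 + 0.1 = -0.1
    println(f"z = $z%.2f, sigmoid(z) = ${sigmoid(z)}%.4f")
  }
}
```

Note that sigmoid(0) = 0.5, which is exactly why 0.5 is the natural default threshold.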
Judgment Criteria
- Feed the linear regression result into the sigmoid function
- Output: a probability value in the interval (0, 1); the default threshold is 0.5
The final classification is determined by the probability of belonging to one of the two categories; that category is marked 1 by default, and the other is marked 0.
Output interpretation: suppose there are two categories A and B, and the probability value is the probability of belonging to category A (label 1). If a sample fed into logistic regression outputs 0.55, the probability exceeds the 0.5 threshold, so the prediction is category A (1). Conversely, if the output is 0.3, the prediction is category B (0).
The threshold can be changed. In the example above, raising the threshold to 0.6 would make the output 0.55 fall into category B instead.
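The threshold rule can be sketched in a few lines of Scala; the `classify` helper and the probability values below are illustrative, not part of any library API:

```scala
object ThresholdDemo {
  // Map the probability of belonging to category A (label 1) to a class label.
  def classify(prob: Double, threshold: Double = 0.5): Int =
    if (prob >= threshold) 1 else 0

  def main(args: Array[String]): Unit = {
    println(classify(0.55))                  // 1: category A under the default 0.5 threshold
    println(classify(0.30))                  // 0: category B
    println(classify(0.55, threshold = 0.6)) // 0: raising the threshold flips the decision
  }
}
```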
Loss and Optimization
The loss of logistic regression is the log-likelihood (cross-entropy) loss:

cost(hθ(x), y) = Σᵢ [ −yᵢ·log(hθ(xᵢ)) − (1 − yᵢ)·log(1 − hθ(xᵢ)) ]

Separated by category, for a single sample:

cost(hθ(x), y) = −log(hθ(x)) when y = 1
cost(hθ(x), y) = −log(1 − hθ(x)) when y = 0

where y is the true label and hθ(x) is the predicted probability.
How should the single-sample formula be understood? Look at the graph of the log function: on (0, 1], −log(t) falls from +∞ toward 0 as t approaches 1.
In all cases, we want the loss function value to be as small as possible. Case by case:
- When y = 1, the closer hθ(x) is to 1, the smaller −log(hθ(x)) becomes
- When y = 0, the closer hθ(x) is to 0, the smaller −log(1 − hθ(x)) becomes
- The complete loss function combines both cases into the single expression above
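The two cases can be checked numerically with a short Scala sketch of the single-sample loss; the probability values 0.9 and 0.1 are illustrative:

```scala
object LogLossDemo {
  // Log-likelihood (cross-entropy) loss for one sample:
  // cost = -[ y * log(h) + (1 - y) * log(1 - h) ]
  def logLoss(y: Double, h: Double): Double =
    -(y * math.log(h) + (1 - y) * math.log(1 - h))

  def main(args: Array[String]): Unit = {
    // y = 1: the closer h is to 1, the smaller the loss
    println(f"y=1, h=0.9 -> ${logLoss(1.0, 0.9)}%.4f")
    println(f"y=1, h=0.1 -> ${logLoss(1.0, 0.1)}%.4f")
    // y = 0: the closer h is to 0, the smaller the loss
    println(f"y=0, h=0.1 -> ${logLoss(0.0, 0.1)}%.4f")
  }
}
```

Because the two terms are gated by y and (1 − y), exactly one of them is active for any given sample, which is why the combined expression covers both cases.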
Optimization Logic
As with linear regression, a gradient descent optimizer is used to reduce the loss function value. Each step updates the weight parameters of the linear part, which raises the predicted probability for samples that truly belong to category 1 and lowers it for samples that belong to category 0.
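One update step can be sketched by hand. For the log loss, the gradient with respect to weight wⱼ is the average of (hθ(x) − y)·xⱼ over the batch. The tiny two-sample dataset and the learning rate below are illustrative only; real training would use MLlib as in the case test:

```scala
object GradientStepDemo {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // One batch gradient-descent update: w := w - lr * mean((h - y) * x)
  def step(w: Array[Double], xs: Array[Array[Double]],
           ys: Array[Double], lr: Double): Array[Double] = {
    val n = xs.length
    val grads = w.indices.map { j =>
      xs.zip(ys).map { case (x, y) =>
        val h = sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum)
        (h - y) * x(j)
      }.sum / n
    }
    w.zip(grads).map { case (wj, g) => wj - lr * g }
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(Array(1.0, 2.0), Array(1.0, -1.0)) // first column acts as a bias term
    val ys = Array(1.0, 0.0)
    var w = Array(0.0, 0.0)
    for (_ <- 1 to 200) w = step(w, xs, ys, lr = 0.5)
    // After training, the class-1 sample gets probability > 0.5, the class-0 sample < 0.5
    println(w.map(v => f"$v%.3f").mkString("w = [", ", ", "]"))
  }
}
```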
Case Test
Data Preparation
wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv -O pima.csv
Write Code
package icu.wzk.logic

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object LogicTest {
  def main(args: Array[String]): Unit = {
    // ① Run Demo in local mode; change master for production
    val conf = new SparkConf()
      .setAppName("LogisticRegression-RDD")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // ② Parse the CSV: first 8 columns are features, the 9th is the label
    val raw = sc.textFile("pima.csv")
    val points = raw.map { line =>
      val cols = line.split(",").map(_.toDouble)
      LabeledPoint(cols(8), Vectors.dense(cols.slice(0, 8)))
    }.cache()

    // ③ Train-test split
    val Array(train, test) = points.randomSplit(Array(0.8, 0.2), seed = 42)

    // ④ Train LR+SGD with 100 iterations
    val model = LogisticRegressionWithSGD.train(train, numIterations = 100)

    // ⑤ Predict + simple accuracy
    val predictAndLabel = test.map(p => (model.predict(p.features), p.label))
    val accuracy = predictAndLabel.filter { case (p, l) => p == l }.count().toDouble / test.count()
    predictAndLabel.foreach { case (p, l) => println(s"pred=$p\tlabel=$l") }
    println(f"accuracy = $accuracy%.4f")

    sc.stop()
  }
}
Code Explanation:
- Read each line of data from the local file pima.csv.
- Each line is comma-separated; its fields are converted to a Double array cols.
- Each line is assumed to have 9 values: the first 8 are features, and the 9th (cols(8)) is the label.
- A LabeledPoint object is constructed; this is Spark MLlib's format for a training sample (features plus label).
- .cache() keeps the parsed data in memory to speed up the iterative training that follows.
- The dataset is randomly split into 80% training and 20% testing.
- A fixed random seed (42) makes the split reproducible across runs.
- LogisticRegressionWithSGD (logistic regression trained with stochastic gradient descent) is fit on the training set.
- The iteration count is set to 100, so the model takes 100 gradient steps toward the optimum.
- Each test sample is predicted, yielding a tuple of (predicted value, actual label).
- A simple equality comparison determines whether each prediction is correct.
- Accuracy is the proportion of correct predictions among all test samples.
- Each prediction and its label are printed to the terminal for manual comparison.
- The final accuracy is printed with 4 decimal places.
Final accuracy is approximately: 76.62%