Big Data 212 - K-Means Clustering Guide

Basic Concepts

This article introduces K-Means clustering, covering key differences between supervised and unsupervised learning, and how clustering is applied in engineering scenarios like customer segmentation, RFM analysis, image/voice/video compression, and anomaly detection.

Difference Between Supervised and Unsupervised Learning

  • Supervised Learning: Requires labeled data (Y), with the model learning the mapping from features to labels
  • Unsupervised Learning: No labels required, finding patterns or structures in data automatically

Clustering is one of the main tasks in unsupervised learning, used to divide data into meaningful groups (clusters).

K-Means Basic Principles

Key Concepts: Clusters and Centroids

  • Cluster: A collection of similar data points
  • Centroid: The center point of all data points in a cluster

Algorithm Objective

Through iterative optimization, K-Means makes data points within the same cluster as close to each other as possible, while keeping data points in different clusters as far apart as possible.

Determining K Value

  • Elbow Method: Plot Inertia vs K, find the “elbow point” where decrease slows
  • Silhouette Score: Measures cluster cohesion and separation (higher is better)
  • Business Constraints: Choose K based on actual business requirements

Working Process

  1. Initialize K centroids: Randomly select K data points as initial centroids
  2. Assign each data point: Calculate distance from each point to all centroids, assign to nearest cluster
  3. Recalculate centroids: Compute mean of all points in each cluster as new centroid
  4. Iterate until convergence: Repeat steps 2-3 until centroids no longer change or change minimally
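The four steps above can be sketched in a minimal NumPy implementation (an illustrative sketch, not a production implementation; the function name `kmeans` and the empty-cluster guard are our own additions):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = centroids.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members):  # guard against an empty cluster
                new_centroids[c] = members.mean(axis=0)
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs, this converges in a handful of iterations to one centroid per blob.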

Within-Cluster Sum of Squares (Inertia)

Inertia measures the total squared distance from each point to its assigned centroid:

Inertia = Σ_{c=1..K} Σ_{x_i ∈ cluster c} ||x_i − μ_c||²

Where:

  • First sum is over all clusters
  • Second sum is over all points in each cluster
  • μ_c is the centroid of cluster c

Lower Inertia indicates better clustering (points closer to centroids), but it has limitations:

  • Not comparable across different datasets
  • Decreases monotonically with increasing K
  • Strongly dependent on feature scale and dimensionality
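The formula above translates directly into code. The following sketch (the helper name `inertia` is our own) computes the total squared distance from each point to its assigned centroid:

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(sum(((X[labels == c] - mu) ** 2).sum()
                     for c, mu in enumerate(centroids)))
```

For example, four points at x = 0, 2, 10, 12 split into two clusters with centroids at x = 1 and x = 11 give a squared distance of 1 per point, for a total inertia of 4.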

Choosing K Value in Practice

Elbow Method

Plot Inertia vs number of clusters K. The curve typically shows a rapid decrease initially, then flattens out. The “elbow point” where the decrease slows is often a good choice for K.
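A typical elbow-method loop looks like the following sketch (assuming scikit-learn is available; plotting is omitted, but `inertias` is exactly the series you would plot against K):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means for a range of K and record the inertia of each fit
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# Inertia drops sharply up to K=4, then flattens: K=4 is the elbow.
```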

Silhouette Score

Measures both cluster cohesion and separation:

  • Range: [-1, 1]
  • Higher is better
  • Evaluated over a range of K values, the K with the highest mean score is a strong candidate; it may disagree with the elbow method when the elbow is ambiguous
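Scanning K by silhouette score can be sketched as follows (assuming scikit-learn; the fixed blob centers are our own illustrative choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=7)

# Silhouette score is only defined for K >= 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # K with the highest mean silhouette
```

On this data the score peaks at K=3: merging two blobs (K=2) hurts cohesion, while splitting one (K=4) hurts separation.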

Business Constraints

Sometimes business requirements dictate K:

  • Customer segmentation into 3-5 groups
  • Product categorization into specific number of categories

K-Means Limitations

K-Means has several sensitivities:

  1. Initialization Sensitivity: Different initial centroids lead to different results
  2. Scale/Dimension Sensitivity: Features must be normalized/standardized
  3. Outlier Sensitivity: Outliers can significantly affect centroid calculation
  4. Non-convex Clusters: Cannot handle clusters with complex shapes well
  5. Must Specify K: Requires knowing or estimating the number of clusters upfront
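Two of these sensitivities have standard mitigations, sketched below (assuming scikit-learn; the income/age features are a hypothetical example of mismatched scales):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales: income (~tens of thousands) and age
X = np.column_stack([rng.normal(60_000, 15_000, 200),
                     rng.normal(40, 12, 200)])

# Mitigation for scale sensitivity: standardize features first,
# otherwise income dominates every distance calculation
X_scaled = StandardScaler().fit_transform(X)

# Mitigation for initialization sensitivity: k-means++ seeding plus
# multiple restarts (n_init), keeping the run with the lowest inertia
km = KMeans(n_clusters=3, init="k-means++", n_init=10,
            random_state=0).fit(X_scaled)
```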

Error Quick Reference

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Points all assigned to one cluster | Initialization resulted in poor centroids | Try a different random_state; use k-means++ initialization |
| Inertia very large, elbow not obvious | Features not standardized; large scale differences | Standardize/normalize features first |
| Different results each run | Random initialization | Fix random_state; increase n_init |
| Poor cluster separation | K value inappropriate for data distribution | Use silhouette score to find optimal K |
| Clusters too small or too large | K not matching data structure | Adjust K based on business requirements |

Answer

Based on the article content, for the question "which of the following is NOT a disadvantage of K-Means":

  • a. Lower efficiency on large datasets (a disadvantage)
  • b. High time complexity (a disadvantage)
  • c. (option text incomplete in the source; by elimination, the answer is c)