Big Data 212 - K-Means Clustering Guide
Basic Concepts
This article introduces K-Means clustering, covering key differences between supervised and unsupervised learning, and how clustering is applied in engineering scenarios like customer segmentation, RFM analysis, image/voice/video compression, and anomaly detection.
Difference Between Supervised and Unsupervised Learning
- Supervised Learning: Requires labeled data (Y), with the model learning the mapping from features to labels
- Unsupervised Learning: No labels required, finding patterns or structures in data automatically
Clustering is one of the main tasks in unsupervised learning, used to divide data into meaningful groups (clusters).
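To make the contrast concrete, here is a minimal sketch (using scikit-learn and toy data invented for illustration) of the unsupervised setting: the model sees only the features X, never any label, yet still recovers the two groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unsupervised: only features X, no labels y
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

# KMeans discovers the two groups without ever being told any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # two distinct cluster ids, e.g. [1 1 0 0]
```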
K-Means Basic Principles
Key Concepts: Clusters and Centroids
- Cluster: A group of similar data points
- Centroid: The mean position of all data points in a cluster
Algorithm Objective
Through iterative optimization, make data points within the same cluster as close as possible to each other, while data points in different clusters are as far apart as possible.
Determining K Value
- Elbow Method: Plot Inertia vs K, find the “elbow point” where decrease slows
- Silhouette Score: Measures cluster cohesion and separation (higher is better)
- Business Constraints: Choose K based on actual business requirements
Working Process
- Initialize K centroids: Randomly select K data points as initial centroids
- Assign each data point: Calculate distance from each point to all centroids, assign to nearest cluster
- Recalculate centroids: Compute mean of all points in each cluster as new centroid
- Iterate until convergence: Repeat steps 2-3 until centroids no longer change or change minimally
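The four steps above can be sketched as a short NumPy implementation (a teaching sketch, not a production implementation; it assumes distinct data points and does not handle empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, 2)` on two well-separated blobs returns one label per point and the two final centroids.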
Within-Cluster Sum of Squares (Inertia)
Inertia measures the total squared distance from each point to its assigned centroid:
Inertia = Σ_c Σ_{i∈c} ||x_i − μ_c||²

Where:
- The outer sum is over all clusters c
- The inner sum is over all points x_i assigned to cluster c
- μ_c is the centroid of cluster c
Lower Inertia indicates better clustering (points closer to centroids), but it has limitations:
- Not comparable across different datasets
- Decreases monotonically with increasing K
- Strongly dependent on feature scale and dimensionality
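The formula can be checked by hand on a toy example (data invented for illustration): four points in two clusters, each point exactly one unit from its centroid, so the total inertia is 4 × 1² = 4:

```python
import numpy as np

# Toy data with a fixed cluster assignment
X = np.array([[1., 2.], [1., 4.], [10., 2.], [10., 4.]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == c].mean(axis=0) for c in (0, 1)])

# Inertia = sum over clusters, then over points, of squared distance to centroid
inertia = sum(np.sum((X[labels == c] - centroids[c]) ** 2) for c in (0, 1))
print(inertia)  # each of the 4 points is 1 unit from its centroid -> 4.0
```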
Choosing K Value in Practice
Elbow Method
Plot Inertia vs number of clusters K. The curve typically shows a rapid decrease initially, then flattens out. The “elbow point” where the decrease slows is often a good choice for K.
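A minimal sketch of the elbow computation, using scikit-learn and synthetic data (three blobs, invented for illustration): inertia drops sharply up to K = 3, the true number of blobs, then flattens:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, well-separated blobs of 50 points each
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])

# Fit K-Means for each candidate K and record the inertia
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
for k, v in inertias.items():
    print(k, round(v, 1))
# The large drop from K=2 to K=3 and the flat tail afterwards form the
# "elbow" at K=3; in practice you would plot this curve and read it off.
```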
Silhouette Score
Measures both cluster cohesion and separation:
- Range: [-1, 1]
- Higher is better
- The K that maximizes the mean silhouette can differ from the elbow point, making it a useful cross-check
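A sketch of silhouette-based K selection with scikit-learn's `silhouette_score`, on the same kind of synthetic three-blob data (invented for illustration; note the score is undefined for K = 1, so the scan starts at 2):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three synthetic, well-separated blobs of 40 points each
X = np.vstack([rng.normal(loc, 0.3, size=(40, 2))
               for loc in ([0, 0], [4, 4], [0, 4])])

# Mean silhouette score for each candidate K (must be >= 2)
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0)
                              .fit_predict(X))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3, matching the true number of blobs
```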
Business Constraints
Sometimes business requirements dictate K:
- Customer segmentation into 3-5 groups
- Product categorization into specific number of categories
K-Means Limitations
K-Means has several sensitivities:
- Initialization Sensitivity: Different initial centroids lead to different results
- Scale/Dimension Sensitivity: Features must be normalized/standardized
- Outlier Sensitivity: Outliers can significantly affect centroid calculation
- Non-convex Clusters: Performs poorly on non-spherical clusters or clusters of very different sizes and densities
- Must Specify K: Requires knowing or estimating the number of clusters upfront
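The first three sensitivities have standard mitigations, sketched below on hypothetical data invented for illustration: feature 0 carries the cluster signal, while feature 1 is pure noise on a huge scale that would dominate raw Euclidean distances. Standardizing fixes the scale problem, and `k-means++` seeding plus multiple restarts (`n_init`) reduce initialization sensitivity:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0: cluster signal (means 0 vs 5, std 1)
# Feature 1: large-scale noise (std 1000) with no cluster signal
X = np.vstack([
    np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)]),
    np.column_stack([rng.normal(5, 1, 200), rng.normal(0, 1000, 200)]),
])

# Standardize so every feature contributes comparably to the distance
X_scaled = StandardScaler().fit_transform(X)

# k-means++ seeding and multiple restarts reduce initialization sensitivity
labels = KMeans(n_clusters=2, init="k-means++", n_init=10,
                random_state=0).fit_predict(X_scaled)
```

On the raw data, distances are dominated by the noisy large-scale feature; after standardization the two groups are recovered almost perfectly.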
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Points all assigned to one cluster | Initialization resulted in poor centroids | Try different random_state; use k-means++ initialization |
| Inertia very large, elbow not obvious | Features not standardized, large scale differences | Standardize/normalize features first |
| Different results each run | Random initialization | Fix random_state; increase n_init |
| Poor cluster separation | K value inappropriate for data distribution | Use silhouette score to find optimal K |
| Clusters too small or too large | K not matching data structure | Adjust K based on business requirements |
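The "different results each run" fix from the table can be sketched directly (toy data invented for illustration): fixing `random_state` makes runs repeatable, and `n_init` restarts keep the best of several initializations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [3, 3])])

# Same random_state -> identical labels on every run;
# n_init=10 keeps the best of 10 random initializations
a = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
b = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
print((a == b).all())  # True
```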