Big Data 212 - K-Means Clustering Guide
Basic Concepts
This article introduces K-Means clustering, covering key differences between supervised and unsupervised learning, and how clustering is applied in engineering scenarios like customer segmentation, RFM analysis, image/voice/video compression, and anomaly detection.
Difference Between Supervised and Unsupervised Learning
- Supervised Learning: Requires labeled data (Y), with the model learning the mapping from features to labels
- Unsupervised Learning: No labels required, finding patterns or structures in data automatically
Clustering is one of the main tasks in unsupervised learning, used to divide data into meaningful groups (clusters).
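To make the contrast concrete, here is a minimal sketch (using scikit-learn and toy data invented for illustration) of the unsupervised setting: the model sees only the features X, never any label, yet still recovers the two groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unsupervised: only features X, no labels y
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

# KMeans discovers the two groups without ever being told any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # two distinct cluster ids, e.g. [1 1 0 0]
```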
K-Means Basic Principles
Key Concepts: Clusters and Centroids
- Cluster: A group of similar data points
- Centroid: The mean position of all data points in a cluster
Algorithm Objective
Through iterative optimization, make data points within the same cluster as close as possible to each other, while data points in different clusters are as far apart as possible.
Determining K Value
- Elbow Method: Plot Inertia vs K, find the “elbow point” where decrease slows
- Silhouette Score: Measures cluster cohesion and separation (higher is better)
- Business Constraints: Choose K based on actual business requirements
Working Process
- Initialize K centroids: Randomly select K data points as initial centroids
- Assign each data point: Calculate distance from each point to all centroids, assign to nearest cluster
- Recalculate centroids: Compute mean of all points in each cluster as new centroid
- Iterate until convergence: Repeat steps 2-3 until centroids no longer change or change minimally
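The four steps above can be sketched as a short NumPy implementation (a teaching sketch, not a production implementation; it assumes distinct data points and does not handle empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, 2)` on two well-separated blobs returns one label per point and the two final centroids.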
Within-Cluster Sum of Squares (Inertia)
Inertia measures the total squared distance from each point to its assigned centroid:
Inertia = Σ_c Σ_{i∈c} ||x_i − μ_c||²

Where:
- The outer sum is over all clusters c
- The inner sum is over all points x_i assigned to cluster c
- μ_c is the centroid of cluster c
Lower Inertia indicates better clustering (points closer to centroids), but it has limitations:
- Not comparable across different datasets
- Decreases monotonically with increasing K
- Strongly dependent on feature scale and dimensionality
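The formula can be checked by hand on a toy example (data invented for illustration): four points in two clusters, each point exactly one unit from its centroid, so the total inertia is 4 × 1² = 4:

```python
import numpy as np

# Toy data with a fixed cluster assignment
X = np.array([[1., 2.], [1., 4.], [10., 2.], [10., 4.]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == c].mean(axis=0) for c in (0, 1)])

# Inertia = sum over clusters, then over points, of squared distance to centroid
inertia = sum(np.sum((X[labels == c] - centroids[c]) ** 2) for c in (0, 1))
print(inertia)  # each of the 4 points is 1 unit from its centroid -> 4.0
```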
Choosing K Value in Practice
Elbow Method
Plot Inertia vs number of clusters K. The curve typically shows a rapid decrease initially, then flattens out. The “elbow point” where the decrease slows is often a good choice for K.
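A minimal sketch of the elbow computation, using scikit-learn and synthetic data (three blobs, invented for illustration): inertia drops sharply up to K = 3, the true number of blobs, then flattens:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, well-separated blobs of 50 points each
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])

# Fit K-Means for each candidate K and record the inertia
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
for k, v in inertias.items():
    print(k, round(v, 1))
# The large drop from K=2 to K=3 and the flat tail afterwards form the
# "elbow" at K=3; in practice you would plot this curve and read it off.
```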
Silhouette Score
Measures both cluster cohesion and separation:
- Range: [-1, 1]
- Higher is better
- The K that maximizes the mean silhouette can differ from the elbow point, making it a useful cross-check
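A sketch of silhouette-based K selection with scikit-learn's `silhouette_score`, on the same kind of synthetic three-blob data (invented for illustration; note the score is undefined for K = 1, so the scan starts at 2):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three synthetic, well-separated blobs of 40 points each
X = np.vstack([rng.normal(loc, 0.3, size=(40, 2))
               for loc in ([0, 0], [4, 4], [0, 4])])

# Mean silhouette score for each candidate K (must be >= 2)
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0)
                              .fit_predict(X))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3, matching the true number of blobs
```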
Business Constraints
Sometimes business requirements dictate K:
- Customer segmentation into 3-5 groups
- Product categorization into specific number of categories
K-Means Limitations
K-Means has several sensitivities:
- Initialization Sensitivity: Different initial centroids lead to different results
- Scale/Dimension Sensitivity: Features must be normalized/standardized
- Outlier Sensitivity: Outliers can significantly affect centroid calculation
- Non-convex Clusters: Performs poorly on non-spherical clusters or clusters of very different sizes and densities
- Must Specify K: Requires knowing or estimating the number of clusters upfront
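The first three sensitivities have standard mitigations, sketched below on hypothetical data invented for illustration: feature 0 carries the cluster signal, while feature 1 is pure noise on a huge scale that would dominate raw Euclidean distances. Standardizing fixes the scale problem, and `k-means++` seeding plus multiple restarts (`n_init`) reduce initialization sensitivity:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0: cluster signal (means 0 vs 5, std 1)
# Feature 1: large-scale noise (std 1000) with no cluster signal
X = np.vstack([
    np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)]),
    np.column_stack([rng.normal(5, 1, 200), rng.normal(0, 1000, 200)]),
])

# Standardize so every feature contributes comparably to the distance
X_scaled = StandardScaler().fit_transform(X)

# k-means++ seeding and multiple restarts reduce initialization sensitivity
labels = KMeans(n_clusters=2, init="k-means++", n_init=10,
                random_state=0).fit_predict(X_scaled)
```

On the raw data, distances are dominated by the noisy large-scale feature; after standardization the two groups are recovered almost perfectly.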
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Points all assigned to one cluster | Initialization resulted in poor centroids | Try different random_state; use k-means++ initialization |
| Inertia very large, elbow not obvious | Features not standardized, large scale differences | Standardize/normalize features first |
| Different results each run | Random initialization | Fix random_state; increase n_init |
| Poor cluster separation | K value inappropriate for data distribution | Use silhouette score to find optimal K |
| Clusters too small or too large | K not matching data structure | Adjust K based on business requirements |
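The "different results each run" fix from the table can be sketched directly (toy data invented for illustration): fixing `random_state` makes runs repeatable, and `n_init` restarts keep the best of several initializations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [3, 3])])

# Same random_state -> identical labels on every run;
# n_init=10 keeps the best of 10 random initializations
a = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
b = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
print((a == b).all())  # True
```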