Big Data 215 - sklearn KMeans Attributes and Evaluation
TL;DR
Scenario: Using sklearn's KMeans for clustering; want to explain centroids and loss (inertia) and use metrics to select K.
Conclusion: inertia_ is only "smaller is better" and is not comparable across datasets; for K selection, look for the silhouette_score peak; fix the idxmin/variable-name mix-up in the code.
Output: A reusable template covering attribute explanations, the K-selection curve, version differences, and error troubleshooting.
sklearn KMeans Implementation
cluster.cluster_centers_
centroid = cluster.cluster_centers_
centroid
centroid.shape
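The attribute access above needs a fitted model. A minimal runnable sketch, assuming a make_blobs toy dataset and a 3-cluster fit (X and the cluster count are not defined in this excerpt, so both are assumptions):

```python
# Minimal sketch: fit KMeans on a toy dataset and inspect the centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data; the original X is not shown in this note
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

centroid = cluster.cluster_centers_   # one row per cluster
print(centroid.shape)                 # (n_clusters, n_features) -> (3, 2) here
```

cluster_centers_ always has shape (n_clusters, n_features), which is a quick sanity check that the fit matched your expectations.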
cluster.inertia_
View the total within-cluster sum of squared distances:
inertia = cluster.inertia_
inertia
What happens to Inertia if we change the number of clusters to 4?
n_clusters = 4
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_
And if changed to 5?
n_clusters = 5
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_
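The K=3/4/5 experiments above can be collapsed into one loop to see the pattern directly: inertia keeps dropping as K grows, which is why it cannot pick K on its own. A sketch, again assuming a make_blobs dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data standing in for the note's X
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Inertia decreases monotonically as K increases, so by itself it
# only supports the elbow heuristic, not a direct "best K" readout.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (3, 4, 5)}
print(inertias)
```

Plotting these values over a wider K range gives the classic elbow curve.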
Clustering Algorithm Model Evaluation: Silhouette Score
Unlike classification and regression, evaluating a clustering model is not straightforward. Model evaluation is nonetheless one of the most important steps in cluster analysis.
How to Measure Clustering Effectiveness
Clustering produces no ground-truth labels, and its results are not deterministic; how good a clustering is depends on business and algorithmic requirements. The silhouette score is a common label-free measure of clustering quality.
Silhouette Score
The silhouette score is defined per sample and measures two things at once:
- a: how tightly the sample sits in its own cluster, i.e. the mean distance between the sample and all other points in the same cluster (cohesion; smaller means more similar)
- b: how far the sample is from other clusters, i.e. the mean distance between the sample and all points in the nearest neighboring cluster (separation)
Silhouette score for a single sample:
s = (b - a) / max(a, b)
The silhouette score ranges over [-1, 1]:
- Close to 1: the sample is much closer to its own cluster than to any other cluster
- Close to 0: the sample lies on or near the boundary between two clusters; the two clusters may effectively be one
- Negative: the sample is closer to a neighboring cluster than to its own, i.e. it is likely assigned to the wrong cluster
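A tiny worked example of the formula, using hypothetical distances (the numbers are made up, not taken from any dataset):

```python
# Hypothetical distances for one sample (illustration only)
a = 0.2   # mean distance to other points in the sample's own cluster
b = 0.8   # mean distance to points in the nearest other cluster

s = (b - a) / max(a, b)
print(s)  # 0.75 -> the sample sits well inside its cluster
```

Flipping the values (a = 0.8, b = 0.2) gives s = -0.75, the "probably misassigned" case.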
sklearn Silhouette Score
from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples
# Calculate mean silhouette score for all samples
silhouette_score(X, cluster.labels_)
# Calculate silhouette score for each sample
silhouette_samples(X, cluster.labels_)
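The two functions are directly related: silhouette_score is the mean of silhouette_samples. A sketch assuming a make_blobs dataset, since X is not defined in this excerpt:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Assumed toy data standing in for the note's X
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

mean_score = silhouette_score(X, cluster.labels_)    # one number for the model
per_sample = silhouette_samples(X, cluster.labels_)  # one number per sample
print(np.isclose(mean_score, per_sample.mean()))
```

The per-sample values are what let you spot split or overlapping clusters that a healthy-looking mean can hide.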
Observe Silhouette Score Under Different K
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# List to store silhouette scores
score = []
# Use cluster_ for the temporary fits so the original model `cluster` is not overwritten
for i in range(2, 100):
    cluster_ = KMeans(n_clusters=i, random_state=0).fit(X)
    score.append(silhouette_score(X, cluster_.labels_))
# Plot the silhouette score curve
plt.plot(range(2, 100), score)
# Mark the K with the HIGHEST silhouette score (idxmax, not idxmin; +2 offsets the range start)
plt.axvline(pd.DataFrame(score).idxmax()[0] + 2, ls=':')
plt.title('Silhouette Score vs Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()
Important Notes
inertia_ Characteristics
- Decreases monotonically as K increases
- Has no fixed scale or bound, so values are not comparable across datasets
- Strongly depends on feature scale and dimension
- Use mainly for Elbow Method reference
Silhouette Score Characteristics
- Range: [-1, 1], higher is better
- Per-sample values (silhouette_samples) expose problems the mean hides, e.g. split clusters, overlap, and outliers
- More reliable than inertia for K selection
- Computationally more expensive
K Selection Best Practice
- Generate candidate K values
- Calculate silhouette_score for each K
- Choose K with highest silhouette_score
- Validate with business requirements
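The four steps above can be sketched as a small helper; `best_k` is a hypothetical function name and the candidate range is an assumption, not part of the original note:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k(X, candidates=range(2, 11)):
    """Return (K with highest silhouette score, all scores) -- a sketch."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

# Assumed toy data; validate the chosen K against business needs afterwards
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
k, scores = best_k(X)
print(k, scores[k])
```

The metric only proposes a K; the final step in the list (business validation) still has to be done by a human.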
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| silhouette_score error or abnormal result | Variable mix-up: cluster vs cluster_ | Unify variable names |
| K selection line falls on “worst point” instead of “best point” | Used idxmin() | Change to idxmax() |
| Same data different results on different machines/versions | sklearn 1.4+ n_init defaults to ‘auto’ | Explicitly specify n_init, init, random_state |
| inertia very large, elbow not obvious | Features not standardized/large scale differences | Standardize/normalize first |
| silhouette_score calculation very slow | K scan range too large | Narrow K search range |
| Many negative silhouette values | K inappropriate/cluster overlap severe | Adjust K, change feature engineering |
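For the standardization row in the table, a minimal sketch of the fix: scale features before KMeans so one large-scale feature cannot dominate the Euclidean distances and the inertia. The synthetic two-scale data is an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data with wildly different feature scales (assumption)
rng = np.random.default_rng(0)
X = np.c_[rng.normal(0, 1, 300), rng.normal(0, 1000, 300)]

# Standardize to zero mean / unit variance per feature, then fit
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)  # now on a scale where the elbow is interpretable
```

Without scaling, the second feature (std 1000) would account for nearly all of the distance, making the first feature irrelevant to the clustering.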