Big Data 215 - sklearn KMeans Attributes and Evaluation
TL;DR
Scenario: Using sklearn's KMeans for clustering; want to explain centroids and loss (inertia) and use metrics to select K.
Conclusion: inertia_ is only "smaller is better" and is not comparable across datasets; for K selection, look for the silhouette_score peak; fix the idxmin/variable-name mix-up in the code.
Output: A reusable template covering attribute explanations, the K-selection curve, version differences, and error troubleshooting.
sklearn KMeans Implementation
cluster.cluster_centers_
centroid = cluster.cluster_centers_
centroid
centroid.shape
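The attribute access above needs a fitted model. A minimal runnable sketch, assuming a make_blobs toy dataset and a 3-cluster fit (X and the cluster count are not defined in this excerpt, so both are assumptions):

```python
# Minimal sketch: fit KMeans on a toy dataset and inspect the centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data; the original X is not shown in this note
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

centroid = cluster.cluster_centers_   # one row per cluster
print(centroid.shape)                 # (n_clusters, n_features) -> (3, 2) here
```

cluster_centers_ always has shape (n_clusters, n_features), which is a quick sanity check that the fit matched your expectations.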
cluster.inertia_
View the total within-cluster sum of squared distances:
inertia = cluster.inertia_
inertia
What happens to Inertia if we change the number of clusters to 4?
n_clusters = 4
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_
And if changed to 5?
n_clusters = 5
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_
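The K=3/4/5 experiments above can be collapsed into one loop to see the pattern directly: inertia keeps dropping as K grows, which is why it cannot pick K on its own. A sketch, again assuming a make_blobs dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data standing in for the note's X
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Inertia decreases monotonically as K increases, so by itself it
# only supports the elbow heuristic, not a direct "best K" readout.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (3, 4, 5)}
print(inertias)
```

Plotting these values over a wider K range gives the classic elbow curve.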
Clustering Algorithm Model Evaluation: Silhouette Score
Unlike classification and regression, evaluating a clustering model is not straightforward. Model evaluation is nonetheless one of the most important steps in cluster analysis.
How to Measure Clustering Effectiveness
Clustering produces no ground-truth labels, and its results are not deterministic; how good a clustering is depends on business and algorithmic requirements. The silhouette score is a common label-free measure of clustering quality.
Silhouette Score
The silhouette score is defined per sample and measures two things at once:
- a: how tightly the sample sits in its own cluster, i.e. the mean distance between the sample and all other points in the same cluster (cohesion; smaller means more similar)
- b: how far the sample is from other clusters, i.e. the mean distance between the sample and all points in the nearest neighboring cluster (separation)
Silhouette score for a single sample:
s = (b - a) / max(a, b)
The silhouette score ranges over [-1, 1]:
- Close to 1: the sample is much closer to its own cluster than to any other cluster
- Close to 0: the sample lies on or near the boundary between two clusters; the two clusters may effectively be one
- Negative: the sample is closer to a neighboring cluster than to its own, i.e. it is likely assigned to the wrong cluster
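A tiny worked example of the formula, using hypothetical distances (the numbers are made up, not taken from any dataset):

```python
# Hypothetical distances for one sample (illustration only)
a = 0.2   # mean distance to other points in the sample's own cluster
b = 0.8   # mean distance to points in the nearest other cluster

s = (b - a) / max(a, b)
print(s)  # 0.75 -> the sample sits well inside its cluster
```

Flipping the values (a = 0.8, b = 0.2) gives s = -0.75, the "probably misassigned" case.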
sklearn Silhouette Score
from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples
# Calculate mean silhouette score for all samples
silhouette_score(X, cluster.labels_)
# Calculate silhouette score for each sample
silhouette_samples(X, cluster.labels_)
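The two functions are directly related: silhouette_score is the mean of silhouette_samples. A sketch assuming a make_blobs dataset, since X is not defined in this excerpt:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Assumed toy data standing in for the note's X
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

mean_score = silhouette_score(X, cluster.labels_)    # one number for the model
per_sample = silhouette_samples(X, cluster.labels_)  # one number per sample
print(np.isclose(mean_score, per_sample.mean()))
```

The per-sample values are what let you spot split or overlapping clusters that a healthy-looking mean can hide.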
Observe Silhouette Score Under Different K
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# List to store silhouette scores
score = []
# Use cluster_ for the temporary fits so the original model `cluster` is not overwritten
for i in range(2, 100):
    cluster_ = KMeans(n_clusters=i, random_state=0).fit(X)
    score.append(silhouette_score(X, cluster_.labels_))
# Plot the silhouette score curve
plt.plot(range(2, 100), score)
# Mark the K with the HIGHEST silhouette score (idxmax, not idxmin; +2 offsets the range start)
plt.axvline(pd.DataFrame(score).idxmax()[0] + 2, ls=':')
plt.title('Silhouette Score vs Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()
Important Notes
inertia_ Characteristics
- Decreases monotonically as K increases
- Has no fixed scale or bound, so values are not comparable across datasets
- Strongly depends on feature scale and dimension
- Use mainly for Elbow Method reference
Silhouette Score Characteristics
- Range: [-1, 1], higher is better
- Per-sample values (silhouette_samples) expose problems the mean hides, e.g. split clusters, overlap, and outliers
- More reliable than inertia for K selection
- Computationally more expensive
K Selection Best Practice
- Generate candidate K values
- Calculate silhouette_score for each K
- Choose K with highest silhouette_score
- Validate with business requirements
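The four steps above can be sketched as a small helper; `best_k` is a hypothetical function name and the candidate range is an assumption, not part of the original note:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k(X, candidates=range(2, 11)):
    """Return (K with highest silhouette score, all scores) -- a sketch."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

# Assumed toy data; validate the chosen K against business needs afterwards
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
k, scores = best_k(X)
print(k, scores[k])
```

The metric only proposes a K; the final step in the list (business validation) still has to be done by a human.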
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| silhouette_score error or abnormal result | Variable mix-up: cluster vs cluster_ | Unify variable names |
| K selection line falls on “worst point” instead of “best point” | Used idxmin() | Change to idxmax() |
| Same data different results on different machines/versions | sklearn 1.4+ n_init defaults to ‘auto’ | Explicitly specify n_init, init, random_state |
| inertia very large, elbow not obvious | Features not standardized/large scale differences | Standardize/normalize first |
| silhouette_score calculation very slow | K scan range too large | Narrow K search range |
| Many negative silhouette values | K inappropriate/cluster overlap severe | Adjust K, change feature engineering |
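For the standardization row in the table, a minimal sketch of the fix: scale features before KMeans so one large-scale feature cannot dominate the Euclidean distances and the inertia. The synthetic two-scale data is an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data with wildly different feature scales (assumption)
rng = np.random.default_rng(0)
X = np.c_[rng.normal(0, 1, 300), rng.normal(0, 1000, 300)]

# Standardize to zero mean / unit variance per feature, then fit
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)  # now on a scale where the elbow is interpretable
```

Without scaling, the second feature (std 1000) would account for nearly all of the distance, making the first feature irrelevant to the clustering.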