Big Data 215 - sklearn KMeans Attributes and Evaluation

TL;DR

Scenario: You are clustering with sklearn's KMeans and want to interpret the centroids and the loss, then use a metric to choose K.

Conclusion: inertia_ is "smaller is better" only within one dataset and is not comparable across datasets or feature scales; for K selection, look for the peak of silhouette_score; fix the idxmin and variable-name mix-ups in the code.

Output: Reusable template of “attribute explanation + K selection curve + version differences + error troubleshooting”.

sklearn KMeans Implementation

cluster.cluster_centers_

# Coordinates of the fitted centroids, one row per cluster
centroid = cluster.cluster_centers_
centroid
# Shape is (n_clusters, n_features)
centroid.shape
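The snippet above relies on a `cluster` model and a dataset `X` fitted earlier. As a self-contained sketch, the example below uses synthetic data from make_blobs as a stand-in (the names X_demo and km_demo, the sample counts, and n_init=10 are assumptions for illustration, not the article's data). It shows the shape of cluster_centers_ and compares each centroid with the mean of its assigned samples, which should agree closely once the algorithm has converged.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 500 samples, 2 features, 4 blobs
X_demo, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

centers = km_demo.cluster_centers_
print(centers.shape)  # (n_clusters, n_features)

# Compare each centroid with the mean of the samples assigned to it.
# They match only up to the convergence tolerance, so this prints both
# rather than asserting exact equality.
for k in range(3):
    member_mean = X_demo[km_demo.labels_ == k].mean(axis=0)
    print(k, np.round(centers[k], 3), np.round(member_mean, 3))
```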

cluster.inertia_

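View the total within-cluster sum of squared distances (each sample's squared distance to its assigned centroid, summed over all samples):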

inertia = cluster.inertia_
inertia

What happens to Inertia if we change the number of clusters to 4?

n_clusters = 4
# Use distinct names so this model is not confused with the original `cluster`
cluster_4 = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_4 = cluster_4.inertia_
inertia_4

And if changed to 5?

n_clusters = 5
# Use distinct names so this model is not confused with the original `cluster`
cluster_5 = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_5 = cluster_5.inertia_
inertia_5
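To make the effect of K on inertia concrete, here is a minimal sketch on synthetic data (X_demo, the K range 2..7, and n_init=10 are assumptions for illustration). It also recomputes inertia from its definition, the sum of squared distances from each sample to its assigned centroid, to show what the attribute measures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

inertias = {}
manual = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = km.inertia_
    # Definition: squared distance of every sample to its assigned centroid, summed
    assigned_centers = km.cluster_centers_[km.labels_]
    manual[k] = ((X_demo - assigned_centers) ** 2).sum()
    print(k, round(inertias[k], 2), round(manual[k], 2))
```

On data like this, inertia keeps shrinking as K grows, which is why a small inertia alone does not say that a given K is a good choice.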

Clustering Algorithm Model Evaluation: Silhouette Score

Unlike classification and regression models, clustering models are not simple to evaluate, yet model evaluation is one of the most important steps in cluster analysis.

How to Measure Clustering Effectiveness

Clustering produces no ground-truth labels to compare against, and whether a result is good depends on the business or algorithm requirements. One widely used label-free measure of clustering effectiveness is the silhouette score.

Silhouette Score

Silhouette score is defined for each sample and combines two quantities:

  • a (cohesion): how close the sample is to its own cluster, measured as the average distance between the sample and all other points in the same cluster
  • b (separation): how far the sample is from other clusters, measured as the average distance between the sample and all points in the nearest other cluster

Silhouette score for a single sample:

s = (b - a) / max(a, b)

Silhouette score range is [-1, 1]:

  • Silhouette score close to 1: Sample is very similar to samples in its own cluster and not similar to samples in other clusters
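  • Silhouette score close to 0: Sample is about equally close to its own cluster and the nearest other cluster (a ≈ b), so it lies near the boundary; if many samples score near 0, the two clusters overlap and may really be one cluster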
  • Negative silhouette: Sample point is more similar to samples outside its cluster
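The formula s = (b - a) / max(a, b) can be checked by hand. The sketch below computes a and b for individual samples on synthetic data and compares the result with sklearn's silhouette_samples. The helper silhouette_of and the data X_demo are hypothetical names introduced for illustration, and Euclidean distance is assumed (the default metric in sklearn).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

def silhouette_of(i, X, labels):
    """Silhouette of sample i, computed directly from the definition."""
    dist = pairwise_distances(X[i:i + 1], X)[0]  # distances from sample i to all samples
    own = labels[i]
    same = labels == own
    same[i] = False                              # exclude the sample itself
    if not same.any():                           # singleton cluster: score defined as 0
        return 0.0
    a = dist[same].mean()                        # mean distance within own cluster
    b = min(dist[labels == k].mean()             # mean distance to nearest other cluster
            for k in np.unique(labels) if k != own)
    return (b - a) / max(a, b)

reference = silhouette_samples(X_demo, labels)
for i in (0, 1, 2):
    print(i, silhouette_of(i, X_demo, labels), reference[i])
```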

sklearn Silhouette Score

from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples

# Calculate mean silhouette score for all samples
silhouette_score(X, cluster.labels_)

# Calculate silhouette score for each sample
silhouette_samples(X, cluster.labels_)

Observe Silhouette Score Under Different K

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Define list to store scores
score = []

# Fit KMeans for each candidate K and record the mean silhouette_score
# (a separate name, km, avoids overwriting the earlier fitted `cluster`)
for i in range(2, 100):
    km = KMeans(n_clusters=i, random_state=0).fit(X)
    score.append(silhouette_score(X, km.labels_))

# Plot silhouette_score change curve
plt.plot(range(2, 100), score)

# Find the K with the maximum silhouette_score and mark it with a vertical line
# (list index 0 corresponds to K=2, hence the +2)
plt.axvline(pd.DataFrame(score).idxmax()[0] + 2, ls=':')

plt.title('Silhouette Score vs Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()

Important Notes

inertia_ Characteristics

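  • Generally decreases as K increases (it reaches 0 when every sample is its own cluster), so a smaller value alone does not mean a better K
  • Has no fixed upper bound, so values are not comparable across datasets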
  • Strongly depends on feature scale and dimension
  • Use mainly for Elbow Method reference
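The scale dependence listed above can be seen directly. In this sketch (synthetic data, a hypothetical helper fit_inertia, and a stretch factor of 100 chosen purely for illustration), multiplying one feature by 100 inflates inertia by orders of magnitude, while standardizing with StandardScaler bounds it by n_samples × n_features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=1)

def fit_inertia(X):
    """Fit a 3-cluster KMeans and return its inertia (hypothetical helper)."""
    return KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_

inertia_raw = fit_inertia(X_demo)

# Same points, but the second feature is expressed in a 100x larger unit
X_stretched = X_demo * np.array([1.0, 100.0])
inertia_stretched = fit_inertia(X_stretched)

# Standardized features: zero mean and unit variance per column
X_std = StandardScaler().fit_transform(X_demo)
inertia_std = fit_inertia(X_std)

print(inertia_raw, inertia_stretched, inertia_std)
```

Because the absolute value changes with the units of the features, inertia is best compared only between runs on the same preprocessed data.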

Silhouette Score Characteristics

  • Range: [-1, 1], higher is better
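  • Per-sample values can reveal problems that the mean hides (e.g., a cluster that should be split, overlapping clusters, outliers)
  • Usually a more informative guide for K selection than inertia, because it does not automatically improve as K grows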
  • Computationally more expensive
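One way to look beyond the mean is to break the per-sample values from silhouette_samples down by cluster. The following is a minimal sketch on synthetic data (X_demo and the choice of 4 clusters are assumptions); it reports each cluster's size, mean silhouette, and share of negative values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=400, n_features=2, centers=3, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

sil = silhouette_samples(X_demo, labels)
overall = silhouette_score(X_demo, labels)  # equals the mean of the per-sample values

print("overall mean:", round(overall, 3))
for k in np.unique(labels):
    vals = sil[labels == k]
    # A low cluster mean or a high negative share flags a weak cluster
    print(k, len(vals), round(vals.mean(), 3), round((vals < 0).mean(), 3))
```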

K Selection Best Practice

  1. Generate candidate K values
  2. Calculate silhouette_score for each K
  3. Choose K with highest silhouette_score
  4. Validate with business requirements
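Steps 1 to 3 above can be sketched as a small helper. The function name select_k, the candidate range, and the synthetic data are assumptions made for illustration; step 4, validation against business requirements, stays a manual judgment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def select_k(X, candidates, random_state=0):
    """Return (best_k, scores) where scores maps each K to its mean silhouette."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)  # highest silhouette wins
    return best_k, scores

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=400, n_features=2, centers=4, random_state=1)
best_k, scores = select_k(X_demo, range(2, 9))
print(best_k, {k: round(v, 3) for k, v in scores.items()})
```

Keeping the candidate range small keeps the scan cheap, since silhouette_score needs pairwise distances between samples.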

Error Quick Reference

Symptom | Root Cause | Fix
silhouette_score error or abnormal result | Variable mix-up: cluster vs cluster_ | Unify variable names
K selection line falls on "worst point" instead of "best point" | Used idxmin() | Change to idxmax()
Same data, different results on different machines/versions | sklearn 1.4+ n_init defaults to 'auto' | Explicitly specify n_init, init, random_state
inertia very large, elbow not obvious | Features not standardized/large scale differences | Standardize/normalize first
silhouette_score calculation very slow | K scan range too large | Narrow K search range
Many negative silhouette values | K inappropriate/cluster overlap severe | Adjust K, change feature engineering
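For the version-difference entry above, a minimal sketch of pinning the relevant parameters looks like this. The specific values (init="k-means++", n_init=10, random_state=0) and the synthetic data are illustrative assumptions, not recommendations from the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=300, n_features=2, centers=4, random_state=1)

# Spell out every setting that affects initialization instead of relying on defaults
params = dict(n_clusters=4, init="k-means++", n_init=10, random_state=0)

km_a = KMeans(**params).fit(X_demo)
km_b = KMeans(**params).fit(X_demo)

# Two fits with identical explicit settings should agree
print(km_a.inertia_, km_b.inertia_)
```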