Big Data 216 - KMeans n_clusters Selection
TL;DR
Scenario: KMeans does not choose the number of clusters for you, and different initializations can make results drift between runs.
Conclusion: Compare candidate values of k with the silhouette score, and use k-means++ with a reasonable n_init to stabilize results; note that since scikit-learn 1.4, n_init defaults to 'auto'.
Output: A reusable silhouette-analysis chart template + initialization parameter rules + a quick reference for common errors and version differences.
Case: Selecting n_clusters Based on Silhouette Score
Write Code
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

# Assumes X is an already-defined dataset of shape (n_samples, 2)
# Loop over candidate cluster counts: 2, 4, 6, 8
for i in range(2, 10, 2):
    # Create figure and subplots
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # Set first subplot x and y axis ranges
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, X.shape[0] + (i + 1) * 10])
    # Build the KMeans model and fit the data
    # (n_init set explicitly; the default became 'auto' in scikit-learn 1.4)
    clusterer = KMeans(n_clusters=i, n_init=10, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    # Mean silhouette score over all samples, and per-sample values
    silhouette_avg = silhouette_score(X, cluster_labels)
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    print(f"Number of clusters = {i}, Mean silhouette score = {silhouette_avg}")
    y_lower = 10
    for j in range(i):
        # Collect and sort silhouette values for the j-th cluster
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == j]
        ith_cluster_silhouette_values.sort()
        size_cluster_j = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_j
        color = cm.nipy_spectral(float(j) / i)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          ith_cluster_silhouette_values,
                          facecolor=color, alpha=0.5)
        # Label the cluster number at the middle of its band
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_j, str(j))
        y_lower = y_upper + 10  # leave 10 blank rows between clusters
    ax1.set_title("Silhouette Plot for Different Clusters")
    ax1.set_xlabel("Silhouette Coefficient Value")
    ax1.set_ylabel("Cluster Label")
    # Red dashed line marks the mean silhouette score
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    # Subplot 2: scatter plot of the clustered data
    colors = cm.nipy_spectral(cluster_labels.astype(float) / i)
    ax2.scatter(X[:, 0], X[:, 1], marker='o', s=8, c=colors)
    # Draw cluster centroids
    centers = clusterer.cluster_centers_
    ax2.scatter(centers[:, 0], centers[:, 1], marker='x', c="red", s=200)
    ax2.set_title("Cluster Data Visualization")
    ax2.set_xlabel("Feature Space of First Feature")
    ax2.set_ylabel("Feature Space of Second Feature")
    plt.suptitle(f"KMeans Silhouette Analysis for Sample Data -- Number of Clusters: {i}",
                 fontsize=14, fontweight='bold')
    plt.show()
```
Important Parameters - Initial Centroid Selection
init
Accepts "k-means++", "random", or an array.
- Method for initializing centroids; default is "k-means++"
- "k-means++": a smart seeding method that spreads initial cluster centers apart to speed up convergence
- "random": picks n_clusters observations at random as the initial centroids
- If an array is passed, its shape must be (n_clusters, n_features) and it supplies the initial centroids directly
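A minimal sketch of the three accepted forms of init, using a small synthetic dataset from make_blobs (the dataset, cluster counts, and seed values here are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed synthetic dataset: 150 samples, 2 features, 3 true blobs
X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=42)

# 1) Default smart seeding
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)

# 2) Purely random initial centroids
km_rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=42).fit(X)

# 3) Explicit centroid array of shape (n_clusters, n_features);
#    n_init=1, since the starting point is fully determined by the array
seed_centroids = X[:3]  # any (3, 2) array of starting centroids works
km_arr = KMeans(n_clusters=3, init=seed_centroids, n_init=1).fit(X)

print(km_pp.inertia_, km_rand.inertia_, km_arr.inertia_)
```

Passing an explicit array pins down the starting point, which is useful when you want to warm-start from centroids found on an earlier batch of data.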
n_init
Integer or 'auto'; default 10 before scikit-learn 1.4, 'auto' since 1.4.
- Number of times the KMeans algorithm is run with different centroid seeds
- The final result is chosen by inertia: the run with the lowest inertia wins
- i.e., the best output among n_init independent runs
random_state
Controls randomness, used for reproducing results.
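To see why n_init matters, a quick sketch comparing single random initializations against a best-of-10 run (the dataset from make_blobs and the seed values are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed dataset: 5 overlapping blobs, where local optima are easy to hit
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=0)

# One random initialization per seed: inertia can drift between runs
single = [
    KMeans(n_clusters=5, init="random", n_init=1, random_state=s).fit(X).inertia_
    for s in range(5)
]

# Best of 10 initializations: KMeans keeps the run with the lowest inertia
multi = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X).inertia_

print("single-run inertias:", single)
print("best-of-10 inertia:", multi)
```

The spread among the single-run inertias is exactly the "result drift" from the TL;DR; raising n_init trades compute time for a more reliable optimum.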
Important Parameters - Iteration Stopping
- max_iter (maximum iterations):
  - Default is 300
  - Caps the number of iterations in a single run, so the algorithm always terminates
- tol (tolerance):
  - Default is 1e-4
  - Convergence threshold: a run stops early once the cluster centers move less than this tolerance (relative to the Frobenius norm of the center shift) between two consecutive iterations
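A short sketch of the stopping parameters in action; the fitted model's n_iter_ attribute reports how many iterations the best run actually used (the make_blobs dataset and seeds are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed dataset: 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Explicit stopping criteria: hard cap via max_iter, early stop via tol
km = KMeans(n_clusters=4, max_iter=300, tol=1e-4, n_init=10, random_state=1).fit(X)

# On well-separated data, convergence happens far below the max_iter cap
print("iterations used:", km.n_iter_)
```

If n_iter_ regularly hits max_iter, either raise max_iter or loosen tol; otherwise the defaults rarely need touching.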
Understanding Silhouette Analysis Charts
The silhouette analysis chart consists of two parts:
- Silhouette plot (left): shows the silhouette coefficient of each sample
  - Each horizontal band represents one cluster
  - The thickness of a band reflects the number of samples in that cluster
  - The red dashed line marks the mean silhouette score
- Cluster scatter plot (right): shows the actual cluster distribution
  - Points are colored by cluster
  - X markers indicate the cluster centroids
Interpreting Results
- All clusters should have bars extending past the red dashed line
- Bars should have similar widths (balanced clusters)
- No cluster should have predominantly negative values
- When the k with the highest mean silhouette does not match visually sensible clusters, check for:
  - A single natural cluster split across several labels
  - Overlap between neighboring clusters
  - Outliers pulling the mean score down
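The chart template above is for visual inspection; the same comparison can be done programmatically by scoring each candidate k and keeping the best one. A minimal sketch (the make_blobs dataset, the k range, and the seed are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Assumed dataset: 4 true clusters
X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

# Score every candidate k on the mean silhouette
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest mean silhouette
best_k = max(scores, key=scores.get)
print("best k:", best_k, "score:", scores[best_k])
```

Treat the winner as a candidate, not a verdict: always confirm it against the silhouette plot, since the caveats listed above (splitting, overlap, outliers) can inflate or deflate the mean.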
Version Differences (scikit-learn 1.4+)
Important: Since scikit-learn 1.4 (2024):
- n_init defaults to 'auto' instead of 10
- 'auto' chooses the number of runs based on the init method: 1 run for "k-means++", 10 for "random"
- This may cause different results than code written against older versions
Recommendation: Always explicitly specify n_init for reproducibility:
KMeans(n_clusters=3, n_init=10, random_state=42)
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| ValueError: Number of labels is 1 | Clustering result has only 1 cluster | Check n_clusters and data; ensure n_clusters>=2 |
| Same k, each run result significantly different | Initialization leads to different local optima | Fix random_state; explicitly increase n_init |
| Expected n_init default of 10, but it looks like only one run happened | scikit-learn 1.4+ defaults n_init to 'auto' (a single run for k-means++) | Explicitly set n_init=10/20/… |
| ConvergenceWarning: Number of distinct clusters found smaller than n_clusters | Duplicate points in data/too few discrete values | Count unique samples; reduce n_clusters |
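The first and last rows of the table can be guarded against up front. A sketch of a hypothetical helper (the name safe_silhouette and the sample data are assumptions) that validates n_clusters against the number of unique samples before scoring:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def safe_silhouette(X, n_clusters, random_state=42):
    """Return the mean silhouette score, or None when it cannot be computed.

    Guards against the pitfalls in the table: n_clusters < 2, and
    more clusters requested than there are unique data points.
    """
    n_unique = np.unique(X, axis=0).shape[0]
    if n_clusters < 2 or n_clusters > n_unique:
        return None
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)
    if len(set(labels)) < 2:  # silhouette_score needs >= 2 distinct labels
        return None
    return silhouette_score(X, labels)

# Assumed toy data with duplicate points: only 3 unique samples
X = np.array([[0., 0.], [0., 0.], [1., 1.], [1., 1.], [5., 5.]])
print(safe_silhouette(X, 2))   # scores normally: 2 clusters, 3 unique points
print(safe_silhouette(X, 10))  # None: more clusters than unique points
```

Counting unique rows first is the same diagnosis the table recommends for the ConvergenceWarning row, just automated.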