Big Data 216 - KMeans n_clusters Selection

TL;DR

Scenario: KMeans requires the number of clusters up front, and different initializations cause results to drift between runs.

Conclusion: Use the silhouette score to compare candidate values of k, and use k-means++ with a reasonable n_init to stabilize results; note that since scikit-learn 1.4, n_init defaults to 'auto'.

Output: Reusable silhouette analysis chart template + initialization parameter value rules + common error/version difference quick reference.

Case: Selecting n_clusters Based on Silhouette Score

Write Code

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

# Assume X is an already-defined dataset of shape (n_samples, n_features)
# Try candidate cluster counts 2, 4, 6, 8 (range(2, 10, 2) stops before 10)
for i in range(2, 10, 2):
    # Create figure and subplots
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # Set first subplot x and y axis ranges
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, X.shape[0] + (i + 1) * 10])

    # Generate KMeans model and fit data
    clusterer = KMeans(n_clusters=i, n_init=10, random_state=10)  # explicit n_init for reproducibility across scikit-learn versions
    cluster_labels = clusterer.fit_predict(X)

    # Calculate silhouette score
    silhouette_avg = silhouette_score(X, cluster_labels)
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    print(f"Number of clusters = {i}, Mean silhouette score = {silhouette_avg}")

    y_lower = 10
    for j in range(i):
        # Get silhouette values for j-th cluster
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == j]
        ith_cluster_silhouette_values.sort()

        size_cluster_j = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_j

        color = cm.nipy_spectral(float(j) / i)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          ith_cluster_silhouette_values,
                          facecolor=color, alpha=0.5)

        ax1.text(-0.05, y_lower + 0.5 * size_cluster_j, str(j))

        y_lower = y_upper + 10

    ax1.set_title("Silhouette Plot for Different Clusters")
    ax1.set_xlabel("Silhouette Coefficient Value")
    ax1.set_ylabel("Cluster Label")
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])

    # Subplot 2: Scatter plot of clusters
    colors = cm.nipy_spectral(cluster_labels.astype(float) / i)
    ax2.scatter(X[:, 0], X[:, 1], marker='o', s=8, c=colors)

    # Draw cluster centroids
    centers = clusterer.cluster_centers_
    ax2.scatter(centers[:, 0], centers[:, 1], marker='x', c="red", s=200)

    ax2.set_title("Cluster Data Visualization")
    ax2.set_xlabel("Feature Space of First Feature")
    ax2.set_ylabel("Feature Space of Second Feature")

    plt.suptitle(f"KMeans Silhouette Analysis for Sample Data -- Number of Clusters: {i}", fontsize=14, fontweight='bold')
    plt.show()
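The loop above assumes X already exists. A minimal sketch of the same k-selection idea, using a hypothetical synthetic dataset from make_blobs in place of X and picking the candidate k with the highest mean silhouette:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for X: four well-separated 2-D blobs
X, _ = make_blobs(n_samples=500,
                  centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
                  cluster_std=1.0, random_state=10)

scores = {}
for k in range(2, 10, 2):  # same candidates as the loop above: 2, 4, 6, 8
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k whose mean silhouette is highest
best_k = max(scores, key=scores.get)
print(f"Best k by mean silhouette: {best_k}")
```

On clearly separated blobs like these, the true number of clusters (4) wins; on real data the peak is usually less pronounced, which is why the per-cluster plots above are still worth inspecting.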

Important Parameters - Initial Centroid Selection

init

Accepts "k-means++", "random", or an array of shape (n_clusters, n_features).

  • Method for initializing centroids; default is "k-means++"
  • "k-means++" spreads the initial cluster centers apart, which typically speeds up convergence
  • If an array is passed, its shape must be (n_clusters, n_features) and it supplies the initial centroids directly
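A minimal sketch of passing an explicit centroid array as init (the dataset and centroid values here are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with two obvious groups
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 10]],
                  cluster_std=0.5, random_state=0)

# Explicit initial centroids: shape must be (n_clusters, n_features)
initial_centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# With an explicit init array, set n_init=1 (extra restarts would be pointless,
# and scikit-learn warns if n_init > 1 here)
km = KMeans(n_clusters=2, init=initial_centers, n_init=1).fit(X)
print(km.cluster_centers_.round(1))
```

Starting from centroids that are already close to the true group centers, the fit converges in very few iterations.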

n_init

Integer, default 10 (since scikit-learn 1.4 the default is 'auto'; see Version Differences below).

  • Number of times the algorithm is run with different centroid seeds
  • Each run is scored by its inertia (within-cluster sum of squared distances)
  • The best of the n_init runs, by inertia, is returned as the final result

random_state

Controls randomness, used for reproducing results.
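A short sketch of how these two parameters interact in practice: a fixed random_state makes a fit reproducible, and more restarts can only match or improve the best inertia (data below is an assumed synthetic example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Overlapping blobs, where local optima actually differ
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=7)

# Same random_state => the same fitted result (reproducibility)
a = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
b = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
print(a.inertia_, b.inertia_)  # identical for the same random_state

# Best inertia over 10 restarts is never worse than a single restart
single = KMeans(n_clusters=5, n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(multi.inertia_ <= single.inertia_)
```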

Important Parameters - Iteration Stopping

  1. max_iter (Maximum iterations):

    • Default is 300
    • Caps the number of iterations per run, bounding runtime when convergence is slow
  2. tol (Tolerance):

    • Relative tolerance on the change in cluster centers (Frobenius norm) between two consecutive iterations; once the change falls below tol, the run is declared converged
    • Default is 1e-4
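The effect of the iteration cap can be seen directly via the fitted model's n_iter_ attribute. A sketch on an assumed synthetic dataset, comparing a run cut off after one iteration with a run using the defaults:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=3)

# Cut off after a single Lloyd iteration
rough = KMeans(n_clusters=4, max_iter=1, n_init=1, random_state=0).fit(X)

# Defaults: max_iter=300, tol=1e-4 (run until center shift < tol)
fine = KMeans(n_clusters=4, max_iter=300, tol=1e-4, n_init=1, random_state=0).fit(X)

print(rough.n_iter_, fine.n_iter_)   # actual iterations used by each run
print(fine.inertia_ <= rough.inertia_)  # more iterations never increase inertia
```

Because both fits start from the same initialization (same random_state, n_init=1) and each Lloyd iteration can only decrease inertia, the fully converged run is at least as good.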

Understanding Silhouette Analysis Charts

The silhouette analysis chart consists of two parts:

  1. Silhouette Plot (left): Shows silhouette score for each sample

    • Each horizontal band represents one cluster
    • Band thickness represents the number of samples in that cluster
    • The red dashed line marks the mean silhouette score
  2. Cluster Scatter Plot (right): Shows actual cluster distribution

    • Points colored by cluster
    • X marks cluster centroids

Interpreting Results

  • All clusters should have bars extending past the red dashed line
  • Bars should have similar widths (balanced clusters)
  • No cluster should have predominantly negative values
  • If the k with the best mean silhouette still produces a questionable clustering, check for:
    • Intra-cluster splitting
    • Inter-cluster overlap
    • Outliers pulling down scores
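The checks above can also be run programmatically instead of by eye. A sketch, on an assumed well-separated synthetic dataset, that reports per-cluster size, whether the cluster's band extends past the mean line, and the fraction of negative silhouette values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical data: three well-separated blobs
X, _ = make_blobs(n_samples=400, centers=[[-5, 0], [5, 0], [0, 8]],
                  cluster_std=1.0, random_state=1)
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)

avg = silhouette_score(X, labels)
values = silhouette_samples(X, labels)

for j in range(k):
    cluster_vals = values[labels == j]
    above_avg = cluster_vals.max() > avg       # band extends past the red line?
    frac_negative = (cluster_vals < 0).mean()  # predominantly negative?
    print(j, len(cluster_vals), above_avg, round(float(frac_negative), 2))
```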

Version Differences (scikit-learn 1.4+)

Important: Since scikit-learn 1.4 (2024):

  • n_init defaults to 'auto' instead of 10
  • 'auto' resolves to 1 when init='k-means++' (the default) and to 10 when init='random'
  • With the default init this means only a single run, which can produce different results than older versions

Recommendation: Always explicitly specify n_init for reproducibility:

KMeans(n_clusters=3, n_init=10, random_state=42)

Error Quick Reference

  • Symptom: ValueError: Number of labels is 1
    Root cause: the clustering result contains only one cluster
    Fix: check n_clusters and the data; ensure n_clusters >= 2
  • Symptom: the same k gives noticeably different results on each run
    Root cause: different initializations converge to different local optima
    Fix: fix random_state and explicitly increase n_init
  • Symptom: n_init is expected to default to 10, but the result looks like a single run
    Root cause: scikit-learn 1.4+ defaults n_init to 'auto'
    Fix: pass n_init explicitly (e.g. n_init=10 or 20)
  • Symptom: ConvergenceWarning: Number of distinct clusters found smaller than n_clusters
    Root cause: duplicate points or too few distinct values in the data
    Fix: count the unique samples; reduce n_clusters
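Two of the symptoms above can be reproduced in a few lines, which is a quick way to verify your environment behaves as the table describes (the small datasets here are assumptions for illustration):

```python
import warnings
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Symptom 1: silhouette_score requires at least 2 distinct labels
X = np.random.RandomState(0).rand(50, 2)
labels = np.zeros(50, dtype=int)  # every sample in a single cluster
raised = False
try:
    silhouette_score(X, labels)
except ValueError:
    raised = True  # "Number of labels is 1..."
print("ValueError raised:", raised)

# Symptom 4: more clusters requested than distinct points
X_dup = np.array([[0.0, 0.0], [1.0, 1.0]] * 10)  # only 2 unique points
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    KMeans(n_clusters=3, n_init=1).fit(X_dup)
print([str(w.message) for w in caught])
```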