Big Data 216 - KMeans n_clusters Selection

TL;DR

Scenario: KMeans requires the number of clusters up front, and different initializations cause results to drift between runs.

Conclusion: Use the silhouette score to compare candidate values of k, and use k-means++ with a reasonable n_init to stabilize results; note that since scikit-learn 1.4, n_init defaults to 'auto'.

Output: Reusable silhouette analysis chart template + initialization parameter value rules + common error/version difference quick reference.

Case: Selecting n_clusters Based on Silhouette Score

Write Code

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

# Assume X is an already-defined dataset of shape (n_samples, n_features)
# Try candidate cluster counts 2, 4, 6, 8 (range(2, 10, 2) stops before 10)
for i in range(2, 10, 2):
    # Create figure and subplots
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # Set first subplot x and y axis ranges
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, X.shape[0] + (i + 1) * 10])

    # Generate KMeans model and fit data
    clusterer = KMeans(n_clusters=i, n_init=10, random_state=10)  # explicit n_init for reproducibility across scikit-learn versions
    cluster_labels = clusterer.fit_predict(X)

    # Calculate silhouette score
    silhouette_avg = silhouette_score(X, cluster_labels)
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    print(f"Number of clusters = {i}, Mean silhouette score = {silhouette_avg}")

    y_lower = 10
    for j in range(i):
        # Get silhouette values for j-th cluster
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == j]
        ith_cluster_silhouette_values.sort()

        size_cluster_j = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_j

        color = cm.nipy_spectral(float(j) / i)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          ith_cluster_silhouette_values,
                          facecolor=color, alpha=0.5)

        ax1.text(-0.05, y_lower + 0.5 * size_cluster_j, str(j))

        y_lower = y_upper + 10

    ax1.set_title("Silhouette Plot for Different Clusters")
    ax1.set_xlabel("Silhouette Coefficient Value")
    ax1.set_ylabel("Cluster Label")
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])

    # Subplot 2: Scatter plot of clusters
    colors = cm.nipy_spectral(cluster_labels.astype(float) / i)
    ax2.scatter(X[:, 0], X[:, 1], marker='o', s=8, c=colors)

    # Draw cluster centroids
    centers = clusterer.cluster_centers_
    ax2.scatter(centers[:, 0], centers[:, 1], marker='x', c="red", s=200)

    ax2.set_title("Cluster Data Visualization")
    ax2.set_xlabel("Feature Space of First Feature")
    ax2.set_ylabel("Feature Space of Second Feature")

    plt.suptitle(f"KMeans Silhouette Analysis for Sample Data -- Number of Clusters: {i}", fontsize=14, fontweight='bold')
    plt.show()
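The loop above assumes X already exists. A minimal sketch of the same k-selection idea, using a hypothetical synthetic dataset from make_blobs in place of X and picking the candidate k with the highest mean silhouette:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for X: four well-separated 2-D blobs
X, _ = make_blobs(n_samples=500,
                  centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
                  cluster_std=1.0, random_state=10)

scores = {}
for k in range(2, 10, 2):  # same candidates as the loop above: 2, 4, 6, 8
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k whose mean silhouette is highest
best_k = max(scores, key=scores.get)
print(f"Best k by mean silhouette: {best_k}")
```

On clearly separated blobs like these, the true number of clusters (4) wins; on real data the peak is usually less pronounced, which is why the per-cluster plots above are still worth inspecting.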

Important Parameters - Initial Centroid Selection

init

Accepts "k-means++", "random", or an array of shape (n_clusters, n_features).

  • Method for initializing centroids; default is "k-means++"
  • "k-means++" spreads the initial cluster centers apart, which typically speeds up convergence
  • If an array is passed, its shape must be (n_clusters, n_features) and it supplies the initial centroids directly
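A minimal sketch of passing an explicit centroid array as init (the dataset and centroid values here are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with two obvious groups
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 10]],
                  cluster_std=0.5, random_state=0)

# Explicit initial centroids: shape must be (n_clusters, n_features)
initial_centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# With an explicit init array, set n_init=1 (extra restarts would be pointless,
# and scikit-learn warns if n_init > 1 here)
km = KMeans(n_clusters=2, init=initial_centers, n_init=1).fit(X)
print(km.cluster_centers_.round(1))
```

Starting from centroids that are already close to the true group centers, the fit converges in very few iterations.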

n_init

Integer, default 10 (since scikit-learn 1.4 the default is 'auto'; see Version Differences below).

  • Number of times the algorithm is run with different centroid seeds
  • Each run is scored by its inertia (within-cluster sum of squared distances)
  • The best of the n_init runs, by inertia, is returned as the final result

random_state

Controls randomness, used for reproducing results.
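A short sketch of how these two parameters interact in practice: a fixed random_state makes a fit reproducible, and more restarts can only match or improve the best inertia (data below is an assumed synthetic example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Overlapping blobs, where local optima actually differ
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=7)

# Same random_state => the same fitted result (reproducibility)
a = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
b = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
print(a.inertia_, b.inertia_)  # identical for the same random_state

# Best inertia over 10 restarts is never worse than a single restart
single = KMeans(n_clusters=5, n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(multi.inertia_ <= single.inertia_)
```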

Important Parameters - Iteration Stopping

  1. max_iter (Maximum iterations):

    • Default is 300
    • Caps the number of iterations per run, bounding runtime when convergence is slow
  2. tol (Tolerance):

    • Relative tolerance on the change in cluster centers (Frobenius norm) between two consecutive iterations; once the change falls below tol, the run is declared converged
    • Default is 1e-4
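The effect of the iteration cap can be seen directly via the fitted model's n_iter_ attribute. A sketch on an assumed synthetic dataset, comparing a run cut off after one iteration with a run using the defaults:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=3)

# Cut off after a single Lloyd iteration
rough = KMeans(n_clusters=4, max_iter=1, n_init=1, random_state=0).fit(X)

# Defaults: max_iter=300, tol=1e-4 (run until center shift < tol)
fine = KMeans(n_clusters=4, max_iter=300, tol=1e-4, n_init=1, random_state=0).fit(X)

print(rough.n_iter_, fine.n_iter_)   # actual iterations used by each run
print(fine.inertia_ <= rough.inertia_)  # more iterations never increase inertia
```

Because both fits start from the same initialization (same random_state, n_init=1) and each Lloyd iteration can only decrease inertia, the fully converged run is at least as good.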

Understanding Silhouette Analysis Charts

The silhouette analysis chart consists of two parts:

  1. Silhouette Plot (left): Shows silhouette score for each sample

    • Each horizontal band represents one cluster
    • Band thickness represents the number of samples in that cluster
    • The red dashed line marks the mean silhouette score
  2. Cluster Scatter Plot (right): Shows actual cluster distribution

    • Points colored by cluster
    • X marks cluster centroids

Interpreting Results

  • All clusters should have bars extending past the red dashed line
  • Bars should have similar widths (balanced clusters)
  • No cluster should have predominantly negative values
  • If the k with the best mean silhouette still produces a questionable clustering, check for:
    • Intra-cluster splitting
    • Inter-cluster overlap
    • Outliers pulling down scores
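The checks above can also be run programmatically instead of by eye. A sketch, on an assumed well-separated synthetic dataset, that reports per-cluster size, whether the cluster's band extends past the mean line, and the fraction of negative silhouette values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical data: three well-separated blobs
X, _ = make_blobs(n_samples=400, centers=[[-5, 0], [5, 0], [0, 8]],
                  cluster_std=1.0, random_state=1)
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)

avg = silhouette_score(X, labels)
values = silhouette_samples(X, labels)

for j in range(k):
    cluster_vals = values[labels == j]
    above_avg = cluster_vals.max() > avg       # band extends past the red line?
    frac_negative = (cluster_vals < 0).mean()  # predominantly negative?
    print(j, len(cluster_vals), above_avg, round(float(frac_negative), 2))
```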

Version Differences (scikit-learn 1.4+)

Important: Since scikit-learn 1.4 (2024):

  • n_init defaults to 'auto' instead of 10
  • 'auto' resolves to 1 when init='k-means++' (the default) and to 10 when init='random'
  • With the default init this means only a single run, which can produce different results than older versions

Recommendation: Always explicitly specify n_init for reproducibility:

KMeans(n_clusters=3, n_init=10, random_state=42)

Error Quick Reference

  • Symptom: ValueError: Number of labels is 1
    Root cause: the clustering result contains only one cluster
    Fix: check n_clusters and the data; ensure n_clusters >= 2
  • Symptom: the same k gives noticeably different results on each run
    Root cause: different initializations converge to different local optima
    Fix: fix random_state and explicitly increase n_init
  • Symptom: n_init is expected to default to 10, but the result looks like a single run
    Root cause: scikit-learn 1.4+ defaults n_init to 'auto'
    Fix: pass n_init explicitly (e.g. n_init=10 or 20)
  • Symptom: ConvergenceWarning: Number of distinct clusters found smaller than n_clusters
    Root cause: duplicate points or too few distinct values in the data
    Fix: count the unique samples; reduce n_clusters
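Two of the symptoms above can be reproduced in a few lines, which is a quick way to verify your environment behaves as the table describes (the small datasets here are assumptions for illustration):

```python
import warnings
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Symptom 1: silhouette_score requires at least 2 distinct labels
X = np.random.RandomState(0).rand(50, 2)
labels = np.zeros(50, dtype=int)  # every sample in a single cluster
raised = False
try:
    silhouette_score(X, labels)
except ValueError:
    raised = True  # "Number of labels is 1..."
print("ValueError raised:", raised)

# Symptom 4: more clusters requested than distinct points
X_dup = np.array([[0.0, 0.0], [1.0, 1.0]] * 10)  # only 2 unique points
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    KMeans(n_clusters=3, n_init=1).fit(X_dup)
print([str(w.message) for w in caught])
```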