Big Data 214 - K-Means Clustering: sklearn Implementation

TL;DR

Scenario: Verify the self-written K-Means on the 2D testSet, then switch to sklearn's KMeans for real data and understand its key parameters and the labels_ attribute.

Conclusion: The self-written version turns on data shape and label-column conventions; sklearn turns on n_clusters selection, convergence, and cross-version parameter compatibility (the algorithm='auto' risk).

Output: A reusable verification pipeline (read → visualize → cluster → centroid visualization) + a quick reference for KMeans parameters and common errors.

Algorithm Verification

After writing the functions, first test the model on the testSet dataset (a 2D dataset is used so the clustering result is easy to visualize). Each observation in testSet has only two features, and values are separated by whitespace, so pd.read_table() can read it directly.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

testSet = pd.read_table('testSet.txt', header=None)
testSet.head()
testSet.shape

Then use a 2D plane plot to observe its distribution:

plt.scatter(testSet.iloc[:,0], testSet.iloc[:,1]);

From the plot, the data appears to be distributed in the four corners of the space. Next, we verify this by applying our K-Means algorithm. Before running, add a dummy label column (the self-written kMeans expects the label in the last column):

label = pd.DataFrame(np.zeros(testSet.shape[0]).reshape(-1, 1))
test_set = pd.concat([testSet, label], axis=1, ignore_index=True)
test_set.head()

Run the algorithm. Based on the 2D distribution, we set four centroids, dividing the data into four clusters:

test_cent, test_cluster = kMeans(test_set, 4)
test_cent
test_cluster.head()

Visualize the clustering results:

import matplotlib.pyplot as plt
# Draw clustered points
plt.scatter(test_cluster.iloc[:, 0], test_cluster.iloc[:, 1], c=test_cluster.iloc[:, -1], cmap='viridis')
# Draw cluster centroids
plt.scatter(test_cent[:, 0], test_cent[:, 1], color='red', marker='x', s=100)
plt.title('Cluster Plot with Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

sklearn Implementation of K-Means

from sklearn.cluster import KMeans
# KMeans initialization example
kmeans = KMeans(
    n_clusters=8,        # Number of clusters (K)
    init='k-means++',    # Method for initializing centroids
    n_init=10,           # Number of independent runs with different centroid seeds
    max_iter=300,        # Maximum iterations per run
    tol=0.0001,          # Convergence tolerance on centroid movement
    verbose=0,           # Log output verbosity
    random_state=None,   # Random seed controlling centroid initialization
    copy_x=True,         # Whether to copy X rather than modify it in place
    algorithm='lloyd'    # 'auto' is deprecated in recent versions; use 'lloyd'
)

n_clusters

n_clusters is the K in K-Means: the number of clusters we ask the model to produce. It is the only required parameter, defaulting to 8, though the appropriate value is usually smaller. Since we rarely know the right n_clusters before clustering, it has to be explored.
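One common way to explore n_clusters is the elbow method: fit KMeans for a range of K values and watch where the inertia (within-cluster sum of squares) stops dropping sharply. A minimal sketch on synthetic data (make_blobs with 4 centers is an assumption for illustration, not the testSet data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with 4 well-separated blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia (within-cluster sum of squares) for each candidate K
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
```

The inertia always decreases as K grows, so the point of interest is the "elbow" where the decrease flattens; here that should happen around K=4.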

cluster.labels_

The important attribute labels_ holds the cluster assignment for each sample:

from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load dataset
data = load_breast_cancer()
X = data.data

# Define number of clusters
n_clusters = 3

# Use KMeans for clustering
cluster = KMeans(n_clusters=n_clusters, random_state=0).fit(X)

# Get cluster labels
y_pred = cluster.labels_

# Output cluster labels
print(y_pred)
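As a quick sanity check on labels_, it can help to count how many samples land in each cluster; a small follow-up to the example above using np.unique:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data
cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Count samples per cluster label
labels, counts = np.unique(cluster.labels_, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

A heavily lopsided count (e.g. nearly all points in one cluster) is often the first visible symptom of the data-convention problems listed in the error table below.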

predict and fit_predict

  • predict: after the model is fitted, assigns each sample to its nearest learned centroid; intended for new data
  • fit_predict: fits the model on X and returns the labels in one step, with no separate .fit() call
# On the same data these give identical results:
# cluster.fit_predict(X) is equivalent to cluster.fit(X).labels_
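The equivalence can be checked directly; a minimal sketch on synthetic data (make_blobs is an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# fit_predict(X) returns the same labels as fit(X).labels_
km1 = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_a = km1.fit_predict(X)

km2 = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_b = km2.fit(X).labels_

print(np.array_equal(labels_a, labels_b))  # True with a fixed random_state
```

Note that the equivalence only holds run-to-run when random_state is fixed; with random_state=None, two fits may converge to differently numbered clusters.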

Key Parameters Summary

| Parameter    | Description             | Common Values                    |
|--------------|-------------------------|----------------------------------|
| n_clusters   | Number of clusters (K)  | 2-10 (data-dependent)            |
| init         | Initialization method   | 'k-means++' (default), 'random'  |
| n_init       | Number of initializations | 10 (default)                   |
| max_iter     | Maximum iterations      | 300 (default)                    |
| tol          | Convergence tolerance   | 1e-4 (default)                   |
| random_state | Random seed             | Integer for reproducibility      |
| algorithm    | Algorithm type          | 'lloyd' (recommended)            |
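The random_state row deserves a demonstration: fixing the seed makes centroids identical across runs, which is what makes results reproducible. A minimal sketch (make_blobs is an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Two fits with the same random_state yield identical centroids
c1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).cluster_centers_
c2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).cluster_centers_
print(np.allclose(c1, c2))  # True
```

With random_state=None each fit draws fresh initial centroids, so centroid coordinates (and label numbering) can differ between runs even when the underlying partition is the same.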

Error Quick Reference

| Symptom | Root Cause | Fix |
|---------|------------|-----|
| K-Means clustering abnormal / all points in one class | Inconsistent input data column convention | Check the feature-column slicing logic inside kMeans |
| sklearn KMeans deprecation warning | algorithm='auto' deprecated in newer versions | Change algorithm to 'lloyd' |
| make_blobs centers=4, but later plotted as "two classes y=0/1" | Inconsistent example logic | Unify as a "4 clusters" example and explain the range of y |
| Cluster colors messy / clusters indistinguishable | c parameter's label column dtype not as expected | Ensure the label column is an integer category; convert to int before plotting |
| Results not reproducible: centroids differ each run | random_state=None | Fix random_state; increase n_init if needed |
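The "messy cluster colors" row ties back to the dummy label column created earlier with np.zeros, which is float-typed. A minimal sketch of the fix, using a hypothetical DataFrame that follows the same last-column-is-label convention:

```python
import pandas as pd

# Hypothetical frame in the self-written pipeline's convention:
# two feature columns plus a float label column in the last position
df = pd.DataFrame({0: [1.0, 2.0, 8.0], 1: [1.5, 1.0, 9.0], 2: [0.0, 0.0, 1.0]})

# The label column is float (it started as np.zeros); cast it to int
# so plt.scatter(..., c=labels) treats it as discrete categories
labels = df.iloc[:, -1].astype(int)
print(labels.tolist())  # [0, 0, 1]
```

After the cast, passing labels as the c argument with a discrete colormap such as 'viridis' gives clearly separated cluster colors.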