Big Data 214 - K-Means Clustering: sklearn Implementation
TL;DR
Scenario: Verify the self-written K-Means on the 2D testSet, then switch to sklearn's KMeans for real data and inspect results via labels_ while learning the key parameters.
Conclusion: The self-written version hinges on data shape and label-column conventions; sklearn hinges on choosing n_clusters, convergence settings, and cross-version parameter compatibility (the algorithm='auto' risk).
Output: A reusable verification pipeline (read → visualize → cluster → centroid visualization) plus a quick reference for KMeans parameters and common errors.
Algorithm Verification
After writing the functions, first test the model on the testSet dataset (a 2D dataset, so the clustering result can be checked visually). Each observation in testSet has only two features, and the values are tab-separated, so pd.read_table() (tab-delimited by default) can be used.
import numpy as np
import pandas as pd

testSet = pd.read_table('testSet.txt', header=None)
testSet.head()
testSet.shape
Then use a 2D plane plot to observe its distribution:
plt.scatter(testSet.iloc[:,0], testSet.iloc[:,1]);
From the plot, the data appear to cluster in the four corners of the space. Next, we verify this by applying our K-Means algorithm. Before running it, add a dummy label column:
label = pd.DataFrame(np.zeros(testSet.shape[0]).reshape(-1, 1))
test_set = pd.concat([testSet, label], axis=1, ignore_index=True)
test_set.head()
Run the algorithm. Based on the 2D plane distribution, we can set four centroids, dividing into four clusters:
test_cent, test_cluster = kMeans(test_set, 4)
test_cent
test_cluster.head()
Visualize the clustering results:
import matplotlib.pyplot as plt
# Draw clustered points
plt.scatter(test_cluster.iloc[:, 0], test_cluster.iloc[:, 1], c=test_cluster.iloc[:, -1], cmap='viridis')
# Draw cluster centroids
plt.scatter(test_cent[:, 0], test_cent[:, 1], color='red', marker='x', s=100)
plt.title('Cluster Plot with Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
sklearn Implementation of K-Means
from sklearn.cluster import KMeans

# KMeans initialization example
kmeans = KMeans(
    n_clusters=8,        # Number of clusters
    init='k-means++',    # Method for initializing centroids
    n_init=10,           # Number of reruns with different centroid seeds
    max_iter=300,        # Maximum iterations per run
    tol=0.0001,          # Convergence tolerance on centroid movement
    verbose=0,           # Log output verbosity
    random_state=None,   # Random seed controlling initialization
    copy_x=True,         # Whether to copy X instead of modifying it in place
    algorithm='lloyd'    # 'auto' is deprecated (removed in sklearn 1.3); use 'lloyd'
)
n_clusters
n_clusters is the K in K-Means: the number of clusters we tell the model to form. It is the one parameter you almost always set yourself; the default is 8, but the appropriate number of clusters is usually smaller. Since we typically don't know the right n_clusters before clustering, we need to explore it.
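One common way to explore K is to fit KMeans over a range of n_clusters and compare the inertia (within-cluster sum of squares, which always shrinks as K grows, so look for an "elbow") together with the silhouette score. A minimal sketch, using make_blobs as stand-in data since the real dataset varies:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data standing in for a real dataset (4 well-separated clusters)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

# Fit KMeans over a range of K and record both criteria
inertias, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    sil[k] = silhouette_score(X, km.labels_)
    print(f"K={k}  inertia={inertias[k]:.1f}  silhouette={sil[k]:.3f}")
```

Inertia alone cannot pick K (it keeps decreasing), which is why the silhouette score, bounded in [-1, 1] and maximized at well-separated clusterings, is a useful companion.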
cluster.labels_
The important attribute labels_ holds the cluster assigned to each sample:
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
import numpy as np
# Load dataset
data = load_breast_cancer()
X = data.data
# Define number of clusters
n_clusters = 3
# Use KMeans for clustering
cluster = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
# Get cluster labels
y_pred = cluster.labels_
# Output cluster labels
print(y_pred)
predict and fit_predict
- predict: Assigns each sample in X to the nearest centroid of an already-fitted model
- fit_predict: Fits the model on X and returns the cluster labels in one step, equivalent to calling .fit(X) followed by .predict(X)
# For the full training data:
# cluster.fit(X).predict(X) == cluster.fit_predict(X) == cluster.labels_
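This equivalence can be checked directly; a minimal sketch reusing the breast cancer data from above, with random_state fixed so the runs are comparable:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_fp = km.fit_predict(X)  # fit and label in one call

# fit_predict returns the fitted model's labels_ ...
print(np.array_equal(labels_fp, km.labels_))
# ... and on the training data it matches fit followed by predict
print(np.array_equal(labels_fp, km.predict(X)))
```

On new, unseen data only predict applies: it maps each sample to the nearest centroid learned during fit.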
Key Parameters Summary
| Parameter | Description | Common Values |
|---|---|---|
| n_clusters | Number of clusters (K) | 2-10 (data-dependent) |
| init | Initialization method | 'k-means++' (default), 'random' |
| n_init | Number of initializations | 10 (default) |
| max_iter | Maximum iterations | 300 (default) |
| tol | Convergence tolerance | 1e-4 (default) |
| random_state | Random seed | Integer for reproducibility |
| algorithm | Algorithm type | 'lloyd' (recommended) |
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| K-Means clustering abnormal/all points in one class | Input data column convention inconsistency | Check kMeans internal feature column slicing logic |
| sklearn KMeans deprecation warning | algorithm='auto' was deprecated and removed in newer versions | Change algorithm to 'lloyd' |
| make_blobs centers=4, but later plotted as two classes y=0/1 | Example logic inconsistency | Unify the example as "4 clusters" and explain the range of y |
| Cluster colors messy/can’t distinguish clusters | c parameter label column dtype not as expected | Ensure label column is integer category; explicitly convert to int before plotting |
| Results not reproducible: centroids different each run | random_state=None | Fix random_state; increase n_init if needed |
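For the last row of the table, fixing random_state makes repeated runs fully deterministic; a minimal sketch, with make_blobs standing in for real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Two independent estimators with the same seed try the same
# initializations and converge to identical results
run1 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
run2 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

print(np.array_equal(run1.labels_, run2.labels_))                  # True
print(np.allclose(run1.cluster_centers_, run2.cluster_centers_))   # True
```

Raising n_init helps a different problem: it reruns the algorithm from multiple initializations and keeps the best, reducing the chance of landing in a poor local optimum.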