# Big Data 213 - Python Hand-Written K-Means Clustering
## TL;DR

- **Scenario:** hand-write K-Means with NumPy/Pandas, run 3-cluster clustering on iris.txt, and output the centroids together with the clustering results.
- **Conclusion:** the implementation works, but engineering stability still requires empty-cluster handling, a max-iteration cap, version and data-type constraints, and unified feature scales.
- **Output:** a complete distEclud + randCent + kMeans pipeline, the result table result_set, and a quick-reference table of common errors and fixes.
## Python Implementation

### Import Dependencies

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell

# Show every expression result in a notebook cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"

# Fix negative-sign display issue on axes
plt.rcParams['axes.unicode_minus'] = False
# Fix Chinese font issue (note the capitalization: 'SimHei')
plt.rcParams['font.sans-serif'] = ['SimHei']
```
### Import Dataset

Using the Iris dataset as an example:

```python
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# Import the dataset (iris.txt has no header row)
iris = pd.read_csv("iris.txt", header=None)
iris.head()
iris.shape
```
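Before clustering, it is worth confirming that the file actually parsed as expected, since a wrong delimiter or path silently produces a malformed frame. A minimal check, using a simulated two-row iris.txt so the snippet is self-contained (the inline data is an assumption standing in for the real file):

```python
import io
import pandas as pd

# Simulated iris.txt contents (two rows shown; the real file has 150)
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
iris = pd.read_csv(io.StringIO(raw), header=None)

# A correct parse yields 5 columns: four float features plus a string label
print(iris.shape)                 # (2, 5)
print(iris.dtypes.tolist()[:4])   # the four feature columns should be float64
```

If the shape or dtypes look wrong, pass `sep=','` explicitly and inspect `iris.head()`.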
### Write Distance Calculation Function

We need a function that computes the Euclidean distance between two arrays of equal length. When we only compare distances rather than use their actual values, the sum of squared differences can stand in for the true distance: it skips the square root and reduces computation. Also, feature scales must be unified before any distance calculation, otherwise the model is biased toward the features with larger magnitudes.

Function: compute the Euclidean distance between two datasets.
- Input: two array-like datasets
- Return: the distance between them (here, the sum of squared differences)
```python
def distEclud(arrA, arrB):
    d = arrA - arrB
    dist = np.sum(np.power(d, 2), axis=1)
    return dist
```
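A quick sanity check of distEclud: broadcasting one point against a (k, n-1) centroid matrix yields one squared distance per centroid. The sample values below are made up for illustration.

```python
import numpy as np

def distEclud(arrA, arrB):
    d = arrA - arrB
    dist = np.sum(np.power(d, 2), axis=1)
    return dist

point = np.array([1.0, 2.0])      # one sample, two features
cents = np.array([[1.0, 2.0],     # identical to the point -> squared distance 0
                  [4.0, 6.0]])    # offset by (3, 4)       -> squared distance 25
print(distEclud(point, cents))    # [ 0. 25.]
```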
### Write Random Centroid Generation Function

The random centroid generation function follows these steps:

- Calculate the data range: iterate over each feature column in the dataset and find its minimum (min) and maximum (max)
- Generate random centroids: draw k centroids within that range, where k is the user-specified number of clusters

Parameters:

- Input: dataSet (complete dataset including the label column), k (number of centroids to generate)
- Output: data_cent (the k generated centroids)
```python
def randCent(dataSet, k):
    # n is the number of columns; dataSet is assumed to be a DataFrame
    n = dataSet.shape[1]
    # Min and max per feature column (the last column is the label, so skip it)
    data_min = dataSet.iloc[:, :n-1].min()
    data_max = dataSet.iloc[:, :n-1].max()
    # Draw k random centroids uniformly between the per-column min and max
    data_cent = np.random.uniform(data_min, data_max, (k, n-1))
    return data_cent
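A short check of randCent on a made-up frame (the toy data and seed are assumptions): the result has shape (k, n-1), and every coordinate stays inside the corresponding column's range.

```python
import numpy as np
import pandas as pd

def randCent(dataSet, k):
    n = dataSet.shape[1]
    data_min = dataSet.iloc[:, :n-1].min()
    data_max = dataSet.iloc[:, :n-1].max()
    return np.random.uniform(data_min, data_max, (k, n-1))

# Toy frame: two numeric features plus a label column
df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [10.0, 20.0, 30.0],
                   2: ['a', 'b', 'a']})
np.random.seed(0)
cents = randCent(df, k=3)
print(cents.shape)                                          # (3, 2)
print(((cents[:, 0] >= 1.0) & (cents[:, 0] <= 3.0)).all())  # True
```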
### Write K-Means Clustering Function
```python
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    # m is the number of rows (samples), n the number of columns
    m, n = dataSet.shape
    # Generate k random initial centroids
    centroids = createCent(dataSet, k)
    # clusterAssment columns: [min distance, current cluster, previous cluster]
    clusterAssment = np.zeros((m, 3))
    clusterAssment[:, 0] = np.inf
    clusterAssment[:, 1:3] = -1
    # Append the three bookkeeping columns to the dataset
    result_set = pd.concat([dataSet, pd.DataFrame(clusterAssment)],
                           axis=1, ignore_index=True)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # Assign each sample to its nearest centroid
        for i in range(m):
            dist = distMeas(dataSet.iloc[i, :n-1].values, centroids)
            result_set.iloc[i, n] = dist.min()
            result_set.iloc[i, n+1] = np.where(dist == dist.min())[0][0]
        # Compare current assignments (column n+1) with the previous ones (column n+2)
        clusterChanged = not (result_set.iloc[:, -1] == result_set.iloc[:, -2]).all()
        # If any assignment changed, recompute centroids and remember the assignments
        if clusterChanged:
            # numeric_only=True keeps a string label column from breaking mean()
            cent_df = result_set.groupby(n+1).mean(numeric_only=True)
            centroids = cent_df.iloc[:, :n-1].values
            result_set.iloc[:, -1] = result_set.iloc[:, -2]
    return centroids, result_set
```
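To watch the pipeline converge without iris.txt, here is a minimal end-to-end run on made-up, well-separated 2-D data (the toy points, numeric dummy labels, and seed are assumptions; the three functions are repeated so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

def distEclud(arrA, arrB):
    d = arrA - arrB
    return np.sum(np.power(d, 2), axis=1)

def randCent(dataSet, k):
    n = dataSet.shape[1]
    data_min = dataSet.iloc[:, :n-1].min()
    data_max = dataSet.iloc[:, :n-1].max()
    return np.random.uniform(data_min, data_max, (k, n-1))

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m, n = dataSet.shape
    centroids = createCent(dataSet, k)
    clusterAssment = np.zeros((m, 3))
    clusterAssment[:, 0] = np.inf
    clusterAssment[:, 1:3] = -1
    result_set = pd.concat([dataSet, pd.DataFrame(clusterAssment)],
                           axis=1, ignore_index=True)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            dist = distMeas(dataSet.iloc[i, :n-1].values, centroids)
            result_set.iloc[i, n] = dist.min()
            result_set.iloc[i, n+1] = np.where(dist == dist.min())[0][0]
        clusterChanged = not (result_set.iloc[:, -1] == result_set.iloc[:, -2]).all()
        if clusterChanged:
            cent_df = result_set.groupby(n+1).mean(numeric_only=True)
            centroids = cent_df.iloc[:, :n-1].values
            result_set.iloc[:, -1] = result_set.iloc[:, -2]
    return centroids, result_set

# Two obvious clusters around (0, 0.5) and (10, 10.5); last column is a dummy label
data = pd.DataFrame([[0.0,  0.0,  0],
                     [0.0,  1.0,  0],
                     [10.0, 10.0, 1],
                     [10.0, 11.0, 1]])
np.random.seed(1)
centroids, result_set = kMeans(data, 2)
print(centroids)                        # one centroid near each cluster mean
print(result_set.iloc[:, 4].tolist())   # cluster index per sample
```

The first two samples end up in one cluster and the last two in the other, with centroids at the cluster means.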
## Engineering Considerations

### Feature Scale Unification

K-Means relies on distance calculations, so differing feature scales can severely bias results toward the features with larger magnitudes. Always standardize or normalize the features before clustering.
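A minimal sketch of z-score standardization applied to the feature columns only (the helper name and toy frame are made up; the last column is assumed to be the label):

```python
import pandas as pd

def standardize_features(df):
    """Z-score the feature columns, leaving the label column untouched."""
    out = df.copy()
    feats = df.iloc[:, :-1]
    out.iloc[:, :-1] = (feats - feats.mean()) / feats.std()
    return out

# Toy frame: the two feature scales differ by 100x before scaling
df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [100.0, 200.0, 300.0],
                   2: ['a', 'b', 'a']})
scaled = standardize_features(df)
print(scaled.iloc[:, 0].tolist())   # [-1.0, 0.0, 1.0]
print(scaled.iloc[:, 1].tolist())   # [-1.0, 0.0, 1.0] -- same scale after z-scoring
```

After scaling, both features contribute comparably to the squared distances in distEclud.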
### Random Seed Setting

For reproducible results, set the random seed before generating centroids:

```python
np.random.seed(42)
```
### Max Iterations

To prevent the loop from running indefinitely, cap the number of iterations:

```python
max_iter = 100
```
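The cap can be a simple counter on the main loop. A sketch of the pattern with stand-in names (the loop body here is a stub; in kMeans it would contain the assignment and centroid-update steps):

```python
max_iter = 100
iteration = 0
clusterChanged = True
while clusterChanged and iteration < max_iter:
    iteration += 1
    # ... assignment and centroid-update steps go here ...
    clusterChanged = False  # stub: the real loop clears this only once assignments stop moving
print(iteration <= max_iter)   # True: the loop can never exceed max_iter passes
```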
### Empty Cluster Handling

When no points are assigned to a cluster, its centroid disappears from the groupby result. Detect and handle this case:

```python
# Count samples per cluster; clusters missing from the index are empty
cluster_counts = result_set.iloc[:, n+1].value_counts()
# Reset centroids for empty clusters
```
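One common remedy is to re-seed an empty cluster's centroid at a randomly chosen sample's features. A sketch with a hypothetical helper (reseed_empty is not part of the original pipeline, and the toy data is made up):

```python
import numpy as np
import pandas as pd

def reseed_empty(centroids, assignments, dataSet, k):
    """Re-seed any centroid whose cluster index never appears in assignments."""
    counts = pd.Series(assignments).value_counts()
    for j in range(k):
        if j not in counts.index:
            # Restart the empty centroid at a randomly chosen sample's features
            idx = np.random.randint(len(dataSet))
            centroids[j] = dataSet.iloc[idx, :dataSet.shape[1]-1].values
    return centroids

# Toy case: k=3, but cluster 2 received no samples
data = pd.DataFrame([[0.0,  0.0,  0],
                     [0.0,  1.0,  0],
                     [10.0, 10.0, 1],
                     [10.0, 11.0, 1]])
centroids = np.zeros((3, 2))
centroids = reseed_empty(centroids, [0, 0, 1, 1], data, 3)
print(centroids[2])   # now equals some sample's feature pair
```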
## Error Quick Reference

| Symptom | Root Cause | Fix |
|---|---|---|
| iris.txt read error or empty frame | Wrong path / file not in working directory | Check os.getcwd() and the pd.read_csv error; use an absolute path |
| iris.shape shows unexpected columns | Delimiter/encoding issue | Inspect with iris.head(); pass sep=',' explicitly |
| TypeError (string to float) during K-Means iteration | String label column participates in numeric operations | Check dataSet.dtypes; use result_set.groupby(...).mean(numeric_only=True) so only numeric columns are averaged |
| Centroid count drops (k=3 becomes 2) or NaN centroids | Empty cluster: no samples assigned to it | Check result_set.iloc[:, n+1].value_counts(); detect missing clusters and reseed their centroids |
| Program runs for a long time or appears stuck | No max_iter cap | Add max_iter and a tolerance threshold to the while loop |
| Results vary widely between runs, hard to reproduce | Random seed not set | Call np.random.seed(...) before randCent |
| Clustering dominated by one feature, results unreasonable | Feature scales not unified | Compare per-column min/max ranges; standardize or normalize first |