# Big Data 213 - Python Hand-Written K-Means Clustering
## TL;DR

- **Scenario:** hand-write K-Means with NumPy/Pandas, run 3-cluster clustering on iris.txt, and output the centroids together with the clustering results.
- **Conclusion:** the implementation works, but engineering stability still requires empty-cluster handling, a max-iteration cap, version and data-type constraints, and unified feature scales.
- **Output:** a complete distEclud + randCent + kMeans pipeline, the result table result_set, and a quick-reference table of common errors and fixes.
## Python Implementation

### Import Dependencies

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell

# Show every expression result in a notebook cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"

# Fix negative-sign display issue on axes
plt.rcParams['axes.unicode_minus'] = False
# Fix Chinese font issue (note the capitalization: 'SimHei')
plt.rcParams['font.sans-serif'] = ['SimHei']
```
### Import Dataset

Using the Iris dataset as an example:

```python
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# Import the dataset (iris.txt has no header row)
iris = pd.read_csv("iris.txt", header=None)
iris.head()
iris.shape
```
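Before clustering, it is worth confirming that the file actually parsed as expected, since a wrong delimiter or path silently produces a malformed frame. A minimal check, using a simulated two-row iris.txt so the snippet is self-contained (the inline data is an assumption standing in for the real file):

```python
import io
import pandas as pd

# Simulated iris.txt contents (two rows shown; the real file has 150)
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
iris = pd.read_csv(io.StringIO(raw), header=None)

# A correct parse yields 5 columns: four float features plus a string label
print(iris.shape)                 # (2, 5)
print(iris.dtypes.tolist()[:4])   # the four feature columns should be float64
```

If the shape or dtypes look wrong, pass `sep=','` explicitly and inspect `iris.head()`.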
### Write Distance Calculation Function

We need a function that computes the Euclidean distance between two arrays of equal length. When we only compare distances rather than use their actual values, the sum of squared differences can stand in for the true distance: it skips the square root and reduces computation. Also, feature scales must be unified before any distance calculation, otherwise the model is biased toward the features with larger magnitudes.

Function: compute the Euclidean distance between two datasets.
- Input: two array-like datasets
- Return: the distance between them (here, the sum of squared differences)
```python
def distEclud(arrA, arrB):
    d = arrA - arrB
    dist = np.sum(np.power(d, 2), axis=1)
    return dist
```
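A quick sanity check of distEclud: broadcasting one point against a (k, n-1) centroid matrix yields one squared distance per centroid. The sample values below are made up for illustration.

```python
import numpy as np

def distEclud(arrA, arrB):
    d = arrA - arrB
    dist = np.sum(np.power(d, 2), axis=1)
    return dist

point = np.array([1.0, 2.0])      # one sample, two features
cents = np.array([[1.0, 2.0],     # identical to the point -> squared distance 0
                  [4.0, 6.0]])    # offset by (3, 4)       -> squared distance 25
print(distEclud(point, cents))    # [ 0. 25.]
```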
### Write Random Centroid Generation Function

The random centroid generation function follows these steps:

- Calculate the data range: iterate over each feature column in the dataset and find its minimum (min) and maximum (max)
- Generate random centroids: draw k centroids within that range, where k is the user-specified number of clusters

Parameters:

- Input: dataSet (complete dataset including the label column), k (number of centroids to generate)
- Output: data_cent (the k generated centroids)
```python
def randCent(dataSet, k):
    # n is the number of columns; dataSet is assumed to be a DataFrame
    n = dataSet.shape[1]
    # Min and max per feature column (the last column is the label, so skip it)
    data_min = dataSet.iloc[:, :n-1].min()
    data_max = dataSet.iloc[:, :n-1].max()
    # Draw k random centroids uniformly between the per-column min and max
    data_cent = np.random.uniform(data_min, data_max, (k, n-1))
    return data_cent
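A short check of randCent on a made-up frame (the toy data and seed are assumptions): the result has shape (k, n-1), and every coordinate stays inside the corresponding column's range.

```python
import numpy as np
import pandas as pd

def randCent(dataSet, k):
    n = dataSet.shape[1]
    data_min = dataSet.iloc[:, :n-1].min()
    data_max = dataSet.iloc[:, :n-1].max()
    return np.random.uniform(data_min, data_max, (k, n-1))

# Toy frame: two numeric features plus a label column
df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [10.0, 20.0, 30.0],
                   2: ['a', 'b', 'a']})
np.random.seed(0)
cents = randCent(df, k=3)
print(cents.shape)                                          # (3, 2)
print(((cents[:, 0] >= 1.0) & (cents[:, 0] <= 3.0)).all())  # True
```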
### Write K-Means Clustering Function
```python
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    # m is the number of rows (samples), n the number of columns
    m, n = dataSet.shape
    # Generate k random initial centroids
    centroids = createCent(dataSet, k)
    # clusterAssment columns: [min distance, current cluster, previous cluster]
    clusterAssment = np.zeros((m, 3))
    clusterAssment[:, 0] = np.inf
    clusterAssment[:, 1:3] = -1
    # Append the three bookkeeping columns to the dataset
    result_set = pd.concat([dataSet, pd.DataFrame(clusterAssment)],
                           axis=1, ignore_index=True)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # Assign each sample to its nearest centroid
        for i in range(m):
            dist = distMeas(dataSet.iloc[i, :n-1].values, centroids)
            result_set.iloc[i, n] = dist.min()
            result_set.iloc[i, n+1] = np.where(dist == dist.min())[0][0]
        # Compare current assignments (column n+1) with the previous ones (column n+2)
        clusterChanged = not (result_set.iloc[:, -1] == result_set.iloc[:, -2]).all()
        # If any assignment changed, recompute centroids and remember the assignments
        if clusterChanged:
            # numeric_only=True keeps a string label column from breaking mean()
            cent_df = result_set.groupby(n+1).mean(numeric_only=True)
            centroids = cent_df.iloc[:, :n-1].values
            result_set.iloc[:, -1] = result_set.iloc[:, -2]
    return centroids, result_set
```
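To watch the pipeline converge without iris.txt, here is a minimal end-to-end run on made-up, well-separated 2-D data (the toy points, numeric dummy labels, and seed are assumptions; the three functions are repeated so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

def distEclud(arrA, arrB):
    d = arrA - arrB
    return np.sum(np.power(d, 2), axis=1)

def randCent(dataSet, k):
    n = dataSet.shape[1]
    data_min = dataSet.iloc[:, :n-1].min()
    data_max = dataSet.iloc[:, :n-1].max()
    return np.random.uniform(data_min, data_max, (k, n-1))

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m, n = dataSet.shape
    centroids = createCent(dataSet, k)
    clusterAssment = np.zeros((m, 3))
    clusterAssment[:, 0] = np.inf
    clusterAssment[:, 1:3] = -1
    result_set = pd.concat([dataSet, pd.DataFrame(clusterAssment)],
                           axis=1, ignore_index=True)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            dist = distMeas(dataSet.iloc[i, :n-1].values, centroids)
            result_set.iloc[i, n] = dist.min()
            result_set.iloc[i, n+1] = np.where(dist == dist.min())[0][0]
        clusterChanged = not (result_set.iloc[:, -1] == result_set.iloc[:, -2]).all()
        if clusterChanged:
            cent_df = result_set.groupby(n+1).mean(numeric_only=True)
            centroids = cent_df.iloc[:, :n-1].values
            result_set.iloc[:, -1] = result_set.iloc[:, -2]
    return centroids, result_set

# Two obvious clusters around (0, 0.5) and (10, 10.5); last column is a dummy label
data = pd.DataFrame([[0.0,  0.0,  0],
                     [0.0,  1.0,  0],
                     [10.0, 10.0, 1],
                     [10.0, 11.0, 1]])
np.random.seed(1)
centroids, result_set = kMeans(data, 2)
print(centroids)                        # one centroid near each cluster mean
print(result_set.iloc[:, 4].tolist())   # cluster index per sample
```

The first two samples end up in one cluster and the last two in the other, with centroids at the cluster means.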
## Engineering Considerations

### Feature Scale Unification

K-Means relies on distance calculations, so differing feature scales can severely bias results toward the features with larger magnitudes. Always standardize or normalize the features before clustering.
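A minimal sketch of z-score standardization applied to the feature columns only (the helper name and toy frame are made up; the last column is assumed to be the label):

```python
import pandas as pd

def standardize_features(df):
    """Z-score the feature columns, leaving the label column untouched."""
    out = df.copy()
    feats = df.iloc[:, :-1]
    out.iloc[:, :-1] = (feats - feats.mean()) / feats.std()
    return out

# Toy frame: the two feature scales differ by 100x before scaling
df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [100.0, 200.0, 300.0],
                   2: ['a', 'b', 'a']})
scaled = standardize_features(df)
print(scaled.iloc[:, 0].tolist())   # [-1.0, 0.0, 1.0]
print(scaled.iloc[:, 1].tolist())   # [-1.0, 0.0, 1.0] -- same scale after z-scoring
```

After scaling, both features contribute comparably to the squared distances in distEclud.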
### Random Seed Setting

For reproducible results, set the random seed before generating centroids:

```python
np.random.seed(42)
```
### Max Iterations

To prevent the loop from running indefinitely, cap the number of iterations:

```python
max_iter = 100
```
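The cap can be a simple counter on the main loop. A sketch of the pattern with stand-in names (the loop body here is a stub; in kMeans it would contain the assignment and centroid-update steps):

```python
max_iter = 100
iteration = 0
clusterChanged = True
while clusterChanged and iteration < max_iter:
    iteration += 1
    # ... assignment and centroid-update steps go here ...
    clusterChanged = False  # stub: the real loop clears this only once assignments stop moving
print(iteration <= max_iter)   # True: the loop can never exceed max_iter passes
```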
### Empty Cluster Handling

When no points are assigned to a cluster, its centroid disappears from the groupby result. Detect and handle this case:

```python
# Count samples per cluster; clusters missing from the index are empty
cluster_counts = result_set.iloc[:, n+1].value_counts()
# Reset centroids for empty clusters
```
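One common remedy is to re-seed an empty cluster's centroid at a randomly chosen sample's features. A sketch with a hypothetical helper (reseed_empty is not part of the original pipeline, and the toy data is made up):

```python
import numpy as np
import pandas as pd

def reseed_empty(centroids, assignments, dataSet, k):
    """Re-seed any centroid whose cluster index never appears in assignments."""
    counts = pd.Series(assignments).value_counts()
    for j in range(k):
        if j not in counts.index:
            # Restart the empty centroid at a randomly chosen sample's features
            idx = np.random.randint(len(dataSet))
            centroids[j] = dataSet.iloc[idx, :dataSet.shape[1]-1].values
    return centroids

# Toy case: k=3, but cluster 2 received no samples
data = pd.DataFrame([[0.0,  0.0,  0],
                     [0.0,  1.0,  0],
                     [10.0, 10.0, 1],
                     [10.0, 11.0, 1]])
centroids = np.zeros((3, 2))
centroids = reseed_empty(centroids, [0, 0, 1, 1], data, 3)
print(centroids[2])   # now equals some sample's feature pair
```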
## Error Quick Reference

| Symptom | Root Cause | Fix |
|---|---|---|
| iris.txt read error or empty frame | Wrong path / file not in working directory | Check os.getcwd() and the pd.read_csv error; use an absolute path |
| iris.shape shows unexpected columns | Delimiter/encoding issue | Inspect with iris.head(); pass sep=',' explicitly |
| TypeError (string to float) during K-Means iteration | String label column participates in numeric operations | Check dataSet.dtypes; use result_set.groupby(...).mean(numeric_only=True) so only numeric columns are averaged |
| Centroid count drops (k=3 becomes 2) or NaN centroids | Empty cluster: no samples assigned to it | Check result_set.iloc[:, n+1].value_counts(); detect missing clusters and reseed their centroids |
| Program runs for a long time or appears stuck | No max_iter cap | Add max_iter and a tolerance threshold to the while loop |
| Results vary widely between runs, hard to reproduce | Random seed not set | Call np.random.seed(...) before randCent |
| Clustering dominated by one feature, results unreasonable | Feature scales not unified | Compare per-column min/max ranges; standardize or normalize first |