Big Data 195 - KNN/K-Nearest Neighbors Algorithm Practice
TL;DR
- Scenario: supervised classification on small, low-dimensional datasets, where "similar samples vote" gives quick results
- Conclusion: the core of KNN is a distance metric plus K-neighbor voting; feature scale determines distance reliability, and K trades off bias against variance
- Output: a set of reproducible experiments (wine binary classification), the distance calculation/sorting/voting flow, and a KNN function encapsulation
Supervised Learning Algorithm
KNN/K-Nearest Neighbors Algorithm
The core idea of K-Nearest Neighbors (KNN) is that similarity can be judged by the distance between samples. Specifically, it computes the feature-space distance between the sample to be classified and every sample in the training set (common distance metrics include Euclidean distance, Manhattan distance, and cosine similarity). If two samples are close enough in feature space, they are considered highly similar and likely to belong to the same category.
In practice, relying on a single nearest neighbor is easily swayed by noise and outliers, making classification unstable. The KNN algorithm therefore selects the K nearest samples (the K nearest neighbors), which form the local neighborhood of the sample to be classified. The algorithm counts the category labels among these neighbors (a label is a sample's true category, such as "cat" or "dog") and then applies a voting mechanism: the category that appears most often among the K neighbors becomes the prediction for the sample to be classified.
For example, in image classification with K=5, if the 5 nearest neighbors of the image to be classified contain 3 "cat" images and 2 "dog" images, the algorithm classifies the image as "cat". This local-neighborhood voting gives KNN good robustness to noisy data, while the choice of K (usually determined through cross-validation) directly affects classification performance.
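The voting step in the example above can be sketched in a few lines with `collections.Counter` (the neighbor labels here are just the hypothetical cat/dog example):

```python
from collections import Counter

# labels of the K=5 nearest neighbors from the example above
neighbors = ["cat", "cat", "dog", "cat", "dog"]

# the most frequent label among the neighbors wins the vote
prediction = Counter(neighbors).most_common(1)[0][0]
print(prediction)  # cat
```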
Implementation Process
Assume X_test is the sample to be labeled, X_train is the labeled dataset.
- Iterate over all samples in the labeled dataset, compute the distance from each sample to the point to be labeled, and store the distances in a Distance array
- Sort the Distance array and take the K nearest points, recorded as X_knn
- Count each category in X_knn: how many class0 samples, how many class1 samples
- The category of the sample to be labeled is the majority category in X_knn
Distance Determination
The algorithm's "distance" on a two-dimensional plane is simply the distance between two points. There are many formulas for calculating distance; the most common is the Euclidean distance.
Euclidean distance formula: $$d(A, B) = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$$
When there are many features, the samples live in an N-dimensional space. For two points A and B with coordinates $(a_1, \dots, a_n)$ and $(b_1, \dots, b_n)$: $$d(A, B) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}$$
In machine learning, the coordinate axes $x_1, x_2, x_3, \dots$ correspond exactly to the N features of our samples.
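As a quick numeric check, here is the N-dimensional formula applied to two of the wine samples used later in this article (features: Color Depth, Alcohol):

```python
import numpy as np

a = np.array([14.13, 5.64])  # sample A: (Color Depth, Alcohol)
b = np.array([12.07, 2.76])  # sample B
d = np.sqrt(np.sum((a - b) ** 2))
print(d)  # same result as np.linalg.norm(a - b)
```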
K Value Selection
- K value mechanism:
- When k=1, model only considers one nearest sample point, making decision boundary very complex
- As K value increases, model considers more neighbors’ voting results, making decision boundary tend toward smooth
- K value and model bias relationship:
- Larger K values (e.g., k=15) increase model bias because decisions are based on average of more samples
- This makes model less sensitive to individual noisy data points (like mislabeled samples)
- In extreme case, when K approaches training set size, model will simply predict the majority class
- K value and model variance relationship:
- Smaller K values (e.g., k=1 or 3) increase model variance
- Model easily captures random fluctuations and noise in training data
- Parameter selection practical experience:
- Usually start trying from k=5, this is an empirical rule
- Can find optimal K through cross-validation, common method is plotting K value vs accuracy curve
- In sklearn, can use GridSearchCV for K value tuning
- K value selection in different scenarios:
- For noisy datasets (like sensor data), recommend larger K values (7-15)
- For clearly separable data (like MNIST handwritten digits), smaller K values (3-5) may be more suitable
- When class distribution is imbalanced, K value shouldn’t be too small, otherwise easily affected by minority class samples
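The "K vs. accuracy curve" idea can be sketched without sklearn by running leave-one-out evaluation over the ten wine samples used later in this article. This is a minimal illustration of the procedure, not a tuning recipe; with only ten samples the curve is very coarse:

```python
import numpy as np

# The ten wine samples from the dataset below: (Color Depth, Alcohol), labels 0/1
X = np.array([[14.13, 5.64], [13.2, 4.28], [13.16, 5.68], [14.27, 4.80], [13.24, 4.22],
              [12.07, 2.76], [12.43, 3.94], [11.79, 3.1], [12.37, 2.12], [12.04, 2.6]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def loo_accuracy(k):
    """Leave-one-out accuracy for a given K:
    classify each sample using its K nearest neighbors among the rest."""
    correct = 0
    for i in range(len(X)):
        rest_X = np.delete(X, i, axis=0)
        rest_y = np.delete(y, i)
        dist = np.sqrt(((rest_X - X[i]) ** 2).sum(axis=1))
        neighbors = rest_y[np.argsort(dist)[:k]]
        pred = np.bincount(neighbors).argmax()   # majority vote
        correct += int(pred == y[i])
    return correct / len(X)

# plotting accuracy against K reveals which K generalizes best
for k in (1, 3, 5, 7):
    print(k, loo_accuracy(k))
```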
Algorithm Variants
Variant 1: Weighted KNN
By default, all K neighbors carry equal weight in the vote, but you can assign different weights to different neighbors, e.g., giving closer neighbors higher weight. In scikit-learn this is done via the classifier's weights parameter (weights='distance').
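A hand-rolled sketch of the idea (the function name and the 1/distance weighting scheme here are illustrative choices; sklearn's weights='distance' behaves similarly):

```python
import numpy as np

def weighted_knn_predict(new_point, X, y, k=3):
    """Distance-weighted KNN: each of the K nearest neighbors
    votes with weight 1/distance, so closer points count more."""
    dist = np.sqrt(((X - new_point) ** 2).sum(axis=1))
    idx = np.argsort(dist)[:k]
    weights = 1.0 / (dist[idx] + 1e-12)  # epsilon avoids division by zero on exact matches
    votes = {}
    for label, w in zip(y[idx], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# toy usage: two nearby class-0 points outvote the distant class-1 point
X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(weighted_knn_predict(np.array([0.5, 0.5]), X, y, k=3))  # 0
```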
Variant 2: Radius Neighbors
Uses all points within a given radius instead of the K nearest points. In scikit-learn, RadiusNeighborsClassifier implements this variant. When the sampling density is uneven, this variant can achieve better performance.
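A minimal numpy sketch of the radius idea (the function name and the fallback argument are illustrative; RadiusNeighborsClassifier handles the empty-neighborhood case with its outlier_label option):

```python
import numpy as np
from collections import Counter

def radius_predict(new_point, X, y, radius=1.0, fallback=None):
    """Vote among all training points within `radius` of new_point.
    Returns `fallback` when no training point lies inside the radius."""
    dist = np.sqrt(((X - new_point) ** 2).sum(axis=1))
    inside = y[dist <= radius]
    if len(inside) == 0:
        return fallback  # no neighbors found: caller decides what to do
    return Counter(inside).most_common(1)[0][0]

# toy usage: two class-0 points fall inside the radius, the class-1 point does not
X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(radius_predict(np.array([0.0, 0.2]), X, y, radius=1.0))    # 0
print(radius_predict(np.array([10.0, 10.0]), X, y, radius=1.0))  # None
```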
Code Implementation
Import related packages
# All lines can output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Solve minus sign garble on axis
plt.rcParams['axes.unicode_minus'] = False
# Solve Chinese garble
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.style.use('ggplot')
Build pre-classified original dataset
rowdata = {
'Color Depth': [14.13,13.2,13.16,14.27,13.24,12.07,12.43,11.79,12.37,12.04],
'Alcohol': [5.64,4.28,5.68,4.80,4.22,2.76,3.94,3.1,2.12,2.6],
'Variety': [0,0,0,0,0,1,1,1,1,1]
}
# 0 represents "Pinot Noir", 1 represents "Cabernet Sauvignon"
wine_data = pd.DataFrame(rowdata)
Data Exploration & Visualization
X = np.array(wine_data.iloc[:,0:2]) # We put features (wine attributes) in X
y = np.array(wine_data.iloc[:,-1]) # Put labels (wine categories) in y
# Explore data, if we give new data [12.03,4.1], can you guess what category this wine is?
new_data = np.array([12.03,4.1])
plt.scatter(X[y==1,0], X[y==1,1], color='red', label='Cabernet Sauvignon')
plt.scatter(X[y==0,0], X[y==0,1], color='purple', label='Pinot Noir')
plt.scatter(new_data[0],new_data[1], color='yellow')
plt.xlabel('Color Depth')
plt.ylabel('Alcohol')
plt.legend(loc='lower right')
plt.savefig('Wine Samples.png')
Calculate distances between points in known category dataset and current point
We use Euclidean distance formula, calculate distance between new data point new_data and each point in existing X dataset:
from math import sqrt
distance = [sqrt(np.sum((x-new_data)**2)) for x in X]
distance
Execution result:
[2.6041505332833594,
1.1837651794169315,
1.9424983912477256,
2.3468276459936295,
1.2159358535712326,
1.3405968819895113,
0.4308131845707605,
1.0283968105745949,
2.0089798406156287,
1.500033332962971]
Sort distances ascending, select K points with smallest distance
sort_dist = np.argsort(distance)
sort_dist
Determine count of categories for first K points
k = 3
topK = [y[i] for i in sort_dist[:k]]
topK
# Voting statistics
pd.Series(topK).value_counts().index[0]
Function Encapsulation
def KNN(new_data, dataSet, k):
    '''
    Function: KNN classifier
    Parameters:
        new_data: sample to predict a class for
        dataSet: dataset with known classification labels
        k: K-nearest neighbors parameter, select the K points with smallest distance
    Return:
        result: classification result
    '''
    from math import sqrt
    import numpy as np
    import pandas as pd
    result = []
    # distance from new_data to every sample (first two columns are the features)
    distance = [sqrt(np.sum((x - new_data)**2)) for x in np.array(dataSet.iloc[:, 0:2])]
    sort_dist = np.argsort(distance)
    # labels of the K nearest points
    topK = [dataSet.iloc[:, -1][i] for i in sort_dist[:k]]
    # majority vote: the most frequent label wins
    result.append(pd.Series(topK).value_counts().index[0])
    return result
# Test function
new_data = np.array([12.03,4.1])
k = 3
KNN(new_data,wine_data,k)
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Chinese/minus sign garble | Font not installed or font name mismatch | Install/specify available Chinese font (like SimHei/Microsoft YaHei); keep plt.rcParams['axes.unicode_minus']=False |
| Scatter plot axis meanings don’t match | X column order inconsistent with xlabel/ylabel semantics | Unify “X’s 0/1 column” with axis labels |
| KNN results unstable/biased toward certain class | K too small/large; class imbalance; distance dominated by scale | Do feature scaling; use cross-validation to select K; if needed use weighted KNN |
| High-dimensional data effect significantly worse | Distance concentration (curse of dimensionality), Euclidean distance differentiation declines | Dimensionality reduction/feature selection; change distance metric |
| Radius neighbors can’t find neighbors | Radius setting unreasonable, uneven sampling | Adjust radius; set fallback strategy |