Big Data 195 - KNN/K-Nearest Neighbors Algorithm Practice
TL;DR
- Scenario: supervised classification on small, low-dimensional datasets, where "similar samples vote" gives quick results
- Conclusion: the core of KNN is a distance metric plus K-neighbor voting; feature scale determines distance reliability, and K trades off bias against variance
- Output: a set of reproducible experiments (wine binary classification), the distance calculation/sorting/voting flow, and a KNN function encapsulation
Supervised Learning Algorithm
KNN/K-Nearest Neighbors Algorithm
The core idea of K-Nearest Neighbors (KNN) is that similarity can be judged by the distance between samples. Specifically, it computes the feature-space distance between the sample to be classified and every sample in the training set (common distance metrics include Euclidean distance, Manhattan distance, and cosine similarity). If two samples are close enough in feature space, they are considered highly similar and likely to belong to the same category.
In practice, relying on a single nearest neighbor is easily swayed by noise and outliers, making classification unstable. The KNN algorithm therefore selects the K nearest samples (the K nearest neighbors), which form the local neighborhood of the sample to be classified. The algorithm counts the category labels among these neighbors (a label is a sample's true category, such as "cat" or "dog") and then applies a voting mechanism: the category that appears most often among the K neighbors becomes the prediction for the sample to be classified.
For example, in image classification with K=5, if the 5 nearest neighbors of the image to be classified contain 3 "cat" images and 2 "dog" images, the algorithm classifies the image as "cat". This local-neighborhood voting gives KNN good robustness to noisy data, while the choice of K (usually determined through cross-validation) directly affects classification performance.
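The voting step in the example above can be sketched in a few lines with `collections.Counter` (the neighbor labels here are just the hypothetical cat/dog example):

```python
from collections import Counter

# labels of the K=5 nearest neighbors from the example above
neighbors = ["cat", "cat", "dog", "cat", "dog"]

# the most frequent label among the neighbors wins the vote
prediction = Counter(neighbors).most_common(1)[0][0]
print(prediction)  # cat
```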
Implementation Process
Assume X_test is the sample to be labeled, X_train is the labeled dataset.
- Iterate over all samples in the labeled dataset, compute the distance from each sample to the point to be labeled, and store the distances in a Distance array
- Sort the Distance array and take the K nearest points, recorded as X_knn
- Count each category in X_knn: how many class0 samples, how many class1 samples
- The category of the sample to be labeled is the majority category in X_knn
Distance Determination
The algorithm's "distance" on a two-dimensional plane is simply the distance between two points. There are many formulas for calculating distance; the most common is the Euclidean distance.
Euclidean distance formula: $$d(A, B) = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$$
When there are many features, the samples live in an N-dimensional space. For two points A and B with coordinates $(a_1, \dots, a_n)$ and $(b_1, \dots, b_n)$: $$d(A, B) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}$$
In machine learning, the coordinate axes $x_1, x_2, x_3, \dots$ correspond exactly to the N features of our samples.
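As a quick numeric check, here is the N-dimensional formula applied to two of the wine samples used later in this article (features: Color Depth, Alcohol):

```python
import numpy as np

a = np.array([14.13, 5.64])  # sample A: (Color Depth, Alcohol)
b = np.array([12.07, 2.76])  # sample B
d = np.sqrt(np.sum((a - b) ** 2))
print(d)  # same result as np.linalg.norm(a - b)
```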
K Value Selection
- K value mechanism:
- When k=1, model only considers one nearest sample point, making decision boundary very complex
- As K value increases, model considers more neighbors’ voting results, making decision boundary tend toward smooth
- K value and model bias relationship:
- Larger K values (e.g., k=15) increase model bias because decisions are based on average of more samples
- This makes model less sensitive to individual noisy data points (like mislabeled samples)
- In extreme case, when K approaches training set size, model will simply predict the majority class
- K value and model variance relationship:
- Smaller K values (e.g., k=1 or 3) increase model variance
- Model easily captures random fluctuations and noise in training data
- Parameter selection practical experience:
- Usually start trying from k=5, this is an empirical rule
- Can find optimal K through cross-validation, common method is plotting K value vs accuracy curve
- In sklearn, can use GridSearchCV for K value tuning
- K value selection in different scenarios:
- For noisy datasets (like sensor data), recommend larger K values (7-15)
- For clearly separable data (like MNIST handwritten digits), smaller K values (3-5) may be more suitable
- When class distribution is imbalanced, K value shouldn’t be too small, otherwise easily affected by minority class samples
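The "K vs. accuracy curve" idea can be sketched without sklearn by running leave-one-out evaluation over the ten wine samples used later in this article. This is a minimal illustration of the procedure, not a tuning recipe; with only ten samples the curve is very coarse:

```python
import numpy as np

# The ten wine samples from the dataset below: (Color Depth, Alcohol), labels 0/1
X = np.array([[14.13, 5.64], [13.2, 4.28], [13.16, 5.68], [14.27, 4.80], [13.24, 4.22],
              [12.07, 2.76], [12.43, 3.94], [11.79, 3.1], [12.37, 2.12], [12.04, 2.6]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def loo_accuracy(k):
    """Leave-one-out accuracy for a given K:
    classify each sample using its K nearest neighbors among the rest."""
    correct = 0
    for i in range(len(X)):
        rest_X = np.delete(X, i, axis=0)
        rest_y = np.delete(y, i)
        dist = np.sqrt(((rest_X - X[i]) ** 2).sum(axis=1))
        neighbors = rest_y[np.argsort(dist)[:k]]
        pred = np.bincount(neighbors).argmax()   # majority vote
        correct += int(pred == y[i])
    return correct / len(X)

# plotting accuracy against K reveals which K generalizes best
for k in (1, 3, 5, 7):
    print(k, loo_accuracy(k))
```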
Algorithm Variants
Variant 1: Weighted KNN
By default, all K neighbors carry equal weight in the vote, but you can assign different weights to different neighbors, e.g., giving closer neighbors higher weight. In scikit-learn this is done via the classifier's weights parameter (weights='distance').
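A hand-rolled sketch of the idea (the function name and the 1/distance weighting scheme here are illustrative choices; sklearn's weights='distance' behaves similarly):

```python
import numpy as np

def weighted_knn_predict(new_point, X, y, k=3):
    """Distance-weighted KNN: each of the K nearest neighbors
    votes with weight 1/distance, so closer points count more."""
    dist = np.sqrt(((X - new_point) ** 2).sum(axis=1))
    idx = np.argsort(dist)[:k]
    weights = 1.0 / (dist[idx] + 1e-12)  # epsilon avoids division by zero on exact matches
    votes = {}
    for label, w in zip(y[idx], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# toy usage: two nearby class-0 points outvote the distant class-1 point
X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(weighted_knn_predict(np.array([0.5, 0.5]), X, y, k=3))  # 0
```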
Variant 2: Radius Neighbors
Uses all points within a given radius instead of the K nearest points. In scikit-learn, RadiusNeighborsClassifier implements this variant. When the sampling density is uneven, this variant can achieve better performance.
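A minimal numpy sketch of the radius idea (the function name and the fallback argument are illustrative; RadiusNeighborsClassifier handles the empty-neighborhood case with its outlier_label option):

```python
import numpy as np
from collections import Counter

def radius_predict(new_point, X, y, radius=1.0, fallback=None):
    """Vote among all training points within `radius` of new_point.
    Returns `fallback` when no training point lies inside the radius."""
    dist = np.sqrt(((X - new_point) ** 2).sum(axis=1))
    inside = y[dist <= radius]
    if len(inside) == 0:
        return fallback  # no neighbors found: caller decides what to do
    return Counter(inside).most_common(1)[0][0]

# toy usage: two class-0 points fall inside the radius, the class-1 point does not
X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(radius_predict(np.array([0.0, 0.2]), X, y, radius=1.0))    # 0
print(radius_predict(np.array([10.0, 10.0]), X, y, radius=1.0))  # None
```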
Code Implementation
Import related packages
# All lines can output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Solve minus sign garble on axis
plt.rcParams['axes.unicode_minus'] = False
# Solve Chinese garble
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.style.use('ggplot')
Build pre-classified original dataset
rowdata = {
'Color Depth': [14.13,13.2,13.16,14.27,13.24,12.07,12.43,11.79,12.37,12.04],
'Alcohol': [5.64,4.28,5.68,4.80,4.22,2.76,3.94,3.1,2.12,2.6],
'Variety': [0,0,0,0,0,1,1,1,1,1]
}
# 0 represents "Pinot Noir", 1 represents "Cabernet Sauvignon"
wine_data = pd.DataFrame(rowdata)
Data Exploration & Visualization
X = np.array(wine_data.iloc[:,0:2]) # We put features (wine attributes) in X
y = np.array(wine_data.iloc[:,-1]) # Put labels (wine categories) in y
# Explore data, if we give new data [12.03,4.1], can you guess what category this wine is?
new_data = np.array([12.03,4.1])
plt.scatter(X[y==1,0], X[y==1,1], color='red', label='Cabernet Sauvignon')
plt.scatter(X[y==0,0], X[y==0,1], color='purple', label='Pinot Noir')
plt.scatter(new_data[0],new_data[1], color='yellow')
plt.xlabel('Color Depth')
plt.ylabel('Alcohol')
plt.legend(loc='lower right')
plt.savefig('Wine Samples.png')
Calculate distances between points in known category dataset and current point
We use Euclidean distance formula, calculate distance between new data point new_data and each point in existing X dataset:
from math import sqrt
distance = [sqrt(np.sum((x-new_data)**2)) for x in X]
distance
Execution result:
[2.6041505332833594,
1.1837651794169315,
1.9424983912477256,
2.3468276459936295,
1.2159358535712326,
1.3405968819895113,
0.4308131845707605,
1.0283968105745949,
2.0089798406156287,
1.500033332962971]
Sort distances ascending, select K points with smallest distance
sort_dist = np.argsort(distance)
sort_dist
Determine count of categories for first K points
k = 3
topK = [y[i] for i in sort_dist[:k]]
topK
# Voting statistics
pd.Series(topK).value_counts().index[0]
Function Encapsulation
def KNN(new_data, dataSet, k):
    '''
    Function: KNN classifier
    Parameters:
        new_data: sample to predict a class for
        dataSet: dataset with known classification labels
        k: K-nearest neighbors parameter, select the K points with smallest distance
    Return:
        result: classification result
    '''
    from math import sqrt
    import numpy as np
    import pandas as pd
    result = []
    # distance from new_data to every sample (first two columns are the features)
    distance = [sqrt(np.sum((x - new_data)**2)) for x in np.array(dataSet.iloc[:, 0:2])]
    sort_dist = np.argsort(distance)
    # labels of the K nearest points
    topK = [dataSet.iloc[:, -1][i] for i in sort_dist[:k]]
    # majority vote: the most frequent label wins
    result.append(pd.Series(topK).value_counts().index[0])
    return result
# Test function
new_data = np.array([12.03,4.1])
k = 3
KNN(new_data,wine_data,k)
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Chinese/minus sign garble | Font not installed or font name mismatch | Install/specify available Chinese font (like SimHei/Microsoft YaHei); keep plt.rcParams['axes.unicode_minus']=False |
| Scatter plot axis meanings don’t match | X column order inconsistent with xlabel/ylabel semantics | Unify “X’s 0/1 column” with axis labels |
| KNN results unstable/biased toward certain class | K too small/large; class imbalance; distance dominated by scale | Do feature scaling; use cross-validation to select K; if needed use weighted KNN |
| High-dimensional data effect significantly worse | Distance concentration (curse of dimensionality), Euclidean distance differentiation declines | Dimensionality reduction/feature selection; change distance metric |
| Radius neighbors can’t find neighbors | Radius setting unreasonable, uneven sampling | Adjust radius; set fallback strategy |