Big Data 196 - scikit-learn KNN Practice

scikit-learn Library Implementation

Since David Cournapeau started it in 2007, scikit-learn (sklearn) has grown into one of the most important machine learning libraries in the Python ecosystem. Built on top of NumPy and SciPy, sklearn provides a powerful and efficient toolset for data scientists and machine learning engineers.

Functionally, scikit-learn covers four core machine learning algorithm areas:

  1. Classification Algorithms: including Logistic Regression, Support Vector Machines (SVC), and Random Forests (RandomForestClassifier); suited to scenarios such as customer churn prediction and spam email identification
  2. Regression Algorithms: such as Linear Regression and Ridge Regression; used for tasks such as house price prediction and sales forecasting
  3. Dimensionality Reduction: Principal Component Analysis (PCA), t-SNE, and related algorithms; helpful for visualizing high-dimensional data
  4. Clustering Algorithms: K-means and DBSCAN; suited to applications such as user segmentation and anomaly detection

Additionally, scikit-learn provides three key functional modules:

  • Feature Extraction: Including text feature vectorization (CountVectorizer), TF-IDF transformation, etc.
  • Data Preprocessing: Standardization (StandardScaler), Normalization (MinMaxScaler), Missing value handling, etc.
  • Model Evaluation: Cross-validation (cross_val_score), Multiple evaluation metrics (accuracy_score, f1_score, etc.)

Design Principles

Consistency All objects share a simple, consistent interface:

  • Estimators: any object that estimates parameters from data via its fit() method. fit() takes a dataset as its argument (corresponding to X; supervised algorithms also need y). Any other parameter that guides the estimation process is a hyperparameter and must be set as an instance variable (typically via the constructor).
  • Transformers: estimators that can also transform a dataset via transform(). The transformation relies on the learned parameters. The convenience method fit_transform() is equivalent to fit() followed by transform(), but is sometimes optimized to run faster.
  • Predictors: estimators that can make predictions on new data via predict(), which returns the predicted results. Predictors also provide a score() method that measures prediction quality on a given test set (R² for continuous y, accuracy for classification y).
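The three roles can be sketched with a transformer plus a predictor; the tiny dataset below is invented purely for illustration:

```python
# A minimal sketch of the interface roles: StandardScaler acts as an
# estimator + transformer, KNeighborsClassifier as an estimator + predictor.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 40.0]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # fit() then transform() in one call
print(scaler.mean_)                  # learned parameter: underscore suffix

clf = KNeighborsClassifier(n_neighbors=3)  # n_neighbors is a hyperparameter
clf.fit(X_scaled, y)                                  # estimator
print(clf.predict(scaler.transform([[8.5, 30.0]])))   # predictor
print(clf.score(X_scaled, y))                         # accuracy on given data
```

Note the naming convention in action: n_neighbors is passed to the constructor, while the learned mean_ is exposed with a trailing underscore.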

Inspection All parameters can be checked: every estimator hyperparameter is accessible through a public instance variable, and every learned parameter is accessible through a public instance variable with an underscore suffix.

Prevention of Class Proliferation Object types are kept fixed: datasets are represented as NumPy arrays or SciPy sparse matrices, and hyperparameters are ordinary Python strings or numbers rather than custom classes.

Composition Existing building blocks are reused as much as possible, so a pipeline can easily be assembled from them.

Sensible Defaults Most parameters have reasonable default values, so a basic working system can be set up quickly.
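The last two principles can be seen together in a few lines: a working pipeline with no hyperparameters set explicitly (sketched here with the built-in iris data, not this lesson's wine data):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())  # all defaults
pipe.fit(X, y)            # fit() runs the scaler, then fits the classifier
print(pipe.score(X, y))   # training accuracy, just to show the pipeline works
```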

Case 1: Wine

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Illustrative sample data (made up for this sketch): the course's wine_data
# is a DataFrame with two feature columns and a label column, where
# 0 represents "Pinot Noir" and 1 represents "Cabernet Sauvignon"
wine_data = pd.DataFrame({
    "alcohol": [12.0, 12.5, 13.9, 14.2, 12.3, 13.8],
    "acidity": [4.5, 3.8, 1.9, 2.1, 4.0, 2.3],
    "label":   [0, 0, 1, 1, 0, 1],
})

clf = KNeighborsClassifier(n_neighbors=3)
clf = clf.fit(wine_data.iloc[:, 0:2], wine_data.iloc[:, -1])
result = clf.predict([[12.8, 4.1]])  # Returns predicted label
print(f"result: {result}")

# Evaluate the model: the score interface returns prediction accuracy
# (scoring a single point is only a demo; real evaluation needs a test set)
score = clf.score([[12.8, 4.1]], [0])
print(f"score: {score}")

# Per-class probability estimates (fraction of the 3 neighbors in each class)
print(clf.predict_proba([[12.8, 4.1]]))

Case 2: Breast Cancer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
# Load dataset
data = load_breast_cancer()
# Display as DataFrame format
X = data.data
y = data.target
name = ['Mean Radius','Mean Texture','Mean Perimeter','Mean Area',
'Mean Smoothness','Mean Compactness','Mean Concavity',
'Mean Concave Points','Mean Symmetry','Mean Fractal Dimension',
'Radius Error','Texture Error','Perimeter Error','Area Error',
'Smoothness Error','Compactness Error','Concavity Error',
'Concave Points Error','Symmetry Error',
'Fractal Dimension Error','Worst Radius','Worst Texture',
'Worst Perimeter','Worst Area','Worst Smoothness',
'Worst Compactness','Worst Concavity','Worst Concave Points',
'Worst Symmetry','Worst Fractal Dimension','Diagnosis']
data=np.concatenate((X,y.reshape(-1,1)),axis=1)
table=pd.DataFrame(data=data,columns=name)
table.head()

# Split into training and test sets; 20% of the data is held out as the test set
Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,y,test_size=0.2,random_state=420)
# Build model & evaluate model
clf = KNeighborsClassifier(n_neighbors=4)
# Build classifier
clf = clf.fit(Xtrain,Ytrain)
score = clf.score(Xtest,Ytest)
score

How can we use the fitted classifier above to find the 4 points closest to row 20 and row 30 of Xtest?

# Find the K nearest neighbors of the given points. Returns, for each point,
# the neighbor distances and their index values.
clf.kneighbors(Xtest[[20,30],:],return_distance=True)
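To make the return value concrete, here is a self-contained sketch that repeats the split and fit from above and then unpacks kneighbors(); the returned indexes refer to rows of Xtrain, not Xtest:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.2,
                                                random_state=420)
clf = KNeighborsClassifier(n_neighbors=4).fit(Xtrain, Ytrain)

distances, indices = clf.kneighbors(Xtest[[20, 30], :], return_distance=True)
print(distances.shape)   # (2, 4): 4 neighbor distances per queried point
print(Ytrain[indices])   # labels of each query point's 4 nearest neighbors
```

Because the indexes are positions in the training set, they can be used directly to look up the neighbors' rows in Xtrain or their labels in Ytrain.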

Selecting Optimal K Value

K is a hyperparameter of KNN. A so-called "hyperparameter" is a parameter that must be supplied manually; the algorithm cannot compute it from the data. In KNN, K is the number of training samples nearest to the test point X that vote on its label. If K is not supplied, the essential step of selecting the K nearest neighbors cannot be carried out.

From KNN's principle, it is clear that choosing an appropriate K value has a great impact on the algorithm's performance.

If the selected K value is small, prediction is effectively made from training instances in a small neighborhood. Only training instances close to the input instance affect the result, so the prediction becomes very sensitive to the nearest instance points: if those nearest points happen to be noise, the prediction will be wrong.

Conversely, if the selected K value is large, prediction is effectively made from training instances in a large neighborhood. Training instances far from (and dissimilar to) the input instance then also influence the prediction, causing prediction errors. Selecting the K hyperparameter is therefore KNN's foremost issue.

Learning Curve

How do we select an optimal K value? Here we need a standard machine learning tool: the parameter learning curve. A parameter learning curve plots different parameter values on the x-axis against the model's results under each value on the y-axis; we usually pick the parameter value at which the model performs best.

# Change different n_neighbors parameter values, observe result changes
clf = KNeighborsClassifier(n_neighbors=7)
clf = clf.fit(Xtrain,Ytrain)
score = clf.score(Xtest,Ytest)
score

Draw learning curve:

import matplotlib.pyplot as plt

score = []
krange = range(1,20)
for i in krange:
    clf = KNeighborsClassifier(n_neighbors=i)
    clf = clf.fit(Xtrain,Ytrain)
    score.append(clf.score(Xtest,Ytest))
plt.plot(krange,score)
plt.show()

Determine best K value:

score.index(max(score))+1

But now a problem arises: if the random train/test split changes, the K value with the highest score also changes.

Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,y,test_size=0.2,random_state=421)
score = []
krange = range(1,20)
for i in krange:
    clf = KNeighborsClassifier(n_neighbors=i)
    clf = clf.fit(Xtrain,Ytrain)
    score.append(clf.score(Xtest,Ytest))
plt.plot(krange,score)
plt.show()

For which K is the score highest this time?

score.index(max(score))+1

At this point we cannot settle on an optimal K value, and so cannot proceed with the modeling work that follows. What should we do?
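The excerpt ends on this question. One standard answer, sketched here under the assumption that cross-validation is the intended next step, is to score each K by averaging over several splits instead of trusting a single one; GridSearchCV with StratifiedKFold automates exactly that:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=420)
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": range(1, 20)}, cv=cv)
search.fit(X, y)
print(search.best_params_)           # K chosen by mean CV accuracy
print(round(search.best_score_, 3))  # mean accuracy across the 5 folds
```

Because each candidate K is scored as a mean over five folds, the chosen K no longer swings with a single random_state the way the two learning curves above do.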


Error Quick Reference

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Learning curve code reports IndentationError at "for i in krange:" | for loop body not indented | Indent the loop body; make sure the clf = …, fit, and append lines are all inside the loop |
| NameError: name 'wine_data' is not defined, pointing to wine_data.iloc… | data loading and DataFrame construction missing before the code | Add data loading and DataFrame construction first; or switch to a sklearn built-in dataset / CSV-reading example |
| NameError: name 'plt' is not defined, pointing to plt.plot(…) | matplotlib never imported | Add import matplotlib.pyplot as plt |
| Score looks "very high/low" but is not credible (e.g. clf.score([[12.8,4.1]],[0])) | evaluation on a single sample or on the training data | Evaluate on an independent test set (Xtest/Ytest) with at least hundreds of samples, or use a cross-validation mean |
| "Best K" changes with random_state (random_state=420 vs 421 give different peak K) | the peak of a single split is noisy | Use StratifiedKFold + GridSearchCV; report mean ± variance/confidence interval, not a single peak |
| KNN performs abnormally; only very large/small K works | features on very different scales | Use Pipeline(StandardScaler(), KNN); if needed, try different distance metrics |
| kneighbors result hard to interpret, or indexes don't match the original table | Xtest is an ndarray, and the returned indexes are training-set indexes (relative to Xtrain) | kneighbors(Xtest[[20,30]], …) returns Xtrain neighbor indexes; use them to look up Xtrain/Ytrain |
| Text comment inconsistent with code | comment says "30% of the data as the test set", but test_size=0.2 actually gives a 20% test set | Follow the code; unify the comment (train/test ratio) to avoid misleading readers |
| KNeighborsClassifier slow / high memory | with large samples, KNN prediction is a neighbor search with high complexity | Reduce dimensionality (PCA), use approximate neighbors (external library), reduce features/samples, or switch model (linear/tree model) |
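The scaling fix in the table can be illustrated with a quick comparison (a sketch using the same breast-cancer data; exact scores depend on the sklearn version):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Mean 5-fold accuracy without scaling, then with StandardScaler in a Pipeline
raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
piped = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5).mean()
print(round(raw, 3), round(piped, 3))  # scaling typically lifts KNN accuracy
```

Putting the scaler inside the Pipeline also means it is fit only on each training fold, avoiding leakage from the test fold.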