Big Data 198 - KNN Must Normalize First

TL;DR

  • Scenario: distance-based models such as KNN are distorted by features on inconsistent scales, and normalizing in the wrong order easily introduces data leakage
  • Conclusion: split the dataset first; fit the normalization parameters on the training set only; apply only transform to the test set; use a Pipeline to keep cross-validation leak-free
  • Output: a reusable MinMaxScaler + KNN (with distance weighting) workflow and a troubleshooting checklist

Normalization

Distance-based Model Normalization Requirements

Put X into a DataFrame and look at it: do the feature means differ greatly? Some features take large values, others small; in machine learning this is called “inconsistent scales”. KNN is a distance-based model, and the Euclidean distance sums the squared differences over all features.

If some feature Xi takes very large values that the other features cannot match, the distance is largely determined by Xi, and the remaining features contribute little to d(A, B). This greatly degrades the performance of distance-based models such as KNN.
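To see this concretely, here is a small sketch (the feature values are made up) showing how a large-scale feature dominates the Euclidean distance:

```python
import numpy as np

# Two samples with features on very different scales:
# feature 0 on a ~10^4 scale (say, income), feature 1 on a ~10 scale (say, age)
A = np.array([30000.0, 25.0])
B = np.array([32000.0, 60.0])

d = np.sqrt(np.sum((A - B) ** 2))                     # Euclidean distance
share_age = (A[1] - B[1]) ** 2 / np.sum((A - B) ** 2)  # age's share of d^2

print(round(d, 2))          # ~2000.31: essentially just the income gap
print(round(share_age, 4))  # ~0.0003: a 35-year age gap barely registers
```

Even though the age difference is enormous relative to its own scale, it contributes roughly 0.03% of the squared distance; this is exactly the distortion that normalization removes.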

In practice, however, most datasets contain features on different scales. Before using a KNN classifier, the dataset must therefore be normalized first, i.e., all features compressed into the same range.

When the data X is centered by the minimum and then scaled by the range (max - min), every value is mapped into [0, 1]: x' = (x - min) / (max - min). This process is called data normalization (Normalization, also known as Min-Max Scaling).

Split Dataset First, Then Normalize

Normalization Precautions in Machine Learning Data Preprocessing

Wrong Approach Analysis

Normalizing on the full dataset and then running cross-validation or drawing a learning curve has a serious problem: it causes data leakage. Specifically:

  1. Wrong statistic calculation: when the full dataset is used to compute the normalization parameters (such as the minimum and the range), those parameters already contain information from the future test set
  2. Evaluation bias: this makes the model look exceptionally good in cross-validation, but the “good” is false, because test-set information has leaked into the training process through normalization

Correct Approach in Detail

Correct data preprocessing should follow these steps:

  1. Data split: first split the complete dataset into a training set and a test set (usually a 70:30 or 80:20 ratio)
  2. Compute normalization parameters on the training set only: calculate the needed statistics (min, max, mean, std, etc.) on the training set
  3. Apply normalization:
    • use the parameters computed from the training set to normalize the training set itself
    • use the same parameters to normalize the test set (never recompute normalization parameters on the test set)
  4. Model training and evaluation: train the model and run cross-validation on the normalized data
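The four steps above can be sketched with scikit-learn; the dataset and the K value here are illustrative, and a Pipeline handles steps 2 and 3 automatically inside each cross-validation fold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split first
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.2, random_state=420)

# Steps 2-3: the scaler lives inside the pipeline, so it is fitted
# on each fold's training part only
pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=5))])

# Step 4: cross-validation on the training set
scores = cross_val_score(pipe, Xtrain, Ytrain, cv=5)
print(scores.mean())
```

Because the scaler sits inside the Pipeline, cross_val_score refits it on each fold's training part only, so the fold's validation part never leaks into the normalization parameters.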

Importance in Real Business Scenarios

This discipline is especially important in real business settings because:

  1. It simulates the real scenario: we can only build the model from historical data (the training set); the distribution of future data (the test set) is unknown in advance
  2. It avoids over-optimism: the wrong normalization procedure makes the model evaluation too optimistic, possibly selecting a model that performs poorly in actual use
  3. Pipeline consistency: in production, new data must be processed with the normalization parameters determined during the training stage
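A minimal sketch of point 3, assuming joblib for persistence (the file name and toy values are illustrative): the scaler fitted at training time is saved and reused unchanged on new data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [50.0], [90.0]])   # training data: min=10, range=80
scaler = MinMaxScaler().fit(X_train)

path = os.path.join(tempfile.gettempdir(), "scaler.joblib")
joblib.dump(scaler, path)                      # saved at training time

loaded = joblib.load(path)                     # loaded in production
X_new = np.array([[5.0], [95.0]])              # new data outside the training range
print(loaded.transform(X_new))                 # [[-0.0625], [1.0625]]
```

Note that the production transform happily produces values outside [0, 1]; that is expected, and the parameters are never refitted on the incoming data.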

Example Explanation

Assume we have a dataset with 1000 samples:

  1. First split it into 700 training samples and 300 test samples
  2. On the 700 training samples, compute minimum 10 and maximum 90 (range 80)
  3. Use these parameters:
    • training-set normalization: (value - 10) / 80
    • test-set normalization: (value - 10) / 80 (even if the test set contains values such as 5 or 95)
  4. Only then do the evaluation results truly reflect model performance on unseen data
import numpy as np
import pandas as pd

# Min-max scaling by hand: each column is mapped into [0, 1]
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
data = pd.DataFrame(data)
(data - np.min(data, axis=0)) / (np.max(data, axis=0) - np.min(data, axis=0))
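The same arithmetic reproduces the 1000-sample example above, including test values that fall outside the training range:

```python
import numpy as np

train_min, train_range = 10.0, 80.0        # computed on the 700 training samples only
test_values = np.array([5.0, 50.0, 95.0])  # test samples may fall outside [10, 90]

scaled = (test_values - train_min) / train_range
print(scaled)  # -0.0625, 0.5, 1.0625
```

Values below 0 or above 1 on the test set are expected and are not an error; the training-time parameters must not be adjusted to remove them.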

Implementation with sklearn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score as CVS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler as mms

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.2, random_state=420)

# Normalization: fit the scaler on the training set ONLY
scaler = mms().fit(Xtrain)

# Transform both sets with the training-set parameters
X_train = scaler.transform(Xtrain)
X_test = scaler.transform(Xtest)

krange = range(1, 20)
score = []
var = []
for i in krange:
    clf = KNeighborsClassifier(n_neighbors=i)
    # Mean and variance of the cross-validation scores for this K
    cvresult = CVS(clf, X_train, Ytrain, cv=5)
    score.append(cvresult.mean())
    var.append(cvresult.var())

plt.plot(krange, score, color="k")
plt.plot(krange, np.array(score) + np.array(var) * 2, c="red", linestyle="--")
plt.plot(krange, np.array(score) - np.array(var) * 2, c="red", linestyle="--")
plt.show()

Distance Penalty

Distance-based weighting of the nearest neighbors is an important optimization when classifying unknown samples. Traditional KNN uses a simple “one point, one vote” mechanism: after selecting the K nearest neighbors, count their category distribution, with every neighbor having the same influence on the classification result.

However, this simple voting mechanism has an obvious flaw: even among the K nearest points, the distances to the target sample vary significantly. By KNN's basic assumption that similar samples have similar categories, a neighbor closer to the target sample is more likely to belong to the same category, so closer neighbors should carry a larger voting weight than farther ones.

Common weighting methods:

  1. Inverse distance weighting: weight = 1 / (distance + ε)
  2. Gaussian weighting: weight = exp(-distance² / σ²)
  3. Linear weighting: weight = (max_distance - distance) / (max_distance - min_distance)

This distance-weighted correction better reflects the actual local sample distribution and improves classification accuracy, especially when the sample distribution is uneven or noisy.
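The three schemes can be sketched directly; the ε and σ values below are illustrative choices, not fixed by any standard:

```python
import numpy as np

dist = np.array([0.5, 1.0, 2.0, 4.0])  # distances of K=4 neighbors
eps, sigma = 1e-8, 2.0                 # illustrative hyperparameters

inverse = 1.0 / (dist + eps)                               # inverse distance
gaussian = np.exp(-dist**2 / sigma**2)                     # Gaussian
linear = (dist.max() - dist) / (dist.max() - dist.min())   # linear

# In every scheme, the closest neighbor gets the largest vote:
for w in (inverse, gaussian, linear):
    print(np.argmax(w))  # 0 each time
```

Whatever the scheme, the ranking is the same: the closest neighbor always receives the largest weight, which is the property the correction relies on.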

In sklearn, the weights parameter controls whether distance is used as a penalty factor:

krange = range(1, 20)
score = []  # reset before this experiment
var = []
for i in krange:
    clf = KNeighborsClassifier(n_neighbors=i, weights='distance')
    # Mean and variance of the cross-validation scores for this K
    cvresult = CVS(clf, X_train, Ytrain, cv=5)
    score.append(cvresult.mean())
    var.append(cvresult.var())

plt.plot(krange, score, color="k")
plt.plot(krange, np.array(score) + np.array(var) * 2, c="red", linestyle="--")
plt.plot(krange, np.array(score) - np.array(var) * 2, c="red", linestyle="--")
plt.show()

Error Quick Reference

| Symptom | Root Cause | Diagnosis | Fix |
| --- | --- | --- | --- |
| CV score abnormally high, drops sharply in production | Normalization parameters fitted on the full X, causing data leakage | Check whether the scaler was fitted before the split/CV | Split first; fit only on the training set; run CV with a Pipeline (scaler + model) |
| Test performance unstable; large fluctuations across random_state values | Scaler fitted separately on the test set (training/test scaling standards inconsistent) | Look for a call such as mms().fit(Xtest) | The test set only uses the training-set scaler's transform; never fit on Xtest again |
| Training/test mapped to different [0, 1] intervals | Two MinMaxScalers fitted separately | Two fitted scalers (e.g. MMS_01 and MMS_02) both exist in the code | Keep a single scaler: fit(Xtrain), then transform both sets |
| Optimal-K conclusion inconsistent across the text (7/8/6 mixed) | Text and code n_neighbors/index calculations not kept in sync | The text reports “final value is 7 / best K is 8” while the model uses n_neighbors=6 | Unify the reported values: reproduce the final conclusion from the same score list and the same n_neighbors |
| Plot or loop error (undefined variable / length mismatch) | krange undefined or inconsistent with the score length; score/var reused without clearing | NameError or a plot-dimension error | Define krange = range(1, 20); reset score = [] and var = [] before each experiment |
| Weighted KNN looks improved but is not reproducible | Weighting strategy sensitive to outliers/noise; evaluation protocol not fixed | Results vary greatly across different splits and CV runs | Use a fixed evaluation protocol: Pipeline + StratifiedKFold; report mean and variance |
| Scaled values < 0 or > 1 on new data misjudged as errors | Test/production data exceeding the training-set min/max is normal | Values below 0 or above 1 appear after transform | Keep the training-time parameters unchanged; if needed, apply an outlier strategy (clipping / robust scaling) and evaluate it on a validation set |