Big Data 198 - KNN Must Normalize First
TL;DR
- Scenario: Distance-based models such as KNN are distorted by features with inconsistent scales, and normalizing in the wrong order easily introduces data leakage
- Conclusion: Split the dataset first; fit normalization parameters on the training set only; the test set gets transform only; use a Pipeline for stable cross-validation
- Output: a reusable MinMaxScaler + KNN (with distance weights) engineering workflow and a troubleshooting checklist
Normalization
Distance-based Model Normalization Requirements
Put X into a DataFrame and look at it: do you see that the feature means differ greatly? Some features take large values, others small ones; this phenomenon is called inconsistent scale. KNN is a distance-based model, and the Euclidean distance sums squared differences over all features.
If some feature Xi takes very large values, the other features cannot compare with it, and the distance is largely determined by Xi; the remaining features contribute little to d(A, B). This greatly reduces the effectiveness of distance-based models such as KNN.
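A quick numeric sketch of this dominance effect (the two features and their values are made up for illustration):

```python
import numpy as np

# Two samples: feature 0 on a large scale (e.g. income), feature 1 on a small scale (e.g. age)
A = np.array([50000.0, 25.0])
B = np.array([52000.0, 60.0])

# Euclidean distance: the squared income gap swamps everything else
d = np.sqrt(np.sum((A - B) ** 2))
print(d)

# Share of the squared distance contributed by each feature:
# the large-scale feature accounts for more than 99.9% of it
contribution = (A - B) ** 2 / np.sum((A - B) ** 2)
print(contribution)
```

The 35-year age gap barely moves d(A, B) at all, which is exactly why the scales must be unified before computing distances.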
In practice, most datasets contain features on different scales, so before using a KNN classifier the data should be normalized, i.e., compressed into the same range.
When the data X is first centered by its minimum and then scaled by its range (max − min), it is shifted by its minimum and mapped into [0, 1]. This process is called data normalization (Normalization, also called Min-Max Scaling).
Split Dataset First, Then Normalize
Normalization Precautions in Machine Learning Data Preprocessing
Wrong Approach Analysis
Normalizing the full dataset and then running cross-validation or drawing a learning curve has a serious problem: it causes data leakage. Specifically:
- Wrong statistics: when the full dataset is used to compute normalization parameters (such as the minimum and range), those parameters contain information from the future test set
- Evaluation bias: the model appears to perform exceptionally well in cross-validation, but this "good" result is false, because test-set information has leaked into training through normalization
Correct Approach Detail
Correct data preprocessing should follow these steps:
- Split the data: first split the complete dataset into training and test sets (usually at a 70:30 or 80:20 ratio)
- Compute normalization parameters on the training set only: calculate the needed statistics (min, max, mean, std, etc.) on the training set
- Apply normalization:
  - Use the parameters computed from the training set to normalize the training set itself
  - Use the same parameters to normalize the test set (never refit normalization parameters on the test set)
- Train and evaluate: perform model training and cross-validation on the normalized data
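The steps above are what sklearn's Pipeline automates: during cross-validation the scaler is refit inside each training fold, so no fold's held-out data leaks into scaling. A minimal sketch (the synthetic dataset here is illustrative; substitute your own X and y):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data
X, y = make_classification(n_samples=500, n_features=10, random_state=420)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.2, random_state=420)

# Scaler + KNN chained in one estimator: cross_val_score refits the
# scaler on each fold's training portion, so no leakage occurs
pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=5))])
scores = cross_val_score(pipe, Xtrain, Ytrain, cv=5)
print(scores.mean(), scores.var())

# Final check on the held-out test set
pipe.fit(Xtrain, Ytrain)
print(pipe.score(Xtest, Ytest))
```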
Importance in Real Business Scenarios
This discipline is especially important in real business because:
- It simulates the real scenario: we can only build the model from historical data (the training set) and cannot know the distribution of future data (the test set) in advance
- It avoids over-optimism: the wrong normalization procedure makes model evaluation too optimistic, possibly selecting a model that performs poorly in actual application
- It keeps the pipeline consistent: in production, new data must be processed with the normalization parameters determined during the training stage
Example Explanation
Assume we have a dataset of 1000 samples:
- First split it into 700 training samples and 300 test samples
- On the 700 training samples, compute minimum 10 and maximum 90 (range 80)
- Apply these parameters:
  - Training set normalization: (value − 10) / 80
  - Test set normalization: (value − 10) / 80 (even if the test set contains values like 5 or 95)
- Only evaluation done this way truly reflects model performance on unknown data
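Working through the numbers of this example (the 10/90 parameters come from the example above; the helper function is a sketch):

```python
train_min, train_range = 10, 90 - 10  # fitted on the 700 training samples

def scale(v):
    """Min-max scale a value with the training-set parameters."""
    return (v - train_min) / train_range

print(scale(10))  # 0.0     -> training minimum maps to 0
print(scale(90))  # 1.0     -> training maximum maps to 1
print(scale(5))   # -0.0625 -> below the training range: negative, and that's fine
print(scale(95))  # 1.0625  -> above the training range: > 1, also expected
```

Test values falling outside [0, 1] are not an error; they simply mean the test set exceeds the range seen during training.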
Min-max normalization done by hand with pandas:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])
# Subtract the column minimum, then divide by the column range
normalized = (data - np.min(data, axis=0)) / (np.max(data, axis=0) - np.min(data, axis=0))
print(normalized)
```
Implementation with sklearn
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler as mms

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.2, random_state=420)

# Normalization: fit the scaler on the training set only
MMS = mms().fit(Xtrain)

# Transform both sets with the training-set min/max.
# Never fit a second scaler on Xtest, or the two sets
# end up scaled by different standards.
X_train = MMS.transform(Xtrain)
X_test = MMS.transform(Xtest)
```
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score as CVS

krange = range(1, 20)
score = []
var = []
for i in krange:
    clf = KNeighborsClassifier(n_neighbors=i)
    # Mean and variance of the 5-fold cross-validation scores
    cvresult = CVS(clf, X_train, Ytrain, cv=5)
    score.append(cvresult.mean())
    var.append(cvresult.var())

plt.plot(krange, score, color="k")
plt.plot(krange, np.array(score) + np.array(var) * 2, c="red", linestyle="--")
plt.plot(krange, np.array(score) - np.array(var) * 2, c="red", linestyle="--")
plt.show()
```
Distance Penalty
Correcting for the distance of the nearest neighbors is an important optimization when classifying unknown samples. Traditional KNN uses a simple "one point, one vote" mechanism: after selecting the K nearest neighbors, it counts their category distribution, and each neighbor has the same influence on the classification result.
This simple voting mechanism has an obvious flaw: even among the K nearest points, distances to the target sample vary significantly. Under KNN's basic assumption that similar samples share similar categories, neighbors closer to the target are more likely to belong to the same category, so closer neighbors should carry more voting weight than farther ones.
Common weighting methods:
- Inverse distance weighting: weight = 1 / (distance + ε)
- Gaussian weighting: weight = exp(−distance² / σ²)
- Linear weighting: weight = (max_distance − distance) / (max_distance − min_distance)
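A quick sketch of the three schemes on a toy distance array (the ε and σ values are arbitrary choices for illustration):

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0])  # distances of K=4 neighbors

eps, sigma = 1e-6, 2.0  # illustrative constants
inverse = 1 / (d + eps)
gaussian = np.exp(-d**2 / sigma**2)
linear = (d.max() - d) / (d.max() - d.min())

# All three give the closest neighbor the largest weight
print(inverse)
print(gaussian)
print(linear)  # 1.0 for the closest neighbor, 0.0 for the farthest
```

Note that linear weighting assigns the farthest of the K neighbors exactly zero weight, effectively dropping its vote.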
This distance-weighted correction better reflects the actual local sample distribution and improves classification accuracy, and is especially effective when the sample distribution is uneven or noisy.
In sklearn, the weights parameter controls whether distance is used as a penalty factor:
```python
krange = range(1, 20)
score = []  # reset before each experiment so results don't mix
var = []
for i in krange:
    clf = KNeighborsClassifier(n_neighbors=i, weights='distance')
    # Mean and variance of the 5-fold cross-validation scores
    cvresult = CVS(clf, X_train, Ytrain, cv=5)
    score.append(cvresult.mean())
    var.append(cvresult.var())

plt.plot(krange, score, color="k")
plt.plot(krange, np.array(score) + np.array(var) * 2, c="red", linestyle="--")
plt.plot(krange, np.array(score) - np.array(var) * 2, c="red", linestyle="--")
plt.show()
```
Error Quick Reference
| Symptom | Root Cause | Diagnosis Method | Fix |
|---|---|---|---|
| CV score abnormally high, drops sharply online | Scaler fitted on the full X, causing data leakage | Check whether the scaler is fit before the split/CV | Split first; fit on the training set only; run CV with a Pipeline (scaler + model) |
| Test-set performance unstable, fluctuates greatly with random_state | Test set fitted with its own scaler (train/test scaling standards inconsistent) | Code shows mms().fit(Xtest) | The test set only uses the training-set scaler's transform; never fit on Xtest |
| Train/test map to different [0,1] intervals | Two MinMaxScalers fitted separately | Both MMS_01 and MMS_02 exist and are both fitted | Keep one scaler: fit(Xtrain); transform both sets |
| Optimal-K conclusions inconsistent (7/8/6 mixed) | Text and code n_neighbors / index arithmetic out of sync | Article says "final value is 7 / best K is 8" but modeling uses n_neighbors=6 | Unify the reporting: use the same score list and the same n_neighbors to reproduce the final conclusion |
| Plot or loop error (undefined variable / length mismatch) | krange undefined or inconsistent with the length of score; score/var reused without clearing | NameError, or the plotted arrays have mismatched lengths | Define krange=range(1,20); reset score=[] and var=[] before each experiment |
| Weighted KNN looks improved but is not reproducible | Weighted strategy is sensitive to outliers/noise, and the evaluation protocol is not standardized | Results vary greatly across different splits and CV runs | Fix the evaluation protocol: Pipeline + StratifiedKFold; report mean and variance |
| New online data scales to <0 or >1 and is misjudged as an error | Test/online data exceeding the training-set min/max is a normal phenomenon | Values below 0 or above 1 appear after transform | Keep the training-time parameters unchanged; if needed, add an outlier strategy (clipping / robust scaling) and evaluate it on a validation set |
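The fixed evaluation protocol recommended in the table (Pipeline + StratifiedKFold, reporting mean and variance) can be sketched as follows; the synthetic dataset and seed are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=420)

pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance"))])

# StratifiedKFold with a fixed seed keeps class proportions per fold
# and makes the protocol reproducible across runs
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=420)
scores = cross_val_score(pipe, X, y, cv=cv)
print(f"mean={scores.mean():.3f} var={scores.var():.5f}")
```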