Big Data 197 - K-Fold Cross-Validation Practice

TL;DR

  • Scenario: the same model's evaluation fluctuates greatly between training runs; a single split is not trustworthy; the value of K is hard to choose
  • Conclusion: use K-fold cross-validation to look at "mean score + variance", and prefer regions with a high mean and low variance
  • Output: a reproducible cross_val_score workflow, plus a way to select stable parameters by combining it with a learning curve

Cross-Validation

Background Problem

After choosing a value of K, we can observe an important phenomenon: every time the model is run, the learning curve changes and the model's performance fluctuates; in other words, the result is unstable. This fluctuation has two main causes:

  1. Randomness of the dataset split:

    • On each run, the training and test sets are split differently (usually by random sampling)
    • For example, with 1000 data points, the first run may randomly select 700 as the training set and the remaining 300 as the test set; on the second run, the split comes out differently
  2. Influence of the data distribution:

    • Different training sets lead the model to learn slightly different patterns
    • Different test-set compositions cause the evaluation metric to fluctuate
    • For example, if the test set happens to contain more boundary cases, the model performs worse
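The fluctuation described above is easy to reproduce. Below is a minimal sketch, using a synthetic dataset from `make_classification` as a stand-in for the 1000-point example (the KNN model and all parameter values are illustrative assumptions): re-splitting with different seeds yields a different test score each time.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset standing in for the 1000-point example above
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Re-split with a different seed each time: each split yields a different score
scores = []
for seed in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=5).fit(Xtr, ytr)
    scores.append(clf.score(Xte, yte))

print(scores)  # five accuracies from five different splits
```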

Problems in Real Business Scenarios

  1. The gap between historical data and new data:

    • Training data is usually static historical data
    • Test data simulates new data arriving in the system in real time
    • In an e-commerce recommendation system, for example, we train on the past 3 months of order data but need to predict next week's purchase behavior
  2. The core goal of model evaluation:

    • What we really care about is the model's performance on unknown data
    • This ability is called Generalization Ability
    • Good generalization ability means the model:
      • Remains robust to noisy data
      • Can handle unseen data patterns
      • Avoids overfitting to idiosyncrasies of the training data
  3. Ways to improve generalization ability:

    • Use cross-validation instead of a single split
    • Increase data diversity
    • Apply regularization techniques
    • For example, financial risk-control models commonly use 5-fold cross-validation to obtain a more reliable evaluation

Generalization Ability

In supervised learning, we usually divide a sample set into a [Training Set] and a [Test Set]: the training set is used to fit the model, and the test set is then used to evaluate the trained model's predictive performance.

  • Training Error: the fraction of misclassified samples on the training set
  • Test Error: the fraction of misclassified samples on the test set

The size of the training error indicates how easy the given problem is to learn. The test error reflects the model's ability to predict unknown data; a learning method with a small test error has good predictive ability. If the training set and test set are disjoint, this predictive ability is usually called Generalization Ability.
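To make the two definitions concrete, here is a toy calculation with hypothetical label vectors (all values chosen purely for illustration):

```python
import numpy as np

# Hypothetical true labels and model predictions
y_train_true = np.array([0, 1, 1, 0, 1])
y_train_pred = np.array([0, 1, 0, 0, 1])   # 1 of 5 misclassified
y_test_true  = np.array([1, 0, 1, 1])
y_test_pred  = np.array([1, 1, 0, 1])      # 2 of 4 misclassified

# Error = fraction of misclassified samples on each set
training_error = np.mean(y_train_true != y_train_pred)  # 0.2
test_error     = np.mean(y_test_true != y_test_pred)    # 0.5

print(training_error, test_error)
```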

Excellent performance on a single pair of training and test sets proves little. Only when a model performs well across many different training and test sets is it stable, and only then does it truly possess generalization ability.

For exactly this purpose, machine learning offers a remarkably useful technique, [Cross-Validation], to help us understand the model.


K-Fold Cross Validation

The most commonly used method is K-Fold Cross Validation. It divides the dataset into K equal-sized, mutually exclusive subsets; in each round, K-1 subsets serve as training data and the remaining subset as validation data. The process is repeated K times, and the K evaluation results are averaged.

Specific Steps

  1. Randomly divide original dataset into K equal-sized subsets (usually K=5 or 10)
  2. Sequentially select i-th subset as validation set (i=1,2,…,K), remaining K-1 subsets merged as training set
  3. Train model on training set, evaluate model performance on validation set
  4. Repeat steps 2-3 until all subsets have served as validation set
  5. Calculate average of K evaluation results as final model performance metric
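The steps above can be sketched with sklearn's `KFold` splitter; the synthetic dataset and KNN classifier are stand-ins, not part of the original text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # step 1: K equal parts
fold_scores = []
for train_idx, val_idx in kf.split(X):                # steps 2-4: rotate the validation fold
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X[train_idx], y[train_idx])               # step 3: train on K-1 folds
    fold_scores.append(clf.score(X[val_idx], y[val_idx]))

final_score = np.mean(fold_scores)                    # step 5: average the K results
print(fold_scores, final_score)
```

In practice `cross_val_score` wraps this whole loop, as shown later in the section.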

Advantages

Compared with a simple train/test split, this method has several advantages:

  • Fully utilize limited data resources
  • Reduce randomness impact from data splitting
  • Provide more reliable model performance evaluation
  • Especially suitable for small to medium-sized datasets

K Value Selection

In practice, choosing K is a trade-off:

  • A smaller K (e.g. 5) is computationally cheaper
  • A larger K (e.g. 10) gives a more stable evaluation
  • The extreme case K = N (the number of samples) is leave-one-out cross-validation

Averaging over multiple cross-validation rounds significantly reduces the evaluation bias introduced by any single split, giving model selection and parameter tuning a more reliable basis.
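The cost side of the trade-off is easy to see in code. A sketch assuming the iris dataset and a KNN classifier as stand-ins: `cv=5` fits five models, while `LeaveOneOut` (K = N) fits one model per sample.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

# K=5: five scores, cheap to compute
scores_5 = cross_val_score(clf, X, y, cv=5)

# K=N: leave-one-out, one fit per sample (150 fits on iris)
scores_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(len(scores_5), len(scores_loo))  # 5 150
```

Note that each leave-one-out score is computed on a single sample, so it can only be 0 or 1.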


Learning Curve with Cross Validation

When reading a learning curve built with cross-validation, we should look not merely for the highest accuracy, but for points with high accuracy and relatively low variance; such points have the strongest generalization ability.

Hyperparameters selected with cross-validation plus a learning curve therefore tend to generalize better.

Code Example

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score as CVS
from sklearn.neighbors import KNeighborsClassifier

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.2, random_state=420)
clf = KNeighborsClassifier(n_neighbors=8)
# 6-fold cross-validation on the training set: one estimator is fit per fold
cvresult = CVS(clf, Xtrain, Ytrain, cv=6)
# Array of the 6 validation scores
cvresult

Check Mean, Variance

# Mean: the model's average performance across folds
cvresult.mean()
# Variance: how stable the model is across folds
cvresult.var()

Draw Learning Curve

import numpy as np
import matplotlib.pyplot as plt

score = []
var = []
# Try n_neighbors values from 1 to 19
krange = range(1, 20)
for i in krange:
    clf = KNeighborsClassifier(n_neighbors=i)
    cvresult = CVS(clf, Xtrain, Ytrain, cv=5)
    # Mean and variance of the 5 fold scores for this k
    score.append(cvresult.mean())
    var.append(cvresult.var())

plt.plot(krange, score, color='k')
# Dashed band: mean +/- 2 * variance across folds
plt.plot(krange, np.array(score) + np.array(var) * 2, c='red', linestyle='--')
plt.plot(krange, np.array(score) - np.array(var) * 2, c='red', linestyle='--')

Do We Need a Validation Set?

The most rigorous form of cross-validation uses three sets of data: a training set, a validation set, and a test set. Given a dataset:

  • First split the dataset into an overall training set and a test set
  • Then run cross-validation on the training set
  • Each round splits the training set into a smaller training set (K-1 folds) and a validation set (1 fold)
  • The returned cross-validation scores are therefore scores on the validation set
  • Use the validation set to find the best parameters, i.e. the model we believe generalizes best
  • Finally, apply that model to the test set and observe its performance
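This three-set workflow can be sketched with `GridSearchCV`, which performs the cross-validation tuning step internally; the dataset and parameter grid here are illustrative assumptions, not from the original text:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: hold out a test set that cross-validation never touches
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.2, random_state=420)

# Steps 2-5: GridSearchCV rotates the validation fold internally
# while searching over n_neighbors on the training set
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 11))}, cv=5)
search.fit(Xtrain, Ytrain)

# Step 6: evaluate the chosen model once on the untouched test set
print(search.best_params_, search.score(Xtest, Ytest))
```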

Usually, we expect that after selecting parameters via the validation set, the model's generalization ability improves, so it should perform better on unknown data (the test set). But an awkward situation often arises: after tuning with cross-validation, the test-set result does not improve.

Reason:

  • Our own train/test split itself affects the measured performance
  • Cross-validation improves generalization in the sense of lower variance and a higher average score on unknown data, but it cannot guarantee the best performance on the particular test set we happened to split off

If we trust that tuning with cross-validation improves generalization, then even when the test-set result does not improve (or even worsens), we still consider the model a success.

If we do not trust cross-validation to improve generalization and insist on judging by the test set, there is no need for cross-validation at all: we could run the learning curve directly on test-set results.

So whether a validation set is needed is genuinely debated; in rigorous settings, the three-set approach with a validation set remains the norm.


Other Cross Validation Methods

Cross-validation is not limited to "K-fold", and there is more than one way to split training and test sets; the various cross-validation splitters occupy a long chapter in sklearn's documentation.

All cross-validation schemes split the data into training and test portions; they differ in what they emphasize:

  • KFold: takes folds in index order
  • ShuffleSplit: draws each test set as a random sample spread across all the data
  • StratifiedKFold: requires each fold to preserve the label proportions of the full dataset
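The difference between the three splitters can be seen on a small, imbalanced toy array (the values are chosen purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold

# Toy imbalanced labels: 8 samples of class 0, 4 of class 1
y = np.array([0] * 8 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)

# KFold: contiguous folds in index order (unless shuffle=True)
for tr, te in KFold(n_splits=4).split(X):
    print("KFold test fold:", te)

# ShuffleSplit: each test set is an independent random sample of the data
for tr, te in ShuffleSplit(n_splits=2, test_size=0.25, random_state=0).split(X):
    print("ShuffleSplit test fold:", te)

# StratifiedKFold: every fold keeps the 2:1 class ratio of y
for tr, te in StratifiedKFold(n_splits=4).split(X, y):
    print("StratifiedKFold fold labels:", y[te])
```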

The details of the various schemes can be tedious, and harder variants appear further along the machine-learning journey, but the essence is always the same: cross-validation exists to counter the influence of the train/test split on the model, while measuring the model's generalization ability.


Avoid Too Many Folds

The number of folds cannot be too large: more folds means each fold is a smaller sample of the data, each fold's training data carries less information, and the model becomes less stable.

If You Find

  • The model performs well without cross-validation
  • But its performance drops sharply once cross-validation is used

Must Check:

  • Whether your labels have problems
  • Whether your dataset is too small or the fold count too high

Problem of Too Many Folds

If we change cv from 5 to 100:

score = []
var = []
krange = range(1, 20)
for i in krange:
    clf = KNeighborsClassifier(n_neighbors=i)
    cvresult = CVS(clf, Xtrain, Ytrain, cv=100)
    score.append(cvresult.mean())
    var.append(cvresult.var())

plt.plot(krange, score, color='k')
plt.plot(krange, np.array(score) + np.array(var) * 2, c='red', linestyle='--')
plt.plot(krange, np.array(score) - np.array(var) * 2, c='red', linestyle='--')

Too many folds lead to:

  • Much slower computation
  • Larger variance in the fold scores, making it hard to guarantee the expected performance on a new dataset

Common Problem Quick Reference

  • Symptom: each run's score varies greatly and the curve jumps up and down
    • Root cause: a single random split gives high evaluation variance, most visibly with small or class-imbalanced data
    • Fix: use K-fold cross-validation and report mean/variance; for classification, prefer StratifiedKFold
  • Symptom: results on the test set get worse after cross-validation
    • Root cause: "over-tuning" on the validation folds; the test-set split itself may be biased
    • Fix: keep a strict three-stage setup of train + (CV tuning) + test, or use nested CV
  • Symptom: with a very large cv (like 100), scores drop significantly and fluctuate more
    • Root cause: each fold's training set is too small, so model estimates are unstable; computation also explodes
    • Fix: keep cv moderate (commonly 5 or 10)
  • Symptom: error "The least populated class in y has only 1 member"
    • Root cause: under stratified cross-validation, some class has too few samples to split
    • Fix: merge rare classes, add samples, or reduce cv
  • Symptom: CV score abnormally high, but online performance very poor
    • Root cause: data leakage, e.g. standardization/feature selection/encoding done on the full dataset before CV
    • Fix: use a Pipeline so preprocessing happens inside the CV loop
  • Symptom: the same code gives different results on different machines
    • Root cause: parallelism, randomness, or floating-point differences; no seed fixed
    • Fix: fix random_state; record dependency versions and hardware info
  • Symptom: metrics look "stable" but business targets are not met
    • Root cause: wrong scoring metric chosen (accuracy is unsuitable for imbalanced data); only the mean is considered, not business constraints
    • Fix: compare multiple metrics: precision/recall/F1/AUC
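For the data-leakage entry above, here is a minimal sketch of the Pipeline fix (synthetic data; the scaler/KNN combination is just an example): preprocessing wrapped in a Pipeline is refit on each training fold, so the validation fold never influences it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Wrong: scaling the full dataset first lets validation folds leak into the scaler
# X_scaled = StandardScaler().fit_transform(X)  # then cross_val_score on X_scaled

# Right: the Pipeline refits the scaler on each training fold only
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=5))])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean(), scores.var())
```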