Big Data 204 - Confusion Matrix to ROC
TL;DR
- Scenario: accuracy is misleading when classes are imbalanced; the business cares more about capturing the minority class and about the cost of misclassification
- Conclusion: use the confusion matrix as a common frame of reference, then use Precision/Recall/F1 and ROC/AUC for tradeoffs and threshold selection
- Output: TP/FP/FN/TN definitions and metric mapping + version notes for sklearn metric/plotting interfaces + a quick reference of common pitfalls
Version Matrix
| Item/Interface | Verified | Description |
|---|---|---|
| scikit-learn | ✅ 1.8.0 | PyPI latest stable release, release date 2025-12-10; Requires: Python ≥3.11 |
| Python | ✅ ≥3.11 | sklearn 1.8.0 explicitly requires Python ≥3.11 |
| sklearn.metrics.confusion_matrix | ✅ 1.8.0 | Docs: normalize={'true','pred','all'} normalizes over true rows, predicted columns, or all samples |
| sklearn.metrics.ConfusionMatrixDisplay | ✅ 1.8.0 | Docs recommend from_estimator/from_predictions to create the display object |
| sklearn.metrics.roc_curve | ✅ 1.8.0 | Docs: signature includes pos_label; input is y_score (score/probability), not discrete labels |
| sklearn.metrics.RocCurveDisplay | ✅ 1.8.0 | Docs: the ROC display can be built directly from an estimator |
Confusion Matrix
Continuing the example from the previous section: if the goal is to capture the minority class, accuracy gradually fails as an evaluation metric, so we need new ones. Naively, we could just check the model's accuracy on the minority class alone: as long as it captures as much of the minority class as possible, the goal is met.
But this creates a new problem: every majority-class sample that is misclassified must later be excluded by manual screening or other business measures, which is usually expensive.
For example, when a bank judges whether a credit-card applicant will default, an applicant judged likely to default is rejected. If, in order to catch every defaulter, the model labels a large number of non-defaulting customers as defaulters, many innocent customers will have their applications rejected.
In other words, blindly pursuing minority-class capture costs too much, while ignoring the minority class defeats the model's purpose. In practice we therefore look for a balance between the ability to capture the minority class and the cost of misclassifying the majority class. A model that captures as much of the minority class as possible while staying accurate on the majority class is a good model. To evaluate this ability, we introduce a new evaluation tool: the confusion matrix.
- The confusion matrix is a multi-dimensional metric system for binary classification, especially useful when samples are imbalanced
- In the confusion matrix, the minority class is treated as the positive class and the majority class as the negative class
- In decision trees and random forests, the minority class is labeled 1 and the majority class 0
- In SVM, the minority class is 1 and the majority class is -1
The classes are usually encoded as 0 and 1. True to its name, the confusion matrix is easy to confuse: textbooks use a variety of names and definitions, which makes it hard to understand and remember.
Four Elements of Confusion Matrix
Where:
- Rows represent predicted values, columns represent actual values (note: sklearn's confusion_matrix uses the opposite orientation, with actual values on the rows)
- A predicted value of 1 is recorded as P (Positive)
- A predicted value of 0 is recorded as N (Negative)
- A prediction that matches the actual value is recorded as T (True)
- A prediction opposite to the actual value is recorded as F (False)
The four cells of the matrix therefore represent:
- TP (True Positive): Actual is 1, predicted is 1
- FN (False Negative): Actual is 1, predicted is 0
- FP (False Positive): Actual is 0, predicted is 1
- TN (True Negative): Actual is 0, predicted is 0
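The four cells can be recovered directly from sklearn. A minimal sketch, using toy labels assumed for illustration; note that sklearn's confusion_matrix puts actual labels on the rows and predictions on the columns:

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumed): 3 positives (minority), 5 negatives (majority)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# labels=[0, 1] fixes the order: row/col 0 = negative, row/col 1 = positive,
# so the matrix is [[TN, FP], [FN, TP]] and ravel() unpacks in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, fn, fp, tn)                    # 2 1 1 4
print((tp + tn) / (tp + tn + fp + fn))   # accuracy = 0.75
```

Hand-checking the four cells on a few samples like this is the quickest way to confirm which axis is which.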
From the confusion matrix we derive a series of evaluation metrics, all ranging over [0,1]. Metrics built on correct predictions (T terms in the numerator) are better the closer they are to 1; metrics built on errors (F terms in the numerator) are better the closer they are to 0.
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision, also called the precision rate, is the proportion of samples actually 1 among all samples predicted as 1: Precision = TP / (TP + FP). A low Precision means many actual 0s sit among the predicted 1s, i.e. the model misclassifies the majority class at a high rate. To avoid misjudging the majority class, pursue high Precision.
Precision measures the cost incurred by misclassifying the majority class.
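A minimal sketch, using toy labels assumed for illustration, that cross-checks the formula TP / (TP + FP) against sklearn's precision_score:

```python
from sklearn.metrics import precision_score

# Toy labels (assumed): minority class is 1
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1

prec_manual = tp / (tp + fp)
prec_sklearn = precision_score(y_true, y_pred, pos_label=1)
print(prec_manual, prec_sklearn)  # both 2/3
```

Setting pos_label explicitly is what ties "Precision" to the business definition of the positive class.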
Recall
Recall, also called sensitivity, the true positive rate, or the capture rate, is the proportion of samples correctly predicted among all samples actually 1: Recall = TP / (TP + FN).
The higher the Recall, the more of the minority class we capture; the lower the Recall, the more of the minority class we miss.
If we want to find the minority class at any cost (for example, fugitive criminals), we pursue high Recall. Conversely, if capturing the minority class is not the goal, Recall matters little.
Note that Recall and Precision share the same numerator, TP (actual 1 predicted as 1); only the denominators differ.
Recall and Precision trade off against each other; the balance between them reflects the balance between the need to capture the minority class and the need not to misjudge the majority class.
Which side to favor depends on the business requirement: is the cost of misclassifying the majority class higher, or the cost of missing the minority class?
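The tradeoff shows up directly when scanning the decision threshold. A minimal sketch with synthetic scores (assumed for illustration): lowering the threshold raises Recall and tends to lower Precision:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic ground truth and scores (assumed): 4 positives, 6 negatives
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05])

results = []
for thr in (0.75, 0.5, 0.25):
    y_pred = (y_score >= thr).astype(int)       # threshold the scores
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    results.append((thr, p, r))
    print(f"threshold={thr:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this data the scan prints Recall rising (0.50 → 0.75 → 1.00) while Precision falls (1.00 → 0.60 → 0.57) as the threshold drops.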
F1 Measure
To balance Precision and Recall, we take their harmonic mean as a combined metric, called the F1 measure: F1 = 2 × Precision × Recall / (Precision + Recall).
The harmonic mean of two numbers leans toward the smaller one, so pursuing a high F1 measure ensures that both Precision and Recall are reasonably high.
The F1 measure lies in [0,1]; closer to 1 is better.
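A minimal sketch, using toy labels assumed for illustration, showing the harmonic-mean formula matching sklearn's f1_score and sitting below the arithmetic mean:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (assumed), chosen so Precision and Recall differ
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)       # 2/3
r = recall_score(y_true, y_pred)          # 1/2
f1_manual = 2 * p * r / (p + r)           # harmonic mean = 4/7
print(f1_manual, f1_score(y_true, y_pred))
print(f1_manual < (p + r) / 2)            # True: leans toward the smaller value
```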
False Negative Rate
Another evaluation metric derived from Recall is the False Negative Rate (FNR), equal to 1 - Recall: among all samples actually 1, the proportion incorrectly judged as 0, i.e. FNR = FN / (TP + FN). It is rarely used in practice.
ROC Curve
ROC stands for Receiver Operating Characteristic curve; the analysis consists of plotting this curve.
The horizontal axis is the False Positive Rate, FPR = FP / N, where N is the number of actual negative samples and FP is the number of negatives the classifier predicted as positive.
The vertical axis is Recall, i.e. the True Positive Rate, TPR = TP / P, where P is the number of actual positive samples and TP is the number of positives the classifier predicted as positive.
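A minimal sketch with synthetic scores (assumed for illustration): roc_curve expects a continuous y_score (probability or decision value), not discrete 0/1 labels, and roc_auc_score summarizes the curve as one number:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic ground truth and scores (assumed): 4 positives, 6 negatives
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05])

# pos_label makes the business positive class explicit
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.875 on this data
```

The curve starts at (0, 0) and ends at (1, 1); AUC is the fraction of positive/negative pairs the scores rank correctly.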
Confusion Matrix in sklearn
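For plotting, the docs recommend ConfusionMatrixDisplay.from_estimator / from_predictions; the normalize argument works the same way in plain confusion_matrix. A minimal sketch, using toy labels assumed for illustration, showing what each normalize option divides by:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumed)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# normalize='true': divide by row (actual) totals -> the Recall view
cm_true = confusion_matrix(y_true, y_pred, normalize="true")
# normalize='pred': divide by column (predicted) totals -> the Precision view
cm_pred = confusion_matrix(y_true, y_pred, normalize="pred")

print(cm_true.sum(axis=1))  # each row sums to 1
print(cm_pred.sum(axis=0))  # each column sums to 1
```

Checking which axis sums to 1 is a quick way to confirm which normalization you actually got.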
Decision Tree Algorithm Evaluation
Decision Tree Advantages Detailed
- Easy to understand and interpret
- A decision tree displays its decision process as a tree structure, which is highly visual and understandable even to non-specialists
- Each node represents a feature test, branches represent test outcomes, and leaf nodes represent final decisions
- Example: in bank loan approval, a decision path like "income > 50k → good credit → approve" can be read directly off the tree
- Low data preprocessing requirements
- No need for feature scaling or standardization (decision trees are insensitive to feature value ranges)
- Can in principle handle categorical features directly without one-hot encoding (sklearn's implementation, however, still requires numeric encoding)
- Note: check missing-value support for your version; older sklearn trees required filling missing values in advance, while recent versions (1.3+) added native NaN handling
- High prediction efficiency
- Prediction time complexity is O(log n) for a balanced tree, where n is the number of training samples
- Comparison: KNN prediction needs O(n); SVM needs O(n_sv), proportional to the number of support vectors
- Suitable for real-time prediction scenarios, such as financial risk-control systems requiring millisecond responses
- Ability to handle multiple data types
- Can handle a mix of numerical features (e.g. age, income) and categorical features (e.g. gender, occupation)
- Can be used for both classification (predicting category labels) and regression (predicting continuous values)
- Example applications: house-price prediction (regression) and customer-churn prediction (classification) use the same algorithm family
- Model robustness
- Makes few assumptions about the data distribution and handles non-linear relationships between features well
- Relatively insensitive to outliers, because splits are based on feature ordering (thresholds), not absolute magnitudes
Decision Tree Disadvantages Detailed
- Overfitting risk
- Tends to grow an overly complex tree that fits the training data perfectly but generalizes poorly
- Solutions:
- Pre-pruning: Set max_depth (maximum depth), min_samples_leaf (minimum samples per leaf node)
- Post-pruning: After training complete tree, prune unimportant branches
- Example: Limiting max_depth=5 usually balances model complexity and performance
- Model instability
- Small changes in training data may cause completely different tree structures
- Root cause: Greedy algorithm’s local optimal characteristic
- Improvement methods:
- Use ensemble learning (like Random Forest) to reduce variance through multiple tree voting
- In Random Forest, set max_features parameter to control feature sampling randomness
- Local optimum problem
- Each split only considers current node’s optimal solution, can’t backtrack to adjust previous splits
- May lead to suboptimal global model
- Solutions:
- Ensemble methods: Combine multiple weak learners through Bagging or Boosting
- Feature engineering: Manually construct more discriminative features
- Sensitivity to class imbalance
- When one class dominates the sample, the model becomes biased toward that class
- Solutions:
- Oversample minority class or undersample majority class
- Use the class weight parameter (e.g. class_weight='balanced')
- Use evaluation metrics like AUC that are insensitive to class imbalance
- Example: In fraud detection, when normal transactions account for 99%, need special handling of imbalance
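The two knobs discussed above, max_depth for pre-pruning and class_weight='balanced' for imbalance, can be combined in one fit. A minimal sketch on synthetic imbalanced data (make_classification with assumed parameters, not a real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Synthetic imbalanced data (assumed): ~10% minority class (label 1)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pre-pruned tree, without and with class weighting
plain = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(max_depth=5, class_weight="balanced",
                                  random_state=0).fit(X_tr, y_tr)

rec_plain = recall_score(y_te, plain.predict(X_te))
rec_balanced = recall_score(y_te, weighted.predict(X_te))
print("recall plain   :", rec_plain)
print("recall balanced:", rec_balanced)
```

On imbalanced data the class-weighted tree typically captures more of the minority class, at some cost in Precision; compare both metrics, not one.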
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Confusion matrix row/column meanings don't match (TP/FP look swapped) | Axis meaning reversed (actual/predicted swapped), or the plotting tool's default order misread | Hand-compute TP/FP/FN/TN on 3-5 samples and compare with the four cells to pin down confusion_matrix(y_true, y_pred) semantics; always annotate axes as actual/predicted when displaying; transpose if needed, but keep the convention consistent |
| Precision/Recall values look wrong (surprisingly low, contradicting intuition) | Positive label pos_label inconsistent (0/1 vs -1/1 vs strings); labels ordering doesn't match the business positive class | Print np.unique(y_true) and np.unique(y_pred); check the minority class is really treated as the positive class; set pos_label explicitly for ROC/binary metrics; set labels=[neg, pos] explicitly in the confusion matrix to preserve the business order |
| ROC curve looks like a straight line / has few points / is opposite to expectation | y_pred (0/1 labels) was passed as y_score; or the score direction is reversed (positive class gets smaller scores) | Check whether the second argument to roc_curve contains only {0,1}; check the positive class has the higher mean score; pass predict_proba[:, 1] or decision_function output; invert the score or fix the positive-class definition (pos_label) if needed |
| Normalized confusion matrix percentages look wrong (rows or columns don't sum to 1) | Misunderstood which denominator normalize='true'/'pred'/'all' uses | Check whether rows or columns sum to 1 after normalization; for the Recall view use normalize='true'; for the Precision view use normalize='pred'; for overall proportions use normalize='all' |
| Accuracy high but business results poor (key minority cases missed) | Class imbalance lets the majority class dominate; the default 0.5 threshold doesn't match the cost function | Check whether FN is large in the confusion matrix and whether Recall/TPR is low; prioritize Precision/Recall/F1 or ROC/AUC; tune the threshold to business cost, and fix the positive-class definition and threshold in evaluation reports |
| After boosting minority-class capture, majority-class misclassifications surge (review/rejection volume explodes) | Recall pursued in isolation, FP cost ignored | Track the trend of FP and Precision; use Precision as a cost constraint and do a threshold scan / PR tradeoff; lock in both a Precision floor and a Recall floor in launch metrics |
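The third pitfall, passing hard labels where a score is expected, is easy to demonstrate. A minimal sketch with synthetic scores (assumed for illustration): hard 0/1 labels collapse the ROC curve to a single corner point, while continuous scores produce a proper staircase:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic ground truth and scores (assumed)
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05])
y_pred  = (y_score >= 0.5).astype(int)  # hard labels -- the wrong input

fpr_bad, _, _ = roc_curve(y_true, y_pred)   # only 3 points: looks like a line
fpr_ok, _, _  = roc_curve(y_true, y_score)  # many points: a real curve
print(len(fpr_bad), len(fpr_ok))
```

If roc_curve returns only three points, the "curve" is two straight segments through a single operating point, which is usually the symptom described in the table above.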