Big Data 204 - Confusion Matrix to ROC
TL;DR
- Scenario: accuracy is misleading when classes are imbalanced; the business cares more about capturing the minority class and about the cost of misclassification
- Conclusion: use the confusion matrix as a common frame of reference, then use Precision/Recall/F1 and ROC/AUC for tradeoffs and threshold selection
- Output: TP/FP/FN/TN definitions and metric mapping + version notes for sklearn metric/plotting interfaces + a quick reference of common pitfalls
Version Matrix
| Item/Interface | Verified | Description |
|---|---|---|
| scikit-learn | ✅ 1.8.0 | PyPI latest stable release, release date 2025-12-10; Requires: Python ≥3.11 |
| Python | ✅ ≥3.11 | sklearn 1.8.0 explicitly requires Python ≥3.11 |
| sklearn.metrics.confusion_matrix | ✅ 1.8.0 | Docs: normalize={'true','pred','all'} normalizes over true rows, predicted columns, or all samples |
| sklearn.metrics.ConfusionMatrixDisplay | ✅ 1.8.0 | Docs recommend from_estimator/from_predictions to create the display object |
| sklearn.metrics.roc_curve | ✅ 1.8.0 | Docs: signature includes pos_label; input is y_score (score/probability), not discrete labels |
| sklearn.metrics.RocCurveDisplay | ✅ 1.8.0 | Docs: the ROC display can be built directly from an estimator |
Confusion Matrix
Continuing the example from the previous section: if the goal is to capture the minority class, accuracy gradually fails as an evaluation metric, so we need new ones. Naively, we could just check the model's accuracy on the minority class alone: as long as it captures as much of the minority class as possible, the goal is met.
But this creates a new problem: every majority-class sample that is misclassified must later be excluded by manual screening or other business measures, which is usually expensive.
For example, when a bank judges whether a credit-card applicant will default, an applicant judged likely to default is rejected. If, in order to catch every defaulter, the model labels a large number of non-defaulting customers as defaulters, many innocent customers will have their applications rejected.
In other words, blindly pursuing minority-class capture costs too much, while ignoring the minority class defeats the model's purpose. In practice we therefore look for a balance between the ability to capture the minority class and the cost of misclassifying the majority class. A model that captures as much of the minority class as possible while staying accurate on the majority class is a good model. To evaluate this ability, we introduce a new evaluation tool: the confusion matrix.
- The confusion matrix is a multi-dimensional metric system for binary classification, especially useful when samples are imbalanced
- In the confusion matrix, the minority class is treated as the positive class and the majority class as the negative class
- In decision trees and random forests, the minority class is labeled 1 and the majority class 0
- In SVM, the minority class is 1 and the majority class is -1
The classes are usually encoded as 0 and 1. True to its name, the confusion matrix is easy to confuse: textbooks use a variety of names and definitions, which makes it hard to understand and remember.
Four Elements of Confusion Matrix
Where:
- Rows represent predicted values, columns represent actual values (note: sklearn's confusion_matrix uses the opposite orientation, with actual values on the rows)
- A predicted value of 1 is recorded as P (Positive)
- A predicted value of 0 is recorded as N (Negative)
- A prediction that matches the actual value is recorded as T (True)
- A prediction opposite to the actual value is recorded as F (False)
The four cells of the matrix therefore represent:
- TP (True Positive): Actual is 1, predicted is 1
- FN (False Negative): Actual is 1, predicted is 0
- FP (False Positive): Actual is 0, predicted is 1
- TN (True Negative): Actual is 0, predicted is 0
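The four cells can be recovered directly from sklearn. A minimal sketch, using toy labels assumed for illustration; note that sklearn's confusion_matrix puts actual labels on the rows and predictions on the columns:

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumed): 3 positives (minority), 5 negatives (majority)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# labels=[0, 1] fixes the order: row/col 0 = negative, row/col 1 = positive,
# so the matrix is [[TN, FP], [FN, TP]] and ravel() unpacks in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, fn, fp, tn)                    # 2 1 1 4
print((tp + tn) / (tp + tn + fp + fn))   # accuracy = 0.75
```

Hand-checking the four cells on a few samples like this is the quickest way to confirm which axis is which.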
From the confusion matrix we derive a series of evaluation metrics, all ranging over [0,1]. Metrics built on correct predictions (T terms in the numerator) are better the closer they are to 1; metrics built on errors (F terms in the numerator) are better the closer they are to 0.
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision, also called the precision rate, is the proportion of samples actually 1 among all samples predicted as 1: Precision = TP / (TP + FP). A low Precision means many actual 0s sit among the predicted 1s, i.e. the model misclassifies the majority class at a high rate. To avoid misjudging the majority class, pursue high Precision.
Precision measures the cost incurred by misclassifying the majority class.
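A minimal sketch, using toy labels assumed for illustration, that cross-checks the formula TP / (TP + FP) against sklearn's precision_score:

```python
from sklearn.metrics import precision_score

# Toy labels (assumed): minority class is 1
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1

prec_manual = tp / (tp + fp)
prec_sklearn = precision_score(y_true, y_pred, pos_label=1)
print(prec_manual, prec_sklearn)  # both 2/3
```

Setting pos_label explicitly is what ties "Precision" to the business definition of the positive class.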
Recall
Recall, also called sensitivity, the true positive rate, or the capture rate, is the proportion of samples correctly predicted among all samples actually 1: Recall = TP / (TP + FN).
The higher the Recall, the more of the minority class we capture; the lower the Recall, the more of the minority class we miss.
If we want to find the minority class at any cost (for example, fugitive criminals), we pursue high Recall. Conversely, if capturing the minority class is not the goal, Recall matters little.
Note that Recall and Precision share the same numerator, TP (actual 1 predicted as 1); only the denominators differ.
Recall and Precision trade off against each other; the balance between them reflects the balance between the need to capture the minority class and the need not to misjudge the majority class.
Which side to favor depends on the business requirement: is the cost of misclassifying the majority class higher, or the cost of missing the minority class?
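The tradeoff shows up directly when scanning the decision threshold. A minimal sketch with synthetic scores (assumed for illustration): lowering the threshold raises Recall and tends to lower Precision:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic ground truth and scores (assumed): 4 positives, 6 negatives
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05])

results = []
for thr in (0.75, 0.5, 0.25):
    y_pred = (y_score >= thr).astype(int)       # threshold the scores
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    results.append((thr, p, r))
    print(f"threshold={thr:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this data the scan prints Recall rising (0.50 → 0.75 → 1.00) while Precision falls (1.00 → 0.60 → 0.57) as the threshold drops.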
F1 Measure
To balance Precision and Recall, we take their harmonic mean as a combined metric, called the F1 measure: F1 = 2 × Precision × Recall / (Precision + Recall).
The harmonic mean of two numbers leans toward the smaller one, so pursuing a high F1 measure ensures that both Precision and Recall are reasonably high.
The F1 measure lies in [0,1]; closer to 1 is better.
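A minimal sketch, using toy labels assumed for illustration, showing the harmonic-mean formula matching sklearn's f1_score and sitting below the arithmetic mean:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (assumed), chosen so Precision and Recall differ
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)       # 2/3
r = recall_score(y_true, y_pred)          # 1/2
f1_manual = 2 * p * r / (p + r)           # harmonic mean = 4/7
print(f1_manual, f1_score(y_true, y_pred))
print(f1_manual < (p + r) / 2)            # True: leans toward the smaller value
```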
False Negative Rate
Another evaluation metric derived from Recall is the False Negative Rate (FNR), equal to 1 - Recall: among all samples actually 1, the proportion incorrectly judged as 0, i.e. FNR = FN / (TP + FN). It is rarely used in practice.
ROC Curve
ROC stands for Receiver Operating Characteristic curve; the analysis consists of plotting this curve.
The horizontal axis is the False Positive Rate, FPR = FP / N, where N is the number of actual negative samples and FP is the number of negatives the classifier predicted as positive.
The vertical axis is Recall, i.e. the True Positive Rate, TPR = TP / P, where P is the number of actual positive samples and TP is the number of positives the classifier predicted as positive.
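A minimal sketch with synthetic scores (assumed for illustration): roc_curve expects a continuous y_score (probability or decision value), not discrete 0/1 labels, and roc_auc_score summarizes the curve as one number:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic ground truth and scores (assumed): 4 positives, 6 negatives
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05])

# pos_label makes the business positive class explicit
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.875 on this data
```

The curve starts at (0, 0) and ends at (1, 1); AUC is the fraction of positive/negative pairs the scores rank correctly.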
Confusion Matrix in sklearn
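For plotting, the docs recommend ConfusionMatrixDisplay.from_estimator / from_predictions; the normalize argument works the same way in plain confusion_matrix. A minimal sketch, using toy labels assumed for illustration, showing what each normalize option divides by:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumed)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# normalize='true': divide by row (actual) totals -> the Recall view
cm_true = confusion_matrix(y_true, y_pred, normalize="true")
# normalize='pred': divide by column (predicted) totals -> the Precision view
cm_pred = confusion_matrix(y_true, y_pred, normalize="pred")

print(cm_true.sum(axis=1))  # each row sums to 1
print(cm_pred.sum(axis=0))  # each column sums to 1
```

Checking which axis sums to 1 is a quick way to confirm which normalization you actually got.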
Decision Tree Algorithm Evaluation
Decision Tree Advantages Detailed
- Easy to understand and interpret
- A decision tree displays its decision process as a tree structure, which is highly visual and understandable even to non-specialists
- Each node represents a feature test, branches represent test outcomes, and leaf nodes represent final decisions
- Example: in bank loan approval, a decision path like "income > 50k → good credit → approve" can be read directly off the tree
- Low data preprocessing requirements
- No need for feature scaling or standardization (decision trees are insensitive to feature value ranges)
- Can in principle handle categorical features directly without one-hot encoding (sklearn's implementation, however, still requires numeric encoding)
- Note: check missing-value support for your version; older sklearn trees required filling missing values in advance, while recent versions (1.3+) added native NaN handling
- High prediction efficiency
- Prediction time complexity is O(log n) for a balanced tree, where n is the number of training samples
- Comparison: KNN prediction needs O(n); SVM needs O(n_sv), proportional to the number of support vectors
- Suitable for real-time prediction scenarios, such as financial risk-control systems requiring millisecond responses
- Ability to handle multiple data types
- Can handle a mix of numerical features (e.g. age, income) and categorical features (e.g. gender, occupation)
- Can be used for both classification (predicting category labels) and regression (predicting continuous values)
- Example applications: house-price prediction (regression) and customer-churn prediction (classification) use the same algorithm family
- Model robustness
- Makes few assumptions about the data distribution and handles non-linear relationships between features well
- Relatively insensitive to outliers, because splits are based on feature ordering (thresholds), not absolute magnitudes
Decision Tree Disadvantages Detailed
- Overfitting risk
- Tends to grow an overly complex tree that fits the training data perfectly but generalizes poorly
- Solutions:
- Pre-pruning: Set max_depth (maximum depth), min_samples_leaf (minimum samples per leaf node)
- Post-pruning: After training complete tree, prune unimportant branches
- Example: Limiting max_depth=5 usually balances model complexity and performance
- Model instability
- Small changes in training data may cause completely different tree structures
- Root cause: Greedy algorithm’s local optimal characteristic
- Improvement methods:
- Use ensemble learning (like Random Forest) to reduce variance through multiple tree voting
- In Random Forest, set max_features parameter to control feature sampling randomness
- Local optimum problem
- Each split only considers current node’s optimal solution, can’t backtrack to adjust previous splits
- May lead to suboptimal global model
- Solutions:
- Ensemble methods: Combine multiple weak learners through Bagging or Boosting
- Feature engineering: Manually construct more discriminative features
- Sensitivity to class imbalance
- When one class dominates the sample, the model becomes biased toward that class
- Solutions:
- Oversample minority class or undersample majority class
- Use the class weight parameter (e.g. class_weight='balanced')
- Use evaluation metrics like AUC that are insensitive to class imbalance
- Example: In fraud detection, when normal transactions account for 99%, need special handling of imbalance
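The two knobs discussed above, max_depth for pre-pruning and class_weight='balanced' for imbalance, can be combined in one fit. A minimal sketch on synthetic imbalanced data (make_classification with assumed parameters, not a real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Synthetic imbalanced data (assumed): ~10% minority class (label 1)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pre-pruned tree, without and with class weighting
plain = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(max_depth=5, class_weight="balanced",
                                  random_state=0).fit(X_tr, y_tr)

rec_plain = recall_score(y_te, plain.predict(X_te))
rec_balanced = recall_score(y_te, weighted.predict(X_te))
print("recall plain   :", rec_plain)
print("recall balanced:", rec_balanced)
```

On imbalanced data the class-weighted tree typically captures more of the minority class, at some cost in Precision; compare both metrics, not one.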
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Confusion matrix row/column meanings don't match (TP/FP look swapped) | Axis meaning reversed (actual/predicted swapped), or the plotting tool's default order misread | Hand-compute TP/FP/FN/TN on 3-5 samples and compare with the four cells to pin down confusion_matrix(y_true, y_pred) semantics; always annotate axes as actual/predicted when displaying; transpose if needed, but keep the convention consistent |
| Precision/Recall values look wrong (surprisingly low, contradicting intuition) | Positive label pos_label inconsistent (0/1 vs -1/1 vs strings); labels ordering doesn't match the business positive class | Print np.unique(y_true) and np.unique(y_pred); check the minority class is really treated as the positive class; set pos_label explicitly for ROC/binary metrics; set labels=[neg, pos] explicitly in the confusion matrix to preserve the business order |
| ROC curve looks like a straight line / has few points / is opposite to expectation | y_pred (0/1 labels) was passed as y_score; or the score direction is reversed (positive class gets smaller scores) | Check whether the second argument to roc_curve contains only {0,1}; check the positive class has the higher mean score; pass predict_proba[:, 1] or decision_function output; invert the score or fix the positive-class definition (pos_label) if needed |
| Normalized confusion matrix percentages look wrong (rows or columns don't sum to 1) | Misunderstood which denominator normalize='true'/'pred'/'all' uses | Check whether rows or columns sum to 1 after normalization; for the Recall view use normalize='true'; for the Precision view use normalize='pred'; for overall proportions use normalize='all' |
| Accuracy high but business results poor (key minority cases missed) | Class imbalance lets the majority class dominate; the default 0.5 threshold doesn't match the cost function | Check whether FN is large in the confusion matrix and whether Recall/TPR is low; prioritize Precision/Recall/F1 or ROC/AUC; tune the threshold to business cost, and fix the positive-class definition and threshold in evaluation reports |
| After boosting minority-class capture, majority-class misclassifications surge (review/rejection volume explodes) | Recall pursued in isolation, FP cost ignored | Track the trend of FP and Precision; use Precision as a cost constraint and do a threshold scan / PR tradeoff; lock in both a Precision floor and a Recall floor in launch metrics |
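The third pitfall, passing hard labels where a score is expected, is easy to demonstrate. A minimal sketch with synthetic scores (assumed for illustration): hard 0/1 labels collapse the ROC curve to a single corner point, while continuous scores produce a proper staircase:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic ground truth and scores (assumed)
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05])
y_pred  = (y_score >= 0.5).astype(int)  # hard labels -- the wrong input

fpr_bad, _, _ = roc_curve(y_true, y_pred)   # only 3 points: looks like a line
fpr_ok, _, _  = roc_curve(y_true, y_score)  # many points: a real curve
print(len(fpr_bad), len(fpr_ok))
```

If roc_curve returns only three points, the "curve" is two straight segments through a single operating point, which is usually the symptom described in the table above.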