| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
The Confusion Matrix is the foundation of all classification metrics. It breaks down predictions into four categories based on true vs. predicted class:
• True Positives (TP): Correctly predicted positive class
• True Negatives (TN): Correctly predicted negative class
• False Positives (FP): Incorrectly predicted positive (Type I error, "false alarm")
• False Negatives (FN): Incorrectly predicted negative (Type II error, "missed detection")
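The four categories above can be tallied directly from paired label lists. A minimal sketch in plain Python (the label encoding 1 = positive, 0 = negative and the example lists are illustrative assumptions, not from the text):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Illustrative labels: two hits, two correct rejections, one false alarm,
# one missed detection.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```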
Key Metrics Derived from the Confusion Matrix:
1. Sensitivity (Recall, TPR): Fraction of actual positives correctly identified. Formula: $$\text{Sensitivity} = \frac{TP}{TP + FN}$$
High sensitivity means few false negatives—critical in medical diagnosis and structural safety where missing positive cases is dangerous.
2. Specificity (TNR): Fraction of actual negatives correctly identified. Formula: $$\text{Specificity} = \frac{TN}{TN + FP}$$
High specificity means few false positives—important when intervention costs are high and false alarms waste resources.
3. Precision (PPV): Fraction of positive predictions that are correct. Formula: $$\text{Precision} = \frac{TP}{TP + FP}$$
High precision means when the model says "positive", it's usually right—key when acting on predictions is expensive.
4. Accuracy: Overall fraction of correct predictions. Formula: $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
While intuitive, accuracy is misleading with imbalanced classes. A model predicting only the majority class can achieve high accuracy while failing its purpose.
5. F1-Score: Harmonic mean of precision and recall. Formula: $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Provides a single metric balancing both precision and recall. Useful when both error types have comparable costs.
6. Balanced Accuracy: Average of sensitivity and specificity. Formula: $$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
Unlike standard accuracy, balanced accuracy treats both classes equally regardless of their frequency—essential for imbalanced datasets.
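All six metrics follow mechanically from the four confusion-matrix counts. A minimal sketch, assuming every denominator is nonzero (the example counts are made up for illustration):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the six metrics above from raw confusion-matrix counts.

    Assumes no denominator is zero; production code should guard those cases.
    """
    sensitivity = tp / (tp + fn)                       # recall / TPR
    specificity = tn / (tn + fp)                       # TNR
    precision = tp / (tp + fp)                         # PPV
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "balanced_accuracy": balanced_accuracy}

# Illustrative counts (not from the text):
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print({k: round(v, 3) for k, v in m.items()})
```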
The Accuracy Paradox: With the melanoma dataset (99.5% benign), a classifier that always predicts "benign" achieves 99.5% accuracy but 0% sensitivity—it misses all cancers! This demonstrates why accuracy alone is insufficient. Balanced accuracy (50%) and F1-score (0%) correctly reveal this model's failure.
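The paradox can be checked numerically. A sketch using the class split from the text (995 benign, 5 malignant out of 1,000, with a classifier that always predicts "benign"; the F1 = 0 convention for TP = 0 is an assumption, since precision is undefined there):

```python
# Always-"benign" classifier on a 99.5% benign dataset:
# it never predicts positive, so TP = FP = 0.
tp, fn, tn, fp = 0, 5, 995, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 0.995 — looks great
sensitivity = tp / (tp + fn)                         # 0.0 — misses every cancer
specificity = tn / (tn + fp)                         # 1.0
balanced_accuracy = (sensitivity + specificity) / 2  # 0.5 — chance level
# Precision is 0/0 here; by convention F1 is reported as 0 when TP = 0.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(accuracy, sensitivity, balanced_accuracy, f1)  # 0.995 0.0 0.5 0.0
```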
Threshold Selection Trade-off: Moving the decision threshold creates a fundamental trade-off between sensitivity and specificity. Lower thresholds predict more positives (higher sensitivity, lower specificity); higher thresholds predict fewer positives (lower sensitivity, higher specificity). The optimal threshold depends on the relative costs of false positives vs. false negatives in your application.
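The trade-off is easy to see by sweeping the threshold over scored examples. A small sketch with made-up predicted probabilities (the scores and thresholds are illustrative assumptions):

```python
# (predicted probability, true label) pairs — illustrative values only.
scores = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0)]

def sens_spec(threshold):
    """Sensitivity and specificity when predicting positive at score >= threshold."""
    tp = sum(1 for s, y in scores if s >= threshold and y == 1)
    fn = sum(1 for s, y in scores if s < threshold and y == 1)
    tn = sum(1 for s, y in scores if s < threshold and y == 0)
    fp = sum(1 for s, y in scores if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

for t in (0.25, 0.5, 0.75):
    sens, spec = sens_spec(t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
# Lower thresholds keep sensitivity high at the cost of specificity;
# raising the threshold reverses the trade.
```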