| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
The Confusion Matrix is the foundation of all classification metrics. It breaks down predictions into four categories based on true vs. predicted class:
• True Positives (TP): Correctly predicted positive class
• True Negatives (TN): Correctly predicted negative class
• False Positives (FP): Incorrectly predicted positive (Type I error, "false alarm")
• False Negatives (FN): Incorrectly predicted negative (Type II error, "missed detection")
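The four categories above can be tallied directly from paired label lists. A minimal sketch in plain Python (the label encoding 1 = positive, 0 = negative and the example lists are illustrative assumptions, not from the text):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Illustrative labels: two hits, two correct rejections, one false alarm,
# one missed detection.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```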
Key Metrics Derived from the Confusion Matrix:
1. Sensitivity (Recall, TPR): Fraction of actual positives correctly identified. Formula: $$\text{Sensitivity} = \frac{TP}{TP + FN}$$
High sensitivity means few false negatives—critical in medical diagnosis and structural safety where missing positive cases is dangerous.
2. Specificity (TNR): Fraction of actual negatives correctly identified. Formula: $$\text{Specificity} = \frac{TN}{TN + FP}$$
High specificity means few false positives—important when intervention costs are high and false alarms waste resources.
3. Precision (PPV): Fraction of positive predictions that are correct. Formula: $$\text{Precision} = \frac{TP}{TP + FP}$$
High precision means when the model says "positive", it's usually right—key when acting on predictions is expensive.
4. Accuracy: Overall fraction of correct predictions. Formula: $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
While intuitive, accuracy is misleading with imbalanced classes. A model predicting only the majority class can achieve high accuracy while failing its purpose.
5. F1-Score: Harmonic mean of precision and recall. Formula: $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Provides a single metric balancing both precision and recall. Useful when both error types have comparable costs.
6. Balanced Accuracy: Average of sensitivity and specificity. Formula: $$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
Unlike standard accuracy, balanced accuracy treats both classes equally regardless of their frequency—essential for imbalanced datasets.
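All six metrics follow mechanically from the four confusion-matrix counts. A minimal sketch, assuming every denominator is nonzero (the example counts are made up for illustration):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the six metrics above from raw confusion-matrix counts.

    Assumes no denominator is zero; production code should guard those cases.
    """
    sensitivity = tp / (tp + fn)                       # recall / TPR
    specificity = tn / (tn + fp)                       # TNR
    precision = tp / (tp + fp)                         # PPV
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "balanced_accuracy": balanced_accuracy}

# Illustrative counts (not from the text):
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print({k: round(v, 3) for k, v in m.items()})
```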
The Accuracy Paradox: With the melanoma dataset (99.5% benign), a classifier that always predicts "benign" achieves 99.5% accuracy but 0% sensitivity—it misses all cancers! This demonstrates why accuracy alone is insufficient. Balanced accuracy (50%) and F1-score (0%) correctly reveal this model's failure.
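The paradox can be checked numerically. A sketch using the class split from the text (995 benign, 5 malignant out of 1,000, with a classifier that always predicts "benign"; the F1 = 0 convention for TP = 0 is an assumption, since precision is undefined there):

```python
# Always-"benign" classifier on a 99.5% benign dataset:
# it never predicts positive, so TP = FP = 0.
tp, fn, tn, fp = 0, 5, 995, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 0.995 — looks great
sensitivity = tp / (tp + fn)                         # 0.0 — misses every cancer
specificity = tn / (tn + fp)                         # 1.0
balanced_accuracy = (sensitivity + specificity) / 2  # 0.5 — chance level
# Precision is 0/0 here; by convention F1 is reported as 0 when TP = 0.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(accuracy, sensitivity, balanced_accuracy, f1)  # 0.995 0.0 0.5 0.0
```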
Threshold Selection Trade-off: Moving the decision threshold creates a fundamental trade-off between sensitivity and specificity. Lower thresholds predict more positives (higher sensitivity, lower specificity); higher thresholds predict fewer positives (lower sensitivity, higher specificity). The optimal threshold depends on the relative costs of false positives vs. false negatives in your application.
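The trade-off is easy to see by sweeping the threshold over scored examples. A small sketch with made-up predicted probabilities (the scores and thresholds are illustrative assumptions):

```python
# (predicted probability, true label) pairs — illustrative values only.
scores = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0)]

def sens_spec(threshold):
    """Sensitivity and specificity when predicting positive at score >= threshold."""
    tp = sum(1 for s, y in scores if s >= threshold and y == 1)
    fn = sum(1 for s, y in scores if s < threshold and y == 1)
    tn = sum(1 for s, y in scores if s < threshold and y == 0)
    fp = sum(1 for s, y in scores if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

for t in (0.25, 0.5, 0.75):
    sens, spec = sens_spec(t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
# Lower thresholds keep sensitivity high at the cost of specificity;
# raising the threshold reverses the trade.
```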