Classification metrics help us evaluate model performance beyond simple accuracy. This demo shows how different metrics respond to threshold changes and why accuracy alone can be misleading, especially with imbalanced data.

In civil engineering, choosing the right metric matters most in safety-critical applications such as structural failure prediction, where missing a dangerous case (a false negative) can be catastrophic.
• Adjust the threshold slider to change the classification boundary and observe how all metrics respond
• Select different datasets to explore balanced vs. imbalanced scenarios
• Watch the confusion matrix update in real-time
• Hover over confusion matrix cells to highlight corresponding points on the plot

Can you find a threshold where accuracy is high but sensitivity is terrible? What does this mean for real-world applications?
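One way to probe that question outside the demo is to sweep the decision threshold over a scored dataset and watch the metrics diverge. The sketch below does this on a synthetic, imbalanced set of scores; the class ratio, score distributions, and variable names are illustrative assumptions, not the demo's actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative imbalanced dataset: 950 negatives, 50 positives (assumed, not the demo's data).
y_true = np.concatenate([np.zeros(950), np.ones(50)])
# Assumed score distributions: negatives centred low, positives centred higher.
scores = np.concatenate([rng.normal(0.3, 0.15, 950), rng.normal(0.7, 0.15, 50)])

for threshold in np.arange(0.1, 1.0, 0.1):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    print(f"threshold={threshold:.1f}  acc={accuracy:.3f}  "
          f"sens={sensitivity:.3f}  spec={specificity:.3f}")
```

At the highest thresholds the printout should show accuracy hovering near the 95% majority-class baseline while sensitivity collapses, which is exactly the pattern the question asks you to hunt for in the demo.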
Threshold trade-off: Lower threshold → higher sensitivity, lower specificity
Accuracy paradox: With imbalanced data, always predicting the majority class gives high accuracy but fails the objective
Melanoma scenario: Shows why 99% accuracy can be useless (0% sensitivity); the sketch after this list reproduces the effect
Bridge safety: False negatives (missed failures) are more costly than false positives
Balanced accuracy: Averages sensitivity and specificity, unaffected by class imbalance
F1-score: Harmonic mean of precision and recall, balances both error types
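The accuracy paradox is easy to reproduce with the 99.5/0.5 split: a classifier that always answers "negative" looks excellent on accuracy and useless on sensitivity. Below is a minimal sketch; the 995/5 counts mirror the 99.5/0.5 scenario, and everything else (variable names, the always-negative "classifier") is illustrative.

```python
import numpy as np

# 99.5/0.5 class split: 995 negatives (e.g. benign) and 5 positives (e.g. melanoma).
y_true = np.concatenate([np.zeros(995), np.ones(5)])

# A "classifier" that always predicts the majority (negative) class.
y_pred = np.zeros_like(y_true)

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy = (tp + tn) / len(y_true)                   # 0.995: looks excellent
sensitivity = tp / (tp + fn)                         # 0.0:   every positive missed
specificity = tn / (tn + fp)                         # 1.0
balanced_accuracy = (sensitivity + specificity) / 2  # 0.5:   no better than chance
print(accuracy, sensitivity, balanced_accuracy)
```

Accuracy reads 99.5% while every positive case is missed; balanced accuracy drops to 50%, which is why the demo reports it alongside plain accuracy.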
Example metric readout at a threshold of 0.50 (the same run as the confusion matrix below):
• Sensitivity = TP / (TP + FN) = 85.0%
• Specificity = TN / (TN + FP) = 81.5%
• Precision (PPV) = TP / (TP + FP) = 82.1%
• Accuracy = (TP + TN) / Total = 83.3%
• F1-Score = 2·P·R / (P + R) = 83.5%
• Balanced Accuracy = (Sensitivity + Specificity) / 2 = 83.3%

Dataset scenarios (class balance): 50/50 (balanced), 99.5/0.5 (severely imbalanced, as in the melanoma scenario), and 95/5 (imbalanced).
Confusion matrix at threshold 0.50:

                     Predicted Positive    Predicted Negative
Actual Positive      TP = 170              FN = 30
Actual Negative      FP = 37               TN = 163
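Every value in the metric readout above follows from these four counts. The snippet below recomputes them as a plain arithmetic check; it is Python written for this text, not code from the demo itself.

```python
# Counts from the confusion matrix above (threshold 0.50).
TP, FN, FP, TN = 170, 30, 37, 163

sensitivity = TP / (TP + FN)                   # 170 / 200 = 0.850
specificity = TN / (TN + FP)                   # 163 / 200 = 0.815
precision   = TP / (TP + FP)                   # 170 / 207 ≈ 0.821
accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 333 / 400 ≈ 0.833
f1          = 2 * precision * sensitivity / (precision + sensitivity)  # ≈ 0.835
balanced_accuracy = (sensitivity + specificity) / 2                    # ≈ 0.833

print(f"sens={sensitivity:.3f} spec={specificity:.3f} prec={precision:.3f} "
      f"acc={accuracy:.3f} F1={f1:.3f} bal_acc={balanced_accuracy:.3f}")
```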