Machine learning models for hERG cardiotoxicity are typically benchmarked with aggregate metrics on a single data split. When a model reports p(toxic) = 0.72 for a new compound, practitioners tend to assume that number is meaningful. But is it?
We evaluated two architectures — a graph neural network (D-MPNN) and a fingerprint-based classifier (XGBoost) — across two independent datasets, three splitting strategies, and five random seeds. For each test molecule, we measured its structural similarity to the nearest training compound and asked: does prediction quality degrade as chemistry becomes more novel?
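The nearest-training-compound measurement can be sketched as follows. This is a minimal illustration, not the study's code: it assumes each molecule is already encoded as a set of "on" fingerprint bits and that similarity is the Tanimoto coefficient, neither of which is stated above. The names `tanimoto` and `nearest_train_similarity` are hypothetical.

```python
from typing import List, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits.

    Assumption: Tanimoto is used here for illustration only.
    """
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_train_similarity(
    test_fps: Sequence[Set[int]], train_fps: Sequence[Set[int]]
) -> List[float]:
    """For each test molecule, return its highest similarity to any training molecule.

    Requires a non-empty training set.
    """
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


# Toy fingerprints (hypothetical bit indices, not real molecules).
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{1, 2, 3, 5}, {20, 21}]
sims = nearest_train_similarity(test_fps, train_fps)
# First test molecule shares 3 of 5 distinct bits with its nearest neighbour;
# the second shares none with any training molecule.
```

A low value marks a test molecule whose chemistry is unlike anything in the training set, which is the sense of "novel" used here.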
Prediction quality does degrade, and the degradation is not specific to one model or one dataset. It appears in every well-powered experimental condition we tested, which suggests it is a property of the prediction task itself rather than an artifact of any single experimental choice.
| Similarity to training set | AUROC ↑ | Brier score ↓ | n |
|---|---|---|---|
| > 0.7 (most familiar) | 0.857 | 0.136 | 146 |
| 0.5 – 0.7 | 0.795 | 0.168 | 222 |
| 0.3 – 0.5 | 0.697 | 0.227 | 344 |
| < 0.3 (most novel) | 0.692 | 0.275 | 217 |
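A table of this shape could be produced by grouping test molecules into similarity bins and scoring each bin separately. The sketch below is illustrative only: the half-open bin boundaries, the function names, and the toy labels and probabilities are assumptions, and the numbers it produces have no connection to the results above. AUROC is computed as the fraction of (toxic, non-toxic) pairs ranked correctly, with ties counted as half, and the Brier score as the mean squared error between predicted probability and label.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Half-open bins [lo, hi); how boundary values are assigned is an assumption.
BINS: List[Tuple[str, float, float]] = [
    (">= 0.7", 0.7, float("inf")),
    ("0.5-0.7", 0.5, 0.7),
    ("0.3-0.5", 0.3, 0.5),
    ("< 0.3", float("-inf"), 0.3),
]


def brier_score(y: Sequence[int], p: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def auroc(y: Sequence[int], p: Sequence[float]) -> Optional[float]:
    """Pairwise AUROC; returns None when a bin lacks one of the two classes."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def metrics_by_bin(
    sims: Sequence[float], y: Sequence[int], p: Sequence[float]
) -> Dict[str, dict]:
    """Score predictions separately within each similarity bin."""
    out: Dict[str, dict] = {}
    for label, lo, hi in BINS:
        idx = [i for i, s in enumerate(sims) if lo <= s < hi]
        if not idx:
            out[label] = {"auroc": None, "brier": None, "n": 0}
            continue
        yb = [y[i] for i in idx]
        pb = [p[i] for i in idx]
        out[label] = {"auroc": auroc(yb, pb), "brier": brier_score(yb, pb), "n": len(idx)}
    return out


# Toy data (hypothetical): four familiar and four novel test molecules.
sims = [0.9, 0.8, 0.75, 0.72, 0.2, 0.1, 0.25, 0.15]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.4, 0.5]
report = metrics_by_bin(sims, labels, probs)
```

Reporting `n` alongside each metric matters because sparsely populated bins give noisy AUROC estimates, which is presumably why the claim above is restricted to well-powered conditions.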
The full study includes cross-model comparisons showing that the two architectures fail in complementary ways along this gradient, probability-collapse diagnostics under extreme distribution shift, and a downstream analysis of counterfactual lead optimisation under calibration-aware constraints.
Manuscript in preparation. For inquiries about the study, data, or code: h.sreedharan_sreedeth@unswalumni.com