hERGBench — Can you trust an AI drug safety prediction?

Your drug safety model is silently wrong on novel chemistry

We show that hERG toxicity predictions degrade as test compounds become less similar to the training set, and that standard evaluation metrics do not reveal it.

Hari S. Sreedeth · April 2026 · Manuscript in preparation

Machine learning models for hERG cardiotoxicity are typically benchmarked with aggregate metrics on a single data split. When a model reports p(toxic) = 0.72, practitioners assume the number is meaningful: that roughly 72% of compounds scored this way are in fact toxic. But is that assumption justified?

We evaluated two architectures — a graph neural network (D-MPNN) and a fingerprint-based classifier (XGBoost) — across two independent datasets, three splitting strategies, and five random seeds. For each test molecule, we measured its structural similarity to the nearest training compound and asked: does prediction quality degrade as chemistry becomes more novel?
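The similarity measurement described above can be sketched as follows. This is a minimal illustration, not the study's code: it assumes each molecule is already represented as a set of fingerprint on-bit indices (how the fingerprints were generated is not stated here), and the function names and the handling of values that fall exactly on a bin edge are hypothetical choices.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_train_similarity(test_fp: Set[int], train_fps: Iterable[Set[int]]) -> float:
    """Highest Tanimoto similarity between one test compound and any training compound."""
    return max((tanimoto(test_fp, fp) for fp in train_fps), default=0.0)


def novelty_bin(similarity: float) -> str:
    """Map a similarity value to one of four bins (edge handling is an assumption)."""
    if similarity > 0.7:
        return "> 0.7"
    if similarity >= 0.5:
        return "0.5-0.7"
    if similarity >= 0.3:
        return "0.3-0.5"
    return "< 0.3"


# Toy fingerprints: hypothetical bit indices, not real molecules.
train = [{1, 2, 3, 4}, {10, 11, 12}]
test = {1, 2, 3, 9}

sim = nearest_train_similarity(test, train)
print(sim, novelty_bin(sim))
```

Taking the maximum over training compounds, rather than the mean, matches the "nearest training compound" framing: a test molecule counts as familiar if even one close analogue was seen during training.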

The core finding

Figure: reliability diagrams, two panels. Left: familiar chemistry, test molecules similar to the training set (Tanimoto > 0.7). Right: novel chemistry, test molecules far from the training set (Tanimoto < 0.3).

Reliability diagrams for D-MPNN on ChEMBL hERG under scaffold split. Left: for familiar compounds, predicted probabilities closely match observed outcomes (the curve follows the diagonal). Right: for novel compounds, a large calibration gap opens, and predicted probabilities no longer track the observed toxicity rates. Both panels use the same model, the same dataset, and the same evaluation. Only the structural distance to the training data differs.
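For readers unfamiliar with reliability diagrams, the sketch below shows one common way to compute the points on such a curve and to summarise the calibration gap as a single number. It is an illustration under stated assumptions: equal-width probability bins and the expected calibration error (ECE) are standard choices, but the binning scheme and calibration metric actually used in the study are not given here, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def reliability_bins(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed positive rate, count) per non-empty bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[idx] += p
        hits[idx] += y
        counts[idx] += 1
    return [
        (sums[i] / counts[i], hits[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i]
    ]


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Count-weighted mean of |predicted probability - observed rate| over bins."""
    bins = reliability_bins(probs, labels, n_bins)
    total = sum(count for _, _, count in bins)
    if total == 0:
        return 0.0  # no predictions, so no measurable gap
    return sum(count / total * abs(conf - rate) for conf, rate, count in bins)


# Toy data (made up): labels are 1 for toxic, 0 for non-toxic.
calibrated = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
print(calibrated, overconfident)
```

Each tuple from `reliability_bins` is one point on a reliability diagram: the closer the observed rate is to the mean predicted probability, the closer that point sits to the diagonal.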

AUROC on familiar chemistry: 0.842 (model ranks compounds correctly)

AUROC on novel chemistry: 0.615 (model barely outperforms chance)

The degradation is not specific to one model or one dataset. It appears in every well-powered experimental condition we tested, which suggests it is a property of the prediction task itself rather than an artifact of any single experimental choice.

Discrimination degrades with chemical novelty (ChEMBL, D-MPNN)

Tanimoto similarity to nearest training compound	AUROC ↑	Brier score ↓	n
> 0.7 (most familiar)	0.857	0.136	146
0.5 – 0.7	0.795	0.168	222
0.3 – 0.5	0.697	0.227	344
< 0.3 (most novel)	0.692	0.275	217

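Cluster split, mean across 5 seeds. AUROC measures ranking quality (higher = better); Brier score measures probability accuracy (lower = better). Both degrade monotonically as compounds become more structurally novel, although AUROC nearly plateaus between the two most novel bins (0.697 vs 0.692) while the Brier score keeps worsening.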
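The two metrics in the table can be computed per similarity bin along the following lines. This is a plain-Python sketch for illustration only: the pairwise AUROC is quadratic in the number of compounds, the group labels and toy numbers are made up, and none of the function names come from the study's code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def metrics_by_group(
    groups: Sequence[str], probs: Sequence[float], labels: Sequence[int]
) -> Dict[str, Dict[str, float]]:
    """Compute AUROC, Brier score, and n separately for each group label."""
    grouped_probs: Dict[str, List[float]] = defaultdict(list)
    grouped_labels: Dict[str, List[int]] = defaultdict(list)
    for g, p, y in zip(groups, probs, labels):
        grouped_probs[g].append(p)
        grouped_labels[g].append(y)
    return {
        g: {
            "auroc": auroc(grouped_probs[g], grouped_labels[g]),
            "brier": brier(grouped_probs[g], grouped_labels[g]),
            "n": len(grouped_labels[g]),
        }
        for g in grouped_probs
    }


# Toy predictions (made up): two "familiar" and four "novel" test compounds.
groups = ["familiar", "familiar", "novel", "novel", "novel", "novel"]
probs = [0.9, 0.1, 0.6, 0.5, 0.4, 0.7]
labels = [1, 0, 1, 0, 1, 0]

result = metrics_by_group(groups, probs, labels)
for name, metrics in result.items():
    print(name, metrics)
```

Reporting both metrics per bin matters because they can diverge: AUROC depends only on the ordering of scores, so a model can keep a similar ranking quality while its probabilities drift further from the observed outcomes.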

The full study includes cross-model comparisons showing that the two architectures fail in complementary ways along this gradient, probability-collapse diagnostics under extreme distribution shift, and a downstream analysis of counterfactual lead optimisation under calibration-aware constraints.

For inquiries about the study, data, or code: h.sreedharan_sreedeth@unswalumni.com