(Be Cautious!) Bio-Foundation Models Are Not Yet Robust to Biologically Plausible Perturbations and ML Transformations
Abstract: Though biological foundation models (Bio-FMs) have delivered strong performance across biomedical tasks, their robustness to small but realistic perturbations remains underexplored. In this work, we ask: Are Bio-FMs robust enough for real-world use, and which perturbations compromise their reliability? Our pilot study suggests that, owing to subtle issues in biological data curation and common machine-learning (ML) processing choices, Bio-FMs are exposed to two complementary sources of perturbation: biologically plausible perturbations (capturing experimental corruptions and curation artifacts) and ML-induced transformations (capturing preprocessing, data augmentation, and embedding choices). Guided by this taxonomy, we design perturbation suites that mimic corruptions frequently encountered in biological experiments, and we systematically probe how transformations in the ML pipeline reshape model behavior. Across 2,128 experiments covering 11 state-of-the-art Bio-FMs and 7 bio-tasks, we show that most Bio-FMs are vulnerable to both biological perturbations and ML transformations, revealing underappreciated robustness gaps that can translate directly into deployment risk. Interestingly, subtle biological perturbations, which are often imperceptible to current measurement tools, can induce severe discrepancies in Bio-FM outputs and lead to critical failures, whereas cryo-EM models (e.g., CryoDRGN) remain surprisingly robust even under worst-case perturbations. Our study surfaces these critical failure modes for the first time and provides a principled perspective for evaluating the robustness of Bio-FMs.
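Illustrative sketch (not the authors' perturbation suite): the abstract's idea of probing a model with a biologically plausible perturbation can be made concrete with a small, self-contained example. It assumes a DNA sequence input, uses random base substitution as a stand-in for an experimental corruption, and replaces the Bio-FM with a toy k-mer count embedding. All function names here (`substitute_bases`, `kmer_embed`, `robustness_probe`) are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter

BASES = "ACGT"


def substitute_bases(seq, rate, rng):
    """Replace each base with a different base with probability `rate`.

    A crude stand-in for sequencing or curation errors (an assumption,
    not a perturbation defined in the paper).
    """
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def kmer_embed(seq, k=3):
    """Toy embedding: overlapping k-mer counts. A real Bio-FM would go here."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(u, v):
    """Cosine distance between two sparse count vectors stored as Counters."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def robustness_probe(seq, rates, embed=kmer_embed, seed=0):
    """Map each substitution rate to the embedding drift it causes."""
    rng = random.Random(seed)
    clean = embed(seq)
    return {
        rate: cosine_distance(clean, embed(substitute_bases(seq, rate, rng)))
        for rate in rates
    }


if __name__ == "__main__":
    sequence = "ACGTACGTTAGCCGATAGCTAGGCTTAACG"
    for rate, drift in robustness_probe(sequence, [0.0, 0.05, 0.2]).items():
        print(f"substitution rate {rate:.2f}: embedding drift {drift:.3f}")
```

Swapping `kmer_embed` for a real model's embedding function, and the substitution routine for a task-appropriate corruption, would give a drift-versus-perturbation curve of the kind a robustness evaluation might report; the metric and perturbation used in the paper itself may differ.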
Lay Summary: Bio-foundation models are increasingly used in drug discovery and biomedical research, but real biological data often contain small errors or noise. This paper tests whether 11 state-of-the-art bio-foundation models remain reliable under realistic biological perturbations and common ML processing changes. We find that many models suffer large performance drops from seemingly minor changes, showing that clean benchmark accuracy is not enough. Our results highlight the need to evaluate and train bio-foundation models for robustness before using them in real-world biomedical pipelines.
Primary Area: Applications
Keywords: bio-foundation models, trustworthy foundation models, robustness
Submission Number: 3677