BEYOND SEQUENCE-ONLY MODELS: LEVERAGING STRUCTURAL CONSTRAINTS FOR ANTIBIOTIC RESISTANCE PREDICTION IN SPARSE GENOMIC DATASETS

Published: 05 Mar 2025, Last Modified: 21 Apr 2025
Venue: MLGenX 2025 TinyPapers
License: CC BY 4.0
Track: Tiny paper track (up to 4 pages)
Abstract: To combat the rise of antibiotic-resistant $\textit{Mycobacterium tuberculosis}$, genotype-based diagnosis of resistance is critical, as it could substantially shorten time to treatment. However, machine learning efforts at genotype-based resistance prediction are hindered by limited sequence diversity and high redundancy in genomic datasets, which complicate model generalization. Here, we introduce a dataset of $\textit{M. tuberculosis}$ sequences for nine key resistance-associated genes and their corresponding resistance phenotypes, with genotype de-duplication applied to mitigate the effects of data leakage. We also propose a Fused Ridge approach that moves beyond sequence-only prediction by incorporating protein structure constraints, and we compare it against baseline Ridge regression and zero-shot mutation effect prediction using ESM-2 embeddings. Our results show that Fused Ridge achieves the highest mean AUC (0.766), outperforming Ridge regression (0.755) and ESM-2-based log-likelihood ratio scoring (0.603). It also exhibits improved precision and recall in identifying resistance-conferring variants, particularly for genes such as $\textit{gyrA}$ and $\textit{rpoB}$, likely due to the strong association between the 3D location of mutations and resistance. The fusion penalty enforces smoothness in regression coefficients for spatially adjacent residues, which embeds biological knowledge into the predictive framework and improves generalization in sparse and highly redundant datasets.
Submission Number: 34
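
The fusion penalty described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one feature per residue position, a contact graph built by thresholding pairwise distances between residue coordinates (the 8 Å cutoff is an assumed value), and a squared-difference fusion term, which gives the objective ||y - Xb||^2 + lam_ridge ||b||^2 + lam_fuse * sum over contacts (b_i - b_j)^2. Under those assumptions the sum over contacts equals b^T L b for the graph Laplacian L, so the solution has a closed form. The function names `contact_laplacian` and `fused_ridge_fit` are hypothetical.

```python
import numpy as np


def contact_laplacian(coords, cutoff=8.0):
    """Graph Laplacian of a residue contact map.

    coords: (p, 3) array of per-residue 3D coordinates (e.g. C-alpha atoms).
    cutoff: distance threshold for calling two residues adjacent
            (8.0 is an assumed value, not taken from the paper).
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all residues.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dists < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-contacts
    # L = D - A, so that b^T L b = sum over contacts of (b_i - b_j)^2.
    return np.diag(adj.sum(axis=1)) - adj


def fused_ridge_fit(X, y, laplacian, lam_ridge=1.0, lam_fuse=1.0):
    """Closed-form minimizer of the assumed squared-fusion objective.

    X: (n, p) feature matrix, one column per residue position (assumption).
    y: (n,) resistance phenotype, regressed directly as a numeric target.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    # Setting the gradient to zero gives
    # (X^T X + lam_ridge * I + lam_fuse * L) b = X^T y.
    A = X.T @ X + lam_ridge * np.eye(p) + lam_fuse * laplacian
    return np.linalg.solve(A, X.T @ y)


# Toy usage: residues 0 and 1 are in contact, residue 2 is far away.
toy_coords = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [50.0, 0.0, 0.0]]
toy_L = contact_laplacian(toy_coords)
toy_X = np.eye(3)
toy_y = np.array([1.0, 0.0, 0.0])
plain = fused_ridge_fit(toy_X, toy_y, toy_L, lam_ridge=1.0, lam_fuse=0.0)
fused = fused_ridge_fit(toy_X, toy_y, toy_L, lam_ridge=1.0, lam_fuse=10.0)
```

With `lam_fuse=0` the fit reduces to ordinary Ridge regression. Raising `lam_fuse` pulls the coefficients of residues in contact toward each other, so evidence observed at one position is shared with its spatial neighbors, which is the mechanism the abstract credits for better generalization on sparse data.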