PANGENSA: A graph-constrained machine learning framework for identifying antibiotic resistance determinants in bacterial pangenomes
Abstract: Antimicrobial resistance is a major global health threat, driving the need for genome-based diagnostics. While bacterial genome-wide association studies aim to identify causal resistance determinants, they are often impaired by the high multicollinearity and underdetermined regimes inherent in pangenomes. Standard machine learning approaches often fail to satisfy the requirement that fine-mapping should yield localised genomic loci rather than correlation-determined groupings.
We introduce PANGENSA, a graph-constrained machine learning framework that uses the topology of compacted de Bruijn graphs as a label-agnostic structural prior. PANGENSA partitions the pangenome graph into discrete communities and trains independent models on each, ensuring that subproblems are not overparameterized and that results are spatially localised by construction.
We demonstrate that these communities exhibit high internal connectivity and low inter-community leakage. Top-ranking community-level classifiers achieved high AUROC ($\geq 0.86$) across 14 antibiotic phenotypes. Notably, PANGENSA localised known resistance signals for 11 of 14 antibiotics and recovered low-prevalence mechanisms that global baseline methods failed to detect. This demonstrates that encoding genomic locality as a structural prior can effectively amplify under-represented causal signals while controlling for population structure.
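The partition-then-train scheme described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the PANGENSA implementation: the toy graph and presence/absence matrix are invented, the function names are hypothetical, greedy modularity merging stands in for whatever community detection the authors use, and a simple count of present nodes stands in for a trained per-community classifier. The sketch only shows the general shape of the idea: communities are derived from graph topology without looking at labels, and each community is then scored independently.

```python
def greedy_modularity(nodes, edges):
    """Label-agnostic partition: repeatedly merge the two communities
    whose merge gives the largest positive modularity gain."""
    m = len(edges)
    comms = {v: {v} for v in nodes}  # community id -> member nodes
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    while True:
        owner = {v: c for c, members in comms.items() for v in members}
        between = {}  # (community a, community b) -> number of connecting edges
        for u, v in edges:
            a, b = sorted((owner[u], owner[v]))
            if a != b:
                between[(a, b)] = between.get((a, b), 0) + 1
        best_gain, best_pair = 0.0, None
        for (a, b), e_ab in sorted(between.items()):
            d_a = sum(degree[v] for v in comms[a])
            d_b = sum(degree[v] for v in comms[b])
            gain = e_ab / m - d_a * d_b / (2 * m * m)  # modularity change
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge improves modularity
            return sorted(sorted(c) for c in comms.values())
        a, b = best_pair
        comms[a] |= comms.pop(b)


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_communities(communities, X, y):
    """Score each community independently and rank by AUROC.
    Stand-in model (assumption): number of community nodes present."""
    ranked = []
    for members in communities:
        scores = [sum(row[j] for j in members) for row in X]
        ranked.append((members, auroc(scores, y)))
    return sorted(ranked, key=lambda t: t[1], reverse=True)


# Toy graph (invented): two triangles joined by a single edge.
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

# Toy presence/absence matrix (rows: genomes, columns: graph nodes)
# and binary resistance phenotype, both invented.
X = [
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
]
y = [1, 1, 1, 0, 0, 0]

communities = greedy_modularity(nodes, edges)
ranked = rank_communities(communities, X, y)
for members, score in ranked:
    print(members, round(score, 3))
```

In this toy setting the phenotype is driven by nodes in one triangle only, so that community ranks first while the other scores near chance. Because each model sees only the features of one community, any signal it reports is localised to that region of the graph by construction.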
Submission Number: 108