A Topologically Guided Machine Learning Framework for Enhanced Fine-Mapping in Whole-Genome Bacterial Studies
This paper proposes a feature selection framework for machine learning–based bacterial genome-wide association studies (GWAS) aimed at uncovering resistance-causing traits. Using a well-characterized Staphylococcus aureus pangenome as ground truth for causal-variant labels, we demonstrate improved control for population structure and enhanced interpretability. Both gains come from explicitly incorporating genomic context derived from graph-structured data, specifically the compacted de Bruijn graph of the assembled pangenome. The framework uncovers resistance-causing traits for 9 of 14 antibiotics using a significantly reduced feature set, while preserving the identifiability of genomic markers: each encoded feature maps uniquely to a sequence representation that tags a specific genomic locus.
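
To make the graph-derived features concrete, the following is a minimal sketch, not the implementation used in this work, of how a compacted de Bruijn graph yields unitig features that each map back to a unique sequence. The function names (`kmers`, `build_unitigs`, `presence_matrix`), the toy genomes, and the choice k = 4 are illustrative assumptions. The sketch also ignores reverse complements, which a practical pipeline would need to handle.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_unitigs(genomes, k):
    """Compact a node-centric de Bruijn graph into unitig sequences.

    Assumption: forward strand only (no reverse complements).
    """
    nodes = set()
    for seq in genomes.values():
        nodes.update(kmers(seq, k))

    def succ(u):
        return [u[1:] + c for c in "ACGT" if u[1:] + c in nodes]

    def pred(u):
        return [c + u[:-1] for c in "ACGT" if c + u[:-1] in nodes]

    def is_start(u):
        # u opens a unitig unless it simply continues a non-branching path.
        p = pred(u)
        return not (len(p) == 1 and len(succ(p[0])) == 1)

    def walk(u, seen):
        # Extend while the path has one way forward and one way back.
        path = [u]
        seen.add(u)
        while True:
            s = succ(path[-1])
            if len(s) != 1 or s[0] in seen or len(pred(s[0])) != 1:
                break
            path.append(s[0])
            seen.add(s[0])
        return path[0] + "".join(n[-1] for n in path[1:])

    seen, unitigs = set(), []
    for u in sorted(nodes):
        if u not in seen and is_start(u):
            unitigs.append(walk(u, seen))
    for u in sorted(nodes):  # leftover nodes lie on isolated cycles
        if u not in seen:
            unitigs.append(walk(u, seen))
    return sorted(unitigs)


def presence_matrix(genomes, unitigs):
    """One 0/1 row per genome; column j corresponds to unitigs[j]."""
    return {name: [int(u in seq) for u in unitigs]
            for name, seq in genomes.items()}


# Toy example: two hypothetical genomes differing by a single SNP.
genomes = {"g1": "ATGCCGTAGA", "g2": "ATGCAGTAGA"}
unitigs = build_unitigs(genomes, k=4)
matrix = presence_matrix(genomes, unitigs)
print(unitigs)  # ['ATGC', 'GTAGA', 'TGCAGTA', 'TGCCGTA']
print(matrix)   # {'g1': [1, 1, 0, 1], 'g2': [1, 1, 1, 0]}
```

In this toy case the 11 distinct 4-mers collapse into 4 unitig features. The two unitigs spanning the SNP separate the genomes, and each feature column remains tied to exactly one sequence that can be located in the genome.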