BacBench: Evaluating Genomic Language Models for Bacteria

ICLR 2026 Conference Submission20723 Authors

19 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | Readers: Everyone | CC BY 4.0
Keywords: bacteria, computational biology, benchmark, datasets, genomics, dna language model, protein language model, genomic language model, cell
TL;DR: The first comprehensive set of datasets and benchmarks across tasks and species for evaluating genomic language models for bacteria.
Abstract: Bacteria underpin key processes in health, ecology, and biotechnology, yet machine learning in bacterial genomics lacks systematic, large-scale evaluation resources. Current resources are typically limited to single-species datasets, where the small number of available genomes leaves species-specific models underpowered, underscoring the need for approaches that can generalize across the bacterial tree of life. To address this gap, we present BacBench, the first comprehensive benchmark for bacterial genomics. BacBench consists of 11 datasets across 6 tasks, including a newly generated dataset for operon identification derived from long-read RNA sequencing. BacBench covers gene-, system-, and genome-scale prediction tasks, spanning 67k genomes, 17.6k species, and 255M proteins. We analyze the performance of state-of-the-art DNA LMs, protein LMs, and bacterial LMs and find that, while each approach excels at a different scale, existing models fail to accurately predict bacterial phenotypes at the whole-genome level, hampering translation to high-impact applications such as antibiotic resistance and bioproduction. This highlights the need for methods that reason over the context of entire genomes, exploiting genomic synteny and transfer across species. We outline the key requirements for such models and release a standardized library for preprocessing, embedding, and evaluation, fostering the development of methods that accurately represent bacterial genomes and enabling reproducible comparison of diverse approaches under a unified framework. By providing the first comprehensive benchmark dedicated to bacterial genomics, BacBench lays the groundwork for developing machine learning models that truly exploit shared evolutionary patterns and generalize across the bacterial tree of life.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20723