Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: DNA, Genome Language Model, Inference-time, AI for Life Science
TL;DR: We present BaseMirror, an inference-time context expansion strategy that leverages the symmetry of DNA’s double strand, improving both generative and discriminative tasks.
Abstract: Genome language models (GLMs) have demonstrated exceptional capabilities in DNA sequence generation and understanding, yet their context-dependent performance is limited by the fixed length of input sequences. To address this limitation, we propose BaseMirror, an inference-time strategy that leverages the symmetry of DNA’s double strand to expand the effective context. Our method autoregressively generates tokens along the reverse direction of the reverse-complement strand of a given DNA sequence, then prepends their complementary bases to the original strand, thereby enriching the model’s effective receptive field. We demonstrate that BaseMirror consistently improves performance on both generative and discriminative tasks across the GENERator and Evo2 model families. For next-base prediction, progressively extending the input sequence yields consistent gains across input lengths, model sizes, and sampling strategies, with accuracy improvements of up to 4.6% over the original non-extended input. For variant effect prediction on BRCA1, BaseMirror improves the AUROC of zero-shot classification by up to 5.2%. Moreover, we uncover a scaling phenomenon in which performance increases monotonically with the length of the extended context. Our results highlight BaseMirror as a lightweight, robust, and scalable inference-time solution that is compatible with API-based GLM generation.
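The abstract's core procedure (reverse-complement the input, autoregressively extend that strand, then prepend the complements of the new bases to the original strand) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_next` is a hypothetical stand-in for sampling one base from any autoregressive GLM (the paper uses the GENERator and Evo2 families).

```python
# Hedged sketch of BaseMirror-style left-context extension.
# `generate_next` is a hypothetical callable standing in for one step
# of autoregressive GLM sampling; it is NOT part of the paper's code.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse-complement strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def base_mirror_extend(seq: str, n_new: int, generate_next) -> str:
    """Extend `seq` leftward by `n_new` bases via the mirrored strand.

    1. Take the reverse-complement strand of `seq`.
    2. Autoregressively generate `n_new` bases continuing that strand,
       which corresponds to moving leftward on the original strand.
    3. Reverse-complement the generated bases and prepend them to the
       original strand, enlarging its effective left context.
    """
    rc = reverse_complement(seq)
    for _ in range(n_new):
        rc += generate_next(rc)       # stand-in for GLM sampling
    generated = rc[len(seq):]         # the n_new newly generated bases
    return reverse_complement(generated) + seq
```

By construction, the reverse complement of the extended sequence equals the generated reverse-complement strand, so the two strands remain consistent base pairs, which is the double-strand symmetry the method exploits.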
Submission Number: 47