Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: DNA, Genome Language Model, Inference-time, AI for Life Science
TL;DR: We present BasePrompt, an inference-time context expansion strategy that leverages the symmetry of DNA’s double strand, facilitating both generative and discriminative tasks.
Abstract: Genome Language Models (GLMs) pre-trained on trillions of nucleotides already serve as strong zero-shot RNA fitness predictors, yet they cannot be steered toward a specific assay the way a language model is steered by a prompt.
We close this gap by letting GLMs prompt themselves.
Our method, BasePrompt, asks GLMs to propose short nucleic-acid prefixes and postfixes that maximally activate the fitness signal for a given sequence.
To overcome the causal, forward-only nature of most GLMs, we exploit reverse-complement symmetry and generate upstream as well as downstream prompts without ever updating weights or using labeled variants.
For zero-shot RNA fitness prediction on RNAGym, BasePrompt achieves a 6.0% relative improvement over the SOTA Evo2 7B model and 6.6%–16.4% over other GLMs, as measured by Spearman correlation.
Auxiliary DNA tasks show that the same prompting method compresses native-context information into shorter, model-aligned tokens, boosting pathogenicity classification and next-k-base prediction.
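The reverse-complement trick mentioned in the abstract can be illustrated with a minimal sketch: a causal, forward-only model can only extend a sequence to the right, but reverse-complementing the input, extending it, and reverse-complementing the extension back yields an upstream prefix. The function names below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of generating upstream context with a forward-only
# (causal) model via reverse-complement symmetry. `downstream_generate`
# is a hypothetical stand-in for any GLM's left-to-right extension API.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def upstream_via_rc(downstream_generate, seq: str, n_new: int) -> str:
    """Produce an upstream prefix for `seq` using a model that can
    only append bases: extend the reverse complement to the right,
    then map the newly generated bases back to the forward strand."""
    rc = reverse_complement(seq)
    extended = downstream_generate(rc, n_new)  # rc plus n_new new bases
    new_part = extended[len(rc):]
    return reverse_complement(new_part)        # reads as an upstream prefix
```

With a dummy generator that appends bases, `upstream_via_rc` returns those bases reverse-complemented, ready to prepend to the original sequence.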
Submission Number: 47