BasePrompt: Self-Prompting Genome Language Models for RNA Fitness Prediction

Jin Gao; Zirui Zeng; Zheling Tan; Junhao Shi; Dequan Wang

15 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: RNA fitness prediction, Prompting, Genome Language Model, DNA, Inference-time, Test-time

TL;DR: BasePrompt lets Genome Language Models (GLMs) prompt themselves for RNA fitness prediction. It uses reverse-complement symmetry to generate prompts without labeled data or weight updates, and improves zero-shot Spearman correlation on RNAGym by 6.0% relative to Evo2 7B.

Abstract: Genome Language Models (GLMs) pre-trained on trillions of nucleotides are already strong zero-shot RNA fitness predictors, yet they cannot be steered toward a specific assay the way a language model is steered by a prompt. We close this gap by letting GLMs prompt themselves. Our method, BasePrompt, asks GLMs to propose short nucleic-acid prefixes and postfixes that maximally activate the fitness signal for a given sequence. To overcome the causal, forward-only nature of most GLMs, we exploit reverse-complement symmetry and generate upstream as well as downstream prompts without ever updating weights or using labeled variants. For zero-shot RNA fitness prediction on RNAGym, BasePrompt achieves a 6.0% relative improvement over the state-of-the-art Evo2 7B model and 6.6% to 16.4% over other GLMs, as measured by Spearman correlation. Auxiliary DNA tasks show that the same prompting method compresses native-context information into shorter, model-aligned tokens, boosting pathogenicity classification and next-k-base prediction.
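
The reverse-complement step described in the abstract can be illustrated with a minimal sketch. It assumes a forward-only generator that can only extend a sequence downstream; to obtain an upstream prefix, the sequence is reverse-complemented, extended, and the extension is reverse-complemented back. The names `generate_prefix`, `generate_postfix`, and `toy_generate` are hypothetical, and `toy_generate` is a stand-in for a real GLM, not the authors' model. How BasePrompt selects prompts that maximally activate the fitness signal is not specified here, so the sketch does not cover that part. Sequences are written in the DNA alphabet (ACGT), which is also an assumption.

```python
# Hypothetical sketch: upstream and downstream prompts from a forward-only generator.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def generate_postfix(seq: str, generate_downstream, k: int) -> str:
    """Downstream prompt: let the generator continue the sequence directly."""
    return generate_downstream(seq, k)


def generate_prefix(seq: str, generate_downstream, k: int) -> str:
    """Upstream prompt: continue the reverse complement of the sequence,
    then map the continuation back to the original strand."""
    rc_continuation = generate_downstream(reverse_complement(seq), k)
    return reverse_complement(rc_continuation)


def toy_generate(context: str, k: int) -> str:
    """Stand-in for a causal GLM (assumption): repeats the last base k times."""
    return context[-1] * k


seq = "ACGGT"
prefix = generate_prefix(seq, toy_generate, 3)
postfix = generate_postfix(seq, toy_generate, 3)
prompted = prefix + seq + postfix
print(prefix, seq, postfix)  # AAA ACGGT TTT
```

In practice the toy generator would be replaced by sampling from a GLM, and the prompted sequence would then be scored by the same model; both steps are outside what this sketch shows.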

Supplementary Material: zip

Primary Area: other topics in machine learning (i.e., none of the above)

Submission Number: 6137
