Keywords: DNA Language Model; Reverse-Complement; Foundation Model
TL;DR: We propose a fine-tuning method that makes DNA language models consistent on reverse-complement-equivariant tasks.
Abstract: A fundamental property of DNA is that the \textbf{reverse complement (RC)} of a sequence often carries identical biological meaning.
However, state-of-the-art DNA language models frequently fail to capture this symmetry, producing inconsistent predictions for a sequence and its RC counterpart, which undermines their reliability.
In this work, we introduce Reverse-Complement Consistency Regularization (RCCR), a simple and model-agnostic fine-tuning objective that directly penalizes the divergence between a model's prediction on a sequence and the aligned prediction on its reverse complement.
We evaluate RCCR across three diverse backbones (Nucleotide Transformer, HyenaDNA, DNABERT-2) on a wide range of genomic tasks, including sequence classification, scalar regression, and profile prediction.
Our experiments show that RCCR substantially improves RC-robustness by dramatically reducing prediction flips and errors, all while maintaining or improving task accuracy compared to baselines like RC data augmentation and test-time averaging.
By integrating a key biological prior directly into the learning process, RCCR provides a simple, computationally efficient fine-tuning recipe that yields a single, intrinsically RC-robust model for diverse biology tasks.
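The abstract describes RCCR as a penalty on the divergence between a model's prediction on a sequence and its prediction on the reverse complement. A minimal sketch of that idea is below; the toy count-based model, the symmetric-KL choice of divergence, and the weight `lam` are illustrative assumptions for exposition, not the paper's actual objective or backbones.

```python
import math

# Illustrative sketch of Reverse-Complement Consistency Regularization (RCCR).
# The toy model, divergence choice, and names here are assumptions; the paper
# applies the penalty to real DNA LM backbones during fine-tuning.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string, e.g. 'AAT' -> 'ATT'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(seq: str):
    """Stand-in classifier head: logits from base counts (purely illustrative)."""
    return [float(seq.count(b)) for b in "ACGT"]

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence; one plausible consistency penalty."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))

def rccr_loss(task_loss: float, seq: str, lam: float = 1.0) -> float:
    """Total loss = task loss + lam * divergence between seq and its RC."""
    p = softmax(toy_model(seq))
    q = softmax(toy_model(reverse_complement(seq)))
    return task_loss + lam * symmetric_kl(p, q)
```

In this sketch the toy model is deliberately RC-sensitive (base counts swap A/T and C/G under RC), so the penalty is zero only when the two aligned predictions agree, which is the behavior the regularizer rewards.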
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 4942