Keywords: DNA Language Model; Reverse-Complement; Foundation Model
TL;DR: We propose a fine-tuning method that makes DNA language models consistent on reverse-complement-equivariant tasks.
Abstract: A fundamental property of DNA is that the \textbf{reverse complement (RC)} of a sequence often carries identical biological meaning.
However, state-of-the-art DNA language models frequently fail to capture this symmetry, producing inconsistent predictions for a sequence and its RC counterpart, which undermines their reliability.
In this work, we introduce Reverse-Complement Consistency Regularization (RCCR), a simple and model-agnostic fine-tuning objective that directly penalizes the divergence between a model's prediction on a sequence and the aligned prediction on its reverse complement.
We evaluate RCCR across three diverse backbones (Nucleotide Transformer, HyenaDNA, DNABERT-2) on a wide range of genomic tasks, including sequence classification, scalar regression, and profile prediction.
Our experiments show that RCCR substantially improves RC-robustness by dramatically reducing prediction flips and errors, all while maintaining or improving task accuracy compared to baselines like RC data augmentation and test-time averaging.
By integrating a key biological prior directly into the learning process, RCCR provides a simple, computationally efficient fine-tuning recipe that yields a single, intrinsically RC-robust model for diverse biology tasks.
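The abstract describes RCCR as a penalty on the divergence between a model's prediction on a sequence and its prediction on the reverse complement. A minimal sketch of that idea is below; the toy count-based model, the symmetric-KL choice of divergence, and the weight `lam` are illustrative assumptions for exposition, not the paper's actual objective or backbones.

```python
import math

# Illustrative sketch of Reverse-Complement Consistency Regularization (RCCR).
# The toy model, divergence choice, and names here are assumptions; the paper
# applies the penalty to real DNA LM backbones during fine-tuning.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string, e.g. 'AAT' -> 'ATT'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(seq: str):
    """Stand-in classifier head: logits from base counts (purely illustrative)."""
    return [float(seq.count(b)) for b in "ACGT"]

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence; one plausible consistency penalty."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))

def rccr_loss(task_loss: float, seq: str, lam: float = 1.0) -> float:
    """Total loss = task loss + lam * divergence between seq and its RC."""
    p = softmax(toy_model(seq))
    q = softmax(toy_model(reverse_complement(seq)))
    return task_loss + lam * symmetric_kl(p, q)
```

In this sketch the toy model is deliberately RC-sensitive (base counts swap A/T and C/G under RC), so the penalty is zero only when the two aligned predictions agree, which is the behavior the regularizer rewards.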
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 4942