Keywords: self-supervised learning, DNA language models, low-resource learning
TL;DR: We show that self-pretraining on task-specific genomic data improves downstream performance over strong supervised baselines.
Abstract: Pretraining DNA language models (DNALMs) on the full human genome is resource-intensive, yet often considered necessary for strong downstream performance. Inspired by recent findings in NLP and long-context modeling, we explore an alternative: self-pretraining on task-specific, unlabeled data. Using the BEND benchmark, we show that DNALMs trained with self-pretraining match or exceed the performance of models trained from scratch under identical compute. While genome-scale pretraining may still offer higher absolute performance, task-specific self-pretraining provides a practical and compute-efficient strategy for building stronger supervised baselines. We will release code, pretrained models, and fine-tuned models to support reproducibility.
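To make the self-pretraining recipe concrete, below is a minimal sketch of the idea: run a masked-language-model objective over the unlabeled task-specific sequences before supervised fine-tuning. It assumes a character-level DNA tokenizer and a small BERT-style encoder trained from scratch with Hugging Face `transformers`; the toy sequences, model sizes, and output path (`self_pretrained_dnalm`) are illustrative assumptions, not the paper's exact configuration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import (
    BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast, Trainer, TrainingArguments,
)

# Hypothetical toy sequences standing in for the unlabeled task-specific data
# (e.g. the regions underlying a BEND task); real data would be loaded from disk.
unlabeled_seqs = ["ACGTACGTTTGACA" * 10, "TTGACACGTNACGT" * 10] * 64

# Character-level vocabulary over the DNA alphabet plus the special tokens
# the MLM collator expects.
vocab = {t: i for i, t in enumerate(
    ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "A", "C", "G", "T", "N"])}
core = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
core.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core, pad_token="[PAD]", unk_token="[UNK]",
    cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]")

# Space-separate the bases so the word-level model sees one token per nucleotide.
encodings = [tokenizer(" ".join(s), truncation=True, max_length=512)
             for s in unlabeled_seqs]

# Small BERT-style encoder initialized from scratch; sizes are illustrative only.
config = BertConfig(vocab_size=len(vocab), hidden_size=128, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=256,
                    max_position_embeddings=512, pad_token_id=vocab["[PAD]"])
model = BertForMaskedLM(config)

# Self-pretraining: standard MLM objective with dynamic 15% masking over the
# task-specific sequences.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="self_pretrained_dnalm",
                         per_device_train_batch_size=8,
                         num_train_epochs=1, report_to=[])
Trainer(model=model, args=args, train_dataset=encodings,
        data_collator=collator).train()

# The self-pretrained encoder can then be fine-tuned on the labeled task,
# e.g. by loading these weights into a sequence- or token-classification head.
model.save_pretrained("self_pretrained_dnalm")
tokenizer.save_pretrained("self_pretrained_dnalm")
```

Because the corpus is only the task's own unlabeled sequences rather than the full genome, this stage fits in the same compute budget as the supervised baseline; the downstream fine-tuning step is unchanged apart from initializing from the self-pretrained checkpoint.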
Submission Number: 63