Keywords: DNA, Genomics, Health, Representation Learning
TL;DR: A genomic sequence model that unifies long-context reasoning and sequence interpretation with state-of-the-art performance.
Abstract: The interpretation of genomic sequences is crucial for understanding biological processes. To handle the growing volume of DNA sequence data, Genomic Foundation Models (GFMs) have been developed by adapting architectures and training paradigms from Large Language Models (LLMs). Despite their remarkable performance in DNA sequence classification tasks, there remains a lack of systematic understanding of the training and task-adaptation processes of GFMs. Moreover, existing GFMs cannot achieve state-of-the-art performance on both short- and long-context tasks, and they lack multimodal abilities. By revisiting pre-training architectures and post-training techniques, we propose **Omni-DNA**, a family of models spanning 20M to 1.1B parameters that supports sequence understanding, long-context genomic reasoning, and natural-language annotation. **Omni-DNA** establishes new state-of-the-art results on 18 of 26 evaluations drawn from Nucleotide Transformer and Genomic Benchmarks. When jointly fine-tuned on biologically related tasks, **Omni-DNA** consistently outperforms existing models and demonstrates multi-tasking abilities. To enable processing of arbitrary sequence lengths, we introduce **SEQPACK**, an adaptive compression operator that packs historical tokens into a learned synopsis using a position-aware learnable sampling mechanism, enabling transformer-based models to process ultra-long sequences with minimal memory and computational requirements. Our approach demonstrates superior performance on enhancer-target interaction tasks, capturing distant regulatory interactions at the 450 kbp range more effectively than existing models. Finally, we present a new dataset termed **seq2func**, enabling Omni-DNA to generate accurate and functionally meaningful interpretations of DNA sequences, unlocking new possibilities for genomic analysis and discovery.
Supplementary Material: zip
Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 16922