MSA-LM: Integrating DNA-level Inductive Biases into DNA Language Models

Published: 12 Oct 2024 · Last Modified: 15 Dec 2024 · AIM-FM Workshop @ NeurIPS'24 Poster · License: CC BY 4.0
Keywords: Genomics, DNA Modeling, Mutation Prediction, Multiple Sequence Alignments, Transformers, Subquadratic Models, Mamba
TL;DR: Our model combines a bidirectional scan of the main DNA sequence with efficient MSA augmentation, achieving state-of-the-art results on key Genomic Benchmarks tasks and on variant effect prediction while being more computationally efficient than previous methods.
Abstract: Recent advances in DNA language modeling have been limited by computational constraints and by the difficulty of effectively capturing long-range dependencies in genomic data. Traditional transformer-based models, while effective, suffer from quadratic complexity and limited context windows, making them unsuitable for large-scale DNA modeling. Subquadratic models, in contrast, are efficient but often lack bidirectionality and struggle with training scalability. We introduce MSA-LM, an inductive-bias-aware subquadratic DNA Multiple Sequence Alignment (MSA) model that addresses these limitations. MSA-LM uses a bidirectional Mamba model for sequence mixing, providing transformer-like expressivity without the associated quadratic complexity. Through a sparse attention mechanism, MSA-LM selectively processes the main DNA sequence while incorporating evolutionary information from the MSA, significantly reducing computational overhead. Our results show that MSA-LM achieves state-of-the-art performance on long-context variant effect prediction tasks and on Genomic Benchmarks, excelling in particular at regulatory sequence analysis. The model not only surpasses existing transformer-based and subquadratic approaches in efficiency but also maintains high accuracy across diverse genomic tasks, marking a significant improvement in DNA language modeling capabilities.
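To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of one plausible MSA-LM-style block: a bidirectional mixer over the main DNA sequence followed by cheap cross-attention into the MSA. All names here (MSALMBlock, mixer, msa_cols) are hypothetical; the bidirectional GRU is a runnable stand-in for the paper's bidirectional Mamba mixer, and the column-wise MSA pooling is one assumed way to sparsify attention over MSA rows, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MSALMBlock(nn.Module):
    """Sketch of one block: bidirectional mixing over the main DNA
    sequence, plus cross-attention into a pooled view of the MSA."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # Stand-in for the bidirectional Mamba mixer: a bidirectional GRU
        # projected back to d_model. If mamba_ssm is available, a Mamba
        # layer run over the forward and reversed sequence could replace it.
        self.mixer = nn.GRU(d_model, d_model, batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(2 * d_model, d_model)
        # Cross-attention: each main-sequence token attends only to a
        # per-column summary of the MSA, keeping cost linear in the
        # number of MSA rows (an assumed sparsification choice).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, msa: torch.Tensor) -> torch.Tensor:
        # x:   (batch, seq_len, d_model)          main DNA sequence embeddings
        # msa: (batch, n_rows, seq_len, d_model)  aligned homolog embeddings
        h, _ = self.mixer(self.norm1(x))
        x = x + self.proj(h)                      # bidirectional mixing
        msa_cols = msa.mean(dim=1)                # (batch, seq_len, d_model)
        a, _ = self.cross_attn(self.norm2(x), msa_cols, msa_cols)
        return x + a                              # evolutionary augmentation

# Usage: 2 sequences of 512 tokens with 8 aligned homologs each.
block = MSALMBlock(d_model=128)
x = torch.randn(2, 512, 128)
msa = torch.randn(2, 8, 512, 128)
out = block(x, msa)                               # (2, 512, 128)
```

The key design point the abstract emphasizes is that only the main sequence is processed at full resolution, while the MSA contributes a compressed evolutionary signal, which is what keeps the block subquadratic.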
Submission Number: 18