DNA Language Models for RNA Analyses

ICLR 2025 Conference Submission127 Authors

13 Sept 2024 (modified: 24 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Genomic Language Models, RNA Sequence Analysis, Parameter-Efficient Fine-Tuning, Mixture of Experts, Computational Efficiency
TL;DR: We propose CodonMoE, a versatile module that adapts DNA language models of diverse architectures for RNA analyses, reducing the burden of maintaining separate DNA and RNA models while significantly improving DNA model performance.
Abstract: Genomic Language Models (gLMs), encompassing DNA models, RNA models, and multimodal models, are becoming widely used for the analysis of biological sequences. Typically, models trained on RNA are used for RNA-related tasks, and models trained on DNA sequences are used for DNA tasks. However, this requires developing and maintaining several classes of models to match the modality of the sequence. These models take significant resources and data to create, and maintaining separate models for DNA and RNA tasks is a computational burden. To reduce this burden, we introduce the Adaptive Mixture of Codon Reformative Experts (CodonMoE), a novel module that can be incorporated into DNA gLMs to adapt them for mRNA-based predictive tasks. We show that, by using this plug-and-play operator, DNA-based gLMs can achieve performance similar to that of RNA-trained models on mRNA tasks. We further show that recent, efficient sub-quadratic DNA-based state space model (SSM) architectures can be combined with CodonMoE to achieve parameter- and computationally-efficient predictions for mRNA tasks. Specifically, experimental results demonstrate that CodonMoE substantially improves diverse DNA-based backbones, with some models achieving comparable or superior performance to current state-of-the-art RNA-specific models across several downstream tasks, while reducing both time complexity and model parameters. Our results provide a path for focusing gLM development efforts on DNA models, which can then be adapted to mRNA tasks. Because DNA data is more prevalent than assembled mRNA data, and modeling efforts can focus on a single class of model, this is likely to foster improved DNA models for mRNA tasks at lower computational cost, and is a significant step towards unifying genomic language modeling.
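To make the plug-and-play idea concrete, the sketch below shows one plausible way a codon-level mixture-of-experts adapter could sit on top of a DNA backbone's nucleotide embeddings: triplets of nucleotide embeddings are grouped into codons, a gating network produces per-codon mixture weights, and lightweight experts transform each codon representation. All names, the expert count, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodonMoE(nn.Module):
    """Hypothetical sketch of an adaptive mixture of codon-level experts
    applied on top of nucleotide embeddings from a DNA gLM backbone."""

    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        # Each expert is a small feed-forward network that sees the
        # concatenated embeddings of the three nucleotides in a codon.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(3 * d_model, d_model),
                nn.GELU(),
                nn.Linear(d_model, d_model),
            )
            for _ in range(n_experts)
        )
        # Gating network producing per-codon mixture weights over experts.
        self.gate = nn.Linear(3 * d_model, n_experts)

    def forward(self, nt_embeddings: torch.Tensor) -> torch.Tensor:
        # nt_embeddings: (batch, seq_len, d_model) nucleotide-level embeddings
        # from the DNA backbone; seq_len is assumed to be a multiple of 3.
        b, L, d = nt_embeddings.shape
        codons = nt_embeddings.view(b, L // 3, 3 * d)          # group triplets
        weights = F.softmax(self.gate(codons), dim=-1)          # (b, L/3, E)
        expert_out = torch.stack(
            [expert(codons) for expert in self.experts], dim=-2
        )                                                       # (b, L/3, E, d)
        # Weighted combination of expert outputs per codon position.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)  # (b, L/3, d)
```

Under these assumptions, the resulting codon-level representations could be pooled and passed to a task head for mRNA property prediction, leaving the DNA backbone itself unchanged.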
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 127