DNA Language Models for RNA Analyses

ICLR 2025 Conference Submission127 Authors

13 Sept 2024 (modified: 24 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Genomic Language Models, RNA Sequence Analysis, Parameter-Efficient Fine-Tuning, Mixture of Experts, Computational Efficiency
TL;DR: We propose CodonMoE, a versatile module that adapts DNA language models of diverse architectures for RNA analyses, reducing the burden of maintaining separate DNA and RNA models while significantly improving DNA model performance.
Abstract: Genomic Language Models (gLMs), encompassing DNA models, RNA models, and multimodal models, are becoming widely used for the analysis of biological sequences. Typically, models trained on RNA are used for RNA-related tasks, and models trained on DNA sequences are used for DNA tasks. However, this requires developing and maintaining several classes of models to match the modality of the sequence. These models take significant resources and data to create, and maintaining separate models for DNA and RNA tasks is a computational burden. To reduce this burden, we introduce the Adaptive Mixture of Codon Reformative Experts (CodonMoE), a novel module that can be incorporated into DNA gLMs to adapt them for mRNA-based predictive tasks. We show that, by using this plug-and-play operator, DNA-based gLMs can achieve performance similar to that of RNA-trained models on mRNA tasks. We further show that recent, efficient sub-quadratic DNA-based state space model (SSM) architectures can be combined with CodonMoE to achieve parameter- and computationally-efficient predictions for mRNA tasks. Specifically, experimental results demonstrate that CodonMoE substantially improves diverse DNA-based backbones, with some models achieving comparable or superior performance to current state-of-the-art RNA-specific models across several downstream tasks, while reducing both time complexity and model parameters. Our results provide a path for focusing gLM development efforts on DNA models, which can then be adapted to mRNA tasks. Because DNA data is more prevalent than assembled mRNA data, and modeling efforts can focus on a single class of model, this is likely to foster improved DNA models for mRNA tasks at lower computational cost, and is a significant step towards unifying genomic language modeling.
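To make the plug-and-play idea concrete, the sketch below shows one plausible way a codon-level mixture-of-experts adapter could sit on top of a DNA backbone's nucleotide embeddings: triplets of nucleotide embeddings are grouped into codons, a gating network produces per-codon mixture weights, and lightweight experts transform each codon representation. All names, the expert count, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodonMoE(nn.Module):
    """Hypothetical sketch of an adaptive mixture of codon-level experts
    applied on top of nucleotide embeddings from a DNA gLM backbone."""

    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        # Each expert is a small feed-forward network that sees the
        # concatenated embeddings of the three nucleotides in a codon.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(3 * d_model, d_model),
                nn.GELU(),
                nn.Linear(d_model, d_model),
            )
            for _ in range(n_experts)
        )
        # Gating network producing per-codon mixture weights over experts.
        self.gate = nn.Linear(3 * d_model, n_experts)

    def forward(self, nt_embeddings: torch.Tensor) -> torch.Tensor:
        # nt_embeddings: (batch, seq_len, d_model) nucleotide-level embeddings
        # from the DNA backbone; seq_len is assumed to be a multiple of 3.
        b, L, d = nt_embeddings.shape
        codons = nt_embeddings.view(b, L // 3, 3 * d)          # group triplets
        weights = F.softmax(self.gate(codons), dim=-1)          # (b, L/3, E)
        expert_out = torch.stack(
            [expert(codons) for expert in self.experts], dim=-2
        )                                                       # (b, L/3, E, d)
        # Weighted combination of expert outputs per codon position.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)  # (b, L/3, d)
```

Under these assumptions, the resulting codon-level representations could be pooled and passed to a task head for mRNA property prediction, leaving the DNA backbone itself unchanged.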
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 127