Recurrent memory augmentation of GENA-LM improves performance on long DNA sequence tasks

Yuri Kuratov; Aleksei Shmelev; Veniamin Fishman; Olga Kardymon; Mikhail Burtsev

Recurrent memory augmentation of GENA-LM improves performance on long DNA sequence tasks

Yuri Kuratov, Aleksei Shmelev, Veniamin Fishman, Olga Kardymon, Mikhail Burtsev

Published: 04 Mar 2024, Last Modified: 27 Apr 2024MLGenX 2024 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: DNA language models, long sequence processing, recurrent models, computational genomics

TL;DR: Integrating the Recurrent Memory Transformer with existing GENA-LM DNA language models enhances performance in processing long DNA sequences, offering promising prospects for advancing computational genomics.

Abstract: Utilizing DNA language models based on the transformer architecture represents a significant advancement in the field of computational genomics. However, these models face a critical challenge due to their inherent limitations in handling input lengths comparable to those of individual vertebrate genes (ranging from $10^4$ to $10^5$ nucleotides) and complete genomes (typically around $10^9$ nucleotides). Currently, the architecture with the longest sequence input among publicly available transformer-based DNA language models, GENA-LM, is constrained to a maximum input length of merely $3\cdot10^4$ nucleotides. In this study, we investigate the efficacy of the Recurrent Memory Transformer (RMT) in enhancing GENA-LM for multiple genomic analysis tasks that require processing long DNA sequence inputs. Our results demonstrate that augmenting GENA-LMs with RMT leads to a substantial enhancement in performance, particularly in tasks such as species classification and prediction of epigenetic features. This underscores the significance of the recurrent memory approach in advancing the field of computational genomics and its potential for addressing critical challenges associated with processing long sequence inputs.

Submission Number: 59

Loading