Keywords: DNA language models, long sequence processing, recurrent models, computational genomics
TL;DR: Integrating the Recurrent Memory Transformer with existing GENA-LM DNA language models enhances performance in processing long DNA sequences, offering promising prospects for advancing computational genomics.
Abstract: DNA language models based on the transformer architecture represent a significant advance in computational genomics. However, these models are inherently limited in handling inputs as long as individual vertebrate genes ($10^4$ to $10^5$ nucleotides) or complete genomes (typically around $10^9$ nucleotides). Among publicly available transformer-based DNA language models, GENA-LM supports the longest input, yet it is constrained to a maximum of $3\cdot10^4$ nucleotides. In this study, we investigate whether the Recurrent Memory Transformer (RMT) can enhance GENA-LM on multiple genomic analysis tasks that require processing long DNA sequence inputs. Our results demonstrate that augmenting GENA-LMs with RMT substantially improves performance, particularly in species classification and prediction of epigenetic features. This underscores the value of the recurrent memory approach for computational genomics and its potential for addressing the challenges of processing long sequence inputs.
Submission Number: 59