Keywords: large language model, foundation model, RNA language model, Adaptive Tokenization
Abstract: Recent advancements in Transformer-based language models have spurred interest in their use for biological sequence analysis. However, adapting models like BERT is challenging due to sequence length, often requiring truncation for proteomics and genomics tasks. Additionally, advanced tokenization and relative positional encoding techniques for long contexts in NLP are often not directly transferable to DNA/RNA sequences, which require nucleotide or character-level encodings for tasks such as 3D torsion angle prediction, distance map prediction or secondary structure prediction.
To tackle these challenges, we propose an adaptive dual tokenization scheme for bioinformatics that utilizes both nucleotide-level (NUC) and efficient byte-pair encoding (BPE) tokenizations. Building on the dual tokenization, we introduce BiRNA-BERT, a 117M parameter Transformer encoder pretrained with our proposed tokenization on 28 billion nucleotides across 36 million coding and non-coding RNA sequences.
The representation learned by BiRNA-BERT generalizes across a range of applications.
BiRNA-BERT achieves state-of-the-art results in long-sequence downstream tasks, performs comparably in short-sequence tasks, and matches the performance of models six times larger in parameter count on nucleotide-level structural prediction tasks, while requiring 27 times less pre-training compute. In addition, our empirical experiments and ablation studies demonstrate that NUC is often preferable to BPE for bioinformatics tasks, given sufficient VRAM. We further demonstrate the applicability of the dual-pretraining and adaptive tokenization strategy by applying it to a DNA language model, which performs comparably to DNA language models that require 66 times more compute.
BiRNA-BERT can dynamically adjust its tokenization strategy based on sequence lengths, utilizing NUC for shorter sequences and switching to BPE for longer ones, thereby offering, for the first time, the capability to efficiently handle arbitrarily long DNA/RNA sequences.
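The length-based switch described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token budget `MAX_TOKENS`, the toy vocabulary `TOY_MERGES`, and all function names are hypothetical, and a greedy longest-match segmenter stands in for a trained BPE tokenizer.

```python
from typing import List, Tuple

# Hypothetical context budget; the abstract does not state the real limit.
MAX_TOKENS = 8

# Hypothetical multi-nucleotide tokens standing in for a learned BPE vocabulary.
TOY_MERGES = ["AUGC", "GGA", "AU", "GC", "CU"]


def nuc_tokenize(seq: str) -> List[str]:
    """Nucleotide-level (NUC) tokenization: one token per character."""
    return list(seq)


def toy_bpe_tokenize(seq: str, merges: List[str] = TOY_MERGES) -> List[str]:
    """Greedy longest-match segmentation, a simplified stand-in for BPE."""
    ordered = sorted(merges, key=len, reverse=True)
    tokens: List[str] = []
    i = 0
    while i < len(seq):
        for piece in ordered:
            if seq.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # No multi-nucleotide token matches here: fall back to one character.
            tokens.append(seq[i])
            i += 1
    return tokens


def adaptive_tokenize(seq: str, max_tokens: int = MAX_TOKENS) -> Tuple[str, List[str]]:
    """Use NUC when the sequence fits the budget, otherwise switch to BPE."""
    if len(seq) <= max_tokens:
        return "NUC", nuc_tokenize(seq)
    return "BPE", toy_bpe_tokenize(seq)


if __name__ == "__main__":
    print(adaptive_tokenize("AUGC"))
    print(adaptive_tokenize("AUGCGGAAUGC"))
```

In this sketch the short sequence keeps single-nucleotide resolution, while the longer one is compressed into fewer, coarser tokens. The sketch does not handle the case where the BPE output still exceeds the budget.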
Supplementary Material: pdf
Submission Number: 32