Character-level Tokenizations as Powerful Inductive Biases for RNA Foundational Models

Published: 06 Mar 2025, Last Modified: 18 Apr 2025 · ICLR 2025 Workshop LMRL · CC BY 4.0
Track: Full Paper Track
Keywords: RNA Language Model, character-level tokenization, RNA Foundation Model
TL;DR: We train a character-level learnable tokenizer jointly with a BERT architecture, yielding a sample-efficient and competitive model of RNA sequences.
Abstract: RNA plays a critical role in cellular functions and is increasingly targeted for therapeutics, yet its structural complexity poses challenges for computational modeling. While foundational models have transformed protein representation learning, achieving similar success for RNA remains elusive. We introduce ChaRNABERT, a suite of sample- and parameter-efficient RNA foundational models that leverage a learnable tokenization process to achieve superior performance across established benchmarks. We further validate its capabilities on downstream tasks, including RNA-protein and aptamer-protein interaction prediction. The ChaRNABERT-8M model, along with inference code, will be publicly available for academic research, with additional models provided upon request.
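To make the "learnable tokenization" idea concrete, below is a minimal sketch of a character-level tokenizer trained end-to-end with a small BERT-style encoder. All names, layer sizes, and the softmax block-pooling scheme are illustrative assumptions, not the ChaRNABERT implementation; the paper should be consulted for the actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the model ingests raw nucleotide characters and learns
# soft "token" groupings jointly with a BERT-style masked-LM encoder.
# Vocabulary, block sizes, and dimensions are illustrative, not the authors'.

RNA_VOCAB = {ch: i for i, ch in enumerate("ACGUN")}  # nucleotides + unknown


class LearnableCharTokenizer(nn.Module):
    """Embeds characters, pools candidate n-gram blocks of several widths,
    and mixes them with learned softmax weights, so 'tokenization' is a
    differentiable part of the model rather than a fixed preprocessing step."""

    def __init__(self, vocab_size=len(RNA_VOCAB), d_model=128, max_block=4):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        # One pooling convolution per candidate block width 1..max_block.
        self.pools = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size=b, padding=b // 2)
            for b in range(1, max_block + 1)
        )
        self.score = nn.Linear(d_model, 1)  # scores each block candidate

    def forward(self, char_ids):                      # (B, L)
        x = self.char_emb(char_ids).transpose(1, 2)   # (B, d, L)
        blocks = torch.stack(
            [p(x)[..., : char_ids.size(1)].transpose(1, 2) for p in self.pools],
            dim=2,
        )                                             # (B, L, n_blocks, d)
        weights = torch.softmax(self.score(blocks), dim=2)
        return (weights * blocks).sum(dim=2)          # (B, L, d)


class CharRNABert(nn.Module):
    """Learnable char tokenizer feeding a Transformer encoder with an MLM head."""

    def __init__(self, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.tokenizer = LearnableCharTokenizer(d_model=d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, len(RNA_VOCAB))

    def forward(self, char_ids):
        return self.mlm_head(self.encoder(self.tokenizer(char_ids)))


if __name__ == "__main__":
    seq = "AUGGCUACGUA"
    ids = torch.tensor([[RNA_VOCAB[c] for c in seq]])
    logits = CharRNABert()(ids)
    print(logits.shape)  # (1, sequence length, vocab size)
```

Because the block-mixing weights are produced by a differentiable softmax, the tokenizer receives gradients from the masked-LM loss and can adapt its effective segmentation to RNA, which is the kind of inductive bias the title refers to.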
Attendance: Adrián Morales-Pastor
Submission Number: 62