Keywords: RNA Secondary Structure Prediction
Abstract: Accurate prediction of RNA secondary structure is a fundamental yet challenging task in computational biology, crucial for deciphering RNA's functional capabilities.
While recent deep learning methods show promise, they are often limited by their failure to explicitly integrate structural information during pre-training and by the scale of their models and datasets. Here, we introduce the Secondary Structure-Aware RNA Language Model (SSR-LM), a 650M-parameter language model pre-trained on 1.1 billion RNA sequences.
A key innovation is our Secondary Structure-Aware Span Masking (SSM) pre-training task, which explicitly integrates structural motifs into the model.
To address the lack of a comprehensive benchmark for evaluating model performance on real structures, we construct a new, large-scale PDB-derived RNA secondary structure dataset, three times larger than existing ones.
Comprehensive evaluation demonstrates that SSR-LM achieves state-of-the-art performance, attaining an F1-score of $0.741$ on our new PDB benchmark and $0.635$ on the CASP16 blind test set.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 2924