Improving RNA Secondary Structure Prediction Through Expanded Training Data

Conner J. Langeberg; Taehan Kim; Roma Nagle; Charlotte Meredith; Dimple Amitha Garuadapuri; Jennifer Doudna; Jamie H. D. Cate

Published: 24 Sept 2025, Last Modified: 26 Dec 2025
Venue: NeurIPS2025-AI4Science Poster
License: CC BY 4.0

Track: Track 1: Original Research/Position/Education/Attention Track

Keywords: RNA secondary structure, RNA database, secondary structure prediction, machine learning

Abstract: In recent years, deep learning has revolutionized protein structure prediction, achieving remarkable speed and accuracy. RNA structure prediction, however, has lagged behind. Although several methods have shown some success in predicting RNA secondary and tertiary structures, none has reached the accuracy of contemporary protein models. This shortfall has been attributed to the limited amount of high-quality structural information available as training data. To probe this proposed limitation, we developed RNASSTR, a large and diverse dataset of RNA sequences paired with their corresponding secondary structures. We assess the utility of this dataset by retraining a deep learning model, SincFold, on it. The retrained SincFold shows improved generalization to some previously unseen RNA families, enhancing its ability to predict accurate de novo RNA secondary structures. The RNASSTR dataset provides a substantial advance for RNA structure modeling, laying a strong foundation for the development of future RNA secondary structure prediction algorithms.
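
For readers unfamiliar with this kind of data, the following is a minimal sketch of what a sequence paired with a secondary structure can look like, and of one common way to score a predicted structure against a reference. It assumes the widely used dot-bracket notation (without pseudoknots) and a base-pair F1 score. The abstract states neither the storage format of RNASSTR nor the metric used to evaluate SincFold, so both choices, along with the function names and the toy hairpin, are illustrative assumptions and not the authors' method.

```python
def pairs_from_dot_bracket(structure):
    """Return the set of base pairs (i, j), 0-indexed, encoded by a
    dot-bracket string. Assumption: only '(', ')' and '.' are used,
    i.e. no pseudoknots."""
    stack = []
    pairs = set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)  # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.add((stack.pop(), i))  # pair with the most recent '('
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def pair_f1(reference, predicted):
    """F1 score over base pairs: harmonic mean of precision and recall."""
    ref = pairs_from_dot_bracket(reference)
    pred = pairs_from_dot_bracket(predicted)
    if not ref and not pred:
        return 1.0  # both structures are fully unpaired
    true_pos = len(ref & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical, not taken from the dataset): a 10-nt hairpin.
sequence = "GGGAAAUCCC"
reference = "(((....)))"   # three base pairs closing a four-nucleotide loop
predicted = "((......))"   # a prediction that misses the innermost pair

assert len(sequence) == len(reference) == len(predicted)
print(sorted(pairs_from_dot_bracket(reference)))
print(round(pair_f1(reference, predicted), 3))
```

In this toy case the prediction recovers two of the three reference pairs with no false positives, so precision is 1, recall is 2/3, and F1 is 0.8.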

Submission Number: 226
