Data Augmentation via Genomic Foundation Models for Pseudoknot-Inclusive RNA Secondary Structure Prediction
Keywords: Genomic Foundation Models, Data Augmentation, RNA, Secondary Structure Prediction, Pseudoknot
TL;DR: The paper presents a novel data augmentation technique using masked language modelling and uncertainty quantification with genomic foundation models for improved RNA pseudoknot prediction, achieving state-of-the-art performance.
Abstract: Rapid advances in genomic foundation models (GFMs) have delivered a series of breakthroughs across a diverse set of RNA tasks; however, RNA secondary structure prediction (SSP) remains a pivotal challenge in computational biology. Despite breakthroughs in pseudoknot-free SSP, where state-of-the-art models achieve above 80% macro-F1, performance on the pseudoknot-inclusive problem remains stagnant, with previous methods achieving below 50% macro-F1 on all three of our test sets. This is due to several challenges: a vast search space that limits heuristic performance, a severe class imbalance that limits standard classification methods, and an inherent lack of data that limits deep learning methods. Acquiring further data is impractical because it requires extensive biological resources and is costly.
In this work, we propose a data augmentation technique designed specifically for the pseudoknot-inclusive SSP problem. Our method leverages masked language modelling (MLM) with a surrogate model to produce accurate and useful data augmentations, and uses uncertainty quantification strategies to identify areas within the dataset where augmentation is most effective, thereby helping to mitigate the class imbalance problem and improving the generalisability of the models. We further extend three GFMs and fine-tune them on the augmented datasets to demonstrate the efficacy and high performance of the models.
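The augmentation idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: `toy_surrogate` is a hypothetical stand-in for a GFM's masked-token head, the 15% mask fraction is an assumed default, and ranking sequences by high mean entropy is only one possible reading of "uncertainty quantification". The sketch touches sequences only and does not address how structure labels are handled.

```python
import math
import random

NUCLEOTIDES = "ACGU"


def toy_surrogate(sequence, position):
    """Hypothetical surrogate: distribution over nucleotides at a masked position.

    A real pipeline would query a genomic foundation model here. This stand-in
    uses add-one-smoothed nucleotide counts from a +/-2 window around the position.
    """
    window = sequence[max(0, position - 2):position] + sequence[position + 1:position + 3]
    counts = {n: 1.0 for n in NUCLEOTIDES}
    for base in window:
        if base in counts:
            counts[base] += 1.0
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}


def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def mean_uncertainty(sequence):
    """Average per-position predictive entropy under the surrogate."""
    return sum(entropy(toy_surrogate(sequence, i)) for i in range(len(sequence))) / len(sequence)


def select_for_augmentation(sequences, top_k):
    """Assumption: augment the sequences the surrogate is least certain about."""
    return sorted(sequences, key=mean_uncertainty, reverse=True)[:top_k]


def augment(sequence, mask_fraction=0.15, rng=None):
    """Mask a fraction of positions and resample each from the surrogate."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(mask_fraction * len(sequence)))
    positions = rng.sample(range(len(sequence)), n_mask)
    out = list(sequence)
    for pos in positions:
        dist = toy_surrogate(sequence, pos)
        out[pos] = rng.choices(list(dist), weights=list(dist.values()))[0]
    return "".join(out)


if __name__ == "__main__":
    pool = ["AAAAAAAA", "ACGUACGU", "GGGAAACCC"]
    for seq in select_for_augmentation(pool, top_k=2):
        print(seq, "->", augment(seq))
```

Resampling from the surrogate's distribution, rather than substituting uniformly at random, is what keeps the augmented sequences close to what the model considers plausible; the selection step controls where that augmentation budget is spent.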
Notably, the newly extended and augmented models achieve state-of-the-art performance, with over 89% F1 on RNAStrAlign and over 66% F1 on the bpRNA test sets. We therefore highlight the effectiveness of data augmentation for genomic data, and release our code and datasets to assist future researchers.
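To see why the abstract reports macro-F1 and stresses class imbalance, the following sketch computes macro-F1 over per-nucleotide labels. The dot-bracket label set (with `[` and `]` marking pseudoknot pairs), the example structures, and the convention of scoring an absent class as 0 are illustrative assumptions, not details taken from the paper.

```python
def macro_f1(true_labels, pred_labels, classes):
    """Unweighted mean of per-class F1 scores over aligned label sequences."""
    scores = []
    for c in classes:
        pairs = list(zip(true_labels, pred_labels))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted instances scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical example: the prediction recovers the nested helix
# but labels every pseudoknot position as unpaired.
true_structure = "((..[[..))..]]"
pred_structure = "((......))...."

accuracy = sum(t == p for t, p in zip(true_structure, pred_structure)) / len(true_structure)
score = macro_f1(true_structure, pred_structure, classes=".()[]")

print(f"accuracy = {accuracy:.3f}")  # 10 of 14 positions correct
print(f"macro-F1 = {score:.3f}")     # (0.75 + 1 + 1 + 0 + 0) / 5 = 0.550
```

Because the pseudoknot classes are rare, a model that never predicts them still scores well on accuracy, while macro-F1 penalises each missed class equally.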
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12421