RNA Generative Modeling With Tree Search

Published: 01 Jan 2024, Last Modified: 30 Dec 2024, CIBCB 2024, CC BY-SA 4.0
Abstract: Ribonucleic acid (RNA) molecules play key roles in many biological processes. The ability to generate RNA sequences that fold into a given target structure while satisfying constraints such as minimum free energy (MFE) and GC content is important in applications such as synthetic biology and drug design. In this work, we propose a novel method for designing RNA sequences that satisfy these constraints. Our method uses an RNA-based Language Model (LLM) to generate RNA sequences and guides the generation process with Monte Carlo Tree Search (MCTS). The MCTS ensures that the LLM's generation process leads to a valid RNA sequence that folds into the target structure. Instead of performing random rollouts during the simulation phase of the MCTS, we sample the next RNA sequence from the RNA-based LLM. By design, our method can control LLM issues such as hallucinations, where the generated sequences are inconsistent with the training data, and the generation of invalid sequences, because only rollouts that lead to a correct RNA design are considered. We show that our method, without user-defined RNA design rules, generates valid RNA sequences that fold into the target structure, outperforming existing models on 50% of the test datasets and recording competitive results on the remaining test datasets.
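To make the search loop described in the abstract concrete, the following is a minimal sketch of MCTS in which the simulation phase draws from a sequence model rather than a uniform random rollout. It is not the authors' implementation. Everything in it is an assumption made for illustration: `toy_lm_sample` is a hypothetical stand-in for the trained RNA language model, `reward` is a toy proxy that only checks whether paired positions in the dot-bracket target hold canonical or wobble base pairs (a real pipeline would call a folding tool and also score MFE and GC content), and the UCT selection rule and its constant are generic choices.

```python
import math
import random

BASES = "ACGU"
# Watson-Crick pairs plus the G-U wobble pair.
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pair_table(structure):
    """Map each paired position to its partner, given dot-bracket notation."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def toy_lm_sample(prefix, rng):
    """Hypothetical stand-in for the RNA language model: sample the next base."""
    return rng.choices(BASES, weights=[0.2, 0.3, 0.3, 0.2])[0]


def reward(seq, partner):
    """Toy folding proxy (assumption): fraction of target pairs that can pair."""
    pairs = [(i, j) for i, j in partner.items() if i < j]
    if not pairs:
        return 1.0
    return sum((seq[i], seq[j]) in VALID_PAIRS for i, j in pairs) / len(pairs)


class Node:
    def __init__(self, prefix, parent=None):
        self.prefix, self.parent = prefix, parent
        self.children, self.visits, self.value = {}, 0, 0.0


def select_child(node, c=1.4):
    """Pick the child with the highest UCT score."""
    return max(
        node.children.values(),
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def mcts_design(structure, iterations=500, seed=0):
    rng = random.Random(seed)
    partner, n = pair_table(structure), len(structure)
    root, best_seq, best_r = Node(""), None, -1.0
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and non-terminal.
        while len(node.prefix) < n and len(node.children) == len(BASES):
            node = select_child(node)
        # Expansion: add one untried base as a new child.
        if len(node.prefix) < n:
            base = rng.choice([b for b in BASES if b not in node.children])
            node.children[base] = Node(node.prefix + base, node)
            node = node.children[base]
        # Simulation: complete the sequence by sampling from the model.
        seq = node.prefix
        while len(seq) < n:
            seq += toy_lm_sample(seq, rng)
        r = reward(seq, partner)
        if r > best_r:
            best_seq, best_r = seq, r
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return best_seq, best_r
```

Calling `mcts_design("((...))")` returns the best complete sequence seen and its toy reward. Swapping `toy_lm_sample` for a real model's next-token sampler and `reward` for a structure-prediction score would bring the sketch closer to the setting the abstract describes.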