Track: Special track (up to 8 pages)
Abstract: RNA editing is a critical regulatory process that diversifies the transcriptome by altering nucleotide sequences in messenger RNA molecules. We propose a novel framework for predicting adenosine-to-inosine (A-to-I) RNA editing sites that combines a fine-tuned GPT-4o-mini model with a tissue-specific liver dataset. Because ADAR1 is highly expressed in liver tissue, restricting the data to liver lets us avoid confounding factors from other ADAR isoforms and from complex multi-tissue data. We categorize editing levels by progressively higher thresholds (1%, 5%, 10%, and 15%) and introduce continual fine-tuning (CFT) to guide the model step by step from low-editing (1%) to high-editing (15%) scenarios. Compared to static fine-tuning (SFT) on a single threshold, our multi-stage method incrementally refines the model's ability to distinguish editing features and outperforms base GPT-3.5/4o-mini models across various configurations. We further show that employing strict, non-overlapping threshold bins yields clearer distinctions between edited and non-edited sites and consequently improves performance. In contrast, reducing the distinction between edited and non-edited classes significantly degrades classification accuracy. These findings underscore the importance of biologically appropriate data partitioning and continual, threshold-based fine-tuning in enhancing the predictive power of generative language models for RNA editing. Our study paves the way for future work on building more nuanced models that incorporate tissue-specific constraints, ultimately broadening the applicability of generative AI in post-transcriptional regulation analysis.
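As an illustration of the threshold-based partitioning described in the abstract, the following minimal sketch assigns each site to a non-overlapping bin by its editing level and orders the bins from 1% to 15%, as a continual fine-tuning schedule might consume them. Only the four threshold values come from the abstract; the function names, the input format (site identifier paired with an editing level in percent), and the choice of half-open bins are assumptions made for illustration and may differ from the authors' pipeline.

```python
from bisect import bisect_right

# Thresholds (in percent) taken from the abstract; must be sorted ascending.
THRESHOLDS = (1.0, 5.0, 10.0, 15.0)


def assign_bin(level_pct, thresholds=THRESHOLDS):
    """Return the highest threshold that level_pct meets, or None if below all.

    Hypothetical helper: bins are half-open, e.g. [1, 5), [5, 10), [10, 15),
    and [15, inf), so every site falls into at most one bin.
    """
    idx = bisect_right(thresholds, level_pct)
    return None if idx == 0 else thresholds[idx - 1]


def curriculum_stages(sites, thresholds=THRESHOLDS):
    """Group (site_id, level_pct) pairs into stages ordered from low to high.

    Hypothetical helper: sites below the lowest threshold are left out here;
    how the non-edited class is built is not specified in the abstract.
    """
    stages = {t: [] for t in thresholds}
    for site_id, level_pct in sites:
        b = assign_bin(level_pct, thresholds)
        if b is not None:
            stages[b].append(site_id)
    return [(t, stages[t]) for t in thresholds]
```

Under these assumptions, a fine-tuning loop would iterate over the list returned by `curriculum_stages` in order, so that the model sees the 1% stage first and the 15% stage last.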
The sources of this work are available at our repository: https://zenodo.org/records/14873200.
Submission Number: 46