Enhancing Splice Site Detection Through Deep Learning and Retrieval-Augmented Generation

NeurIPS 2025 Workshop FM4LS Submission 73 Authors

06 Sept 2025 (modified: 18 Nov 2025). Submitted to NeurIPS 2025 2nd Workshop FM4LS. License: CC BY 4.0

Keywords: RNA splicing, splice-site prediction, genomics, interpretable deep learning, retrieval-augmented generation (RAG), large language models (LLM), convolutional neural networks (CNN), NdLinear (N-Dimensional Linear layers), k-mer features, GTEx, Ensembl, grounding evaluation, biomedical NLP, variant analysis, clinical genomics

TL;DR: Hybrid CNN+NdLinear model predicts donor/acceptor splice sites and pairs predictions with RAG (GTEx/Ensembl) and an LLM to produce grounded, traceable explanations. Delivers competitive accuracy with plausible motifs and linked evidence.

Abstract: In functional genomics, there is a growing need for predictive models that are not only accurate but also interpretable, especially for tasks like splice site classification, where tissue-specific expression, motif patterns, and regulatory context all influence biological function. We propose a modular architecture that combines deep neural networks, including N-Dimensional Linear (NdLinear) layers, for splice site prediction with retrieval-augmented generation (RAG) to surface tissue- and gene-level biological context, followed by explanation generation using a large language model. Our method unifies sequence-based modeling and biological retrieval into a single pipeline that predicts splice site labels and generates human-readable explanations grounded in gene function, tissue expression, and regulatory context. While prior models focus on predictive performance, our work uniquely combines biological retrieval and language-based reasoning to address the critical gap of interpretability in splicing analysis. By making splice site predictions interpretable, our system enables downstream applications in variant analysis, transcriptomics, and clinical genomics. It connects machine learning and NLP with biological challenges, advancing interpretable AI for biomedical discovery.
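
The flow the abstract describes (predict a label, retrieve context, assemble an explanation prompt) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and not the submission's implementation: `classify_site` is a stand-in that applies only the canonical GT/AG dinucleotide rule in place of the CNN+NdLinear model, `TOY_KB` is a hypothetical in-memory store with placeholder text in place of GTEx/Ensembl retrieval, and the prompt is simply returned as a string with no LLM call.

```python
from collections import Counter
from dataclasses import dataclass


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def classify_site(window: str, center: int) -> str:
    """Stand-in for the learned classifier (assumption, not the paper's model).

    Applies only the canonical dinucleotide rule: GT starting at the boundary
    suggests a donor; AG ending at the boundary suggests an acceptor.
    """
    window = window.upper()
    if window[center:center + 2] == "GT":
        return "donor"
    if window[max(center - 2, 0):center] == "AG":
        return "acceptor"
    return "neither"


# Hypothetical knowledge store; a real system would query GTEx/Ensembl records.
TOY_KB = {
    "GENE_X": [
        "GENE_X: placeholder tissue-expression note",
        "GENE_X: placeholder gene-annotation note",
    ],
}


def retrieve(gene: str, kb: dict) -> list:
    """Return the evidence snippets stored for a gene (empty list if unknown)."""
    return kb.get(gene, [])


@dataclass
class Explanation:
    label: str
    evidence: list
    prompt: str


def explain(window: str, center: int, gene: str, kb: dict) -> Explanation:
    """Predict a label, gather evidence, and build a grounded prompt string."""
    label = classify_site(window, center)
    top_kmers = [kmer for kmer, _ in kmer_counts(window).most_common(3)]
    evidence = retrieve(gene, kb)
    numbered = "\n".join(f"[{i + 1}] {e}" for i, e in enumerate(evidence))
    prompt = (
        f"Predicted splice site label: {label}\n"
        f"Frequent 3-mers in window: {', '.join(top_kmers)}\n"
        f"Evidence:\n{numbered}\n"
        "Explain the prediction, citing evidence by number."
    )
    return Explanation(label, evidence, prompt)
```

Numbering the evidence snippets in the prompt is one simple way to keep a generated explanation traceable to what was retrieved; the submission may handle grounding differently.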

Submission Number: 73
