Keywords: RNA splicing, splice-site prediction, genomics, interpretable deep learning, retrieval-augmented generation (RAG), large language models (LLM), convolutional neural networks (CNN), NdLinear (N-Dimensional Linear layers), k-mer features, GTEx, Ensembl, grounding evaluation, biomedical NLP, variant analysis, clinical genomics
TL;DR: Hybrid CNN+NdLinear model predicts donor/acceptor splice sites and pairs predictions with RAG (GTEx/Ensembl) and an LLM to produce grounded, traceable explanations. Delivers competitive accuracy with plausible motifs and linked evidence.
Abstract: In functional genomics there is a growing need for predictive models that are not only
accurate but also interpretable, especially for tasks like splice site classification,
where tissue-specific expression, motif patterns, and regulatory context all influence
biological function. We propose a modular architecture that combines deep neural
networks—including N-Dimensional Linear layers—for splice site prediction with
retrieval-augmented generation (RAG) to surface tissue- and gene-level biological
context, followed by explanation generation using a large language model. Our
method unifies sequence-based modeling and biological retrieval into a coherent
pipeline that predicts splice site labels and generates human-readable explanations
grounded in gene function, tissue expression, and regulatory context. While prior
models focus on predictive performance, our work uniquely combines biological
retrieval and language-based reasoning to address the critical gap of interpretability
in splicing analysis. By making splice site predictions interpretable, our system
enables downstream applications in variant analysis, transcriptomics, and clinical
genomics. It bridges machine learning and NLP with biological challenges,
advancing interpretable AI for biomedical discovery.
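
The three stages named in the abstract (predict, retrieve, explain) can be sketched as a data flow. This is a minimal illustration under stated assumptions, not the authors' implementation: `classify_site` is a hypothetical stand-in for the CNN+NdLinear classifier that only checks the canonical GT/AG dinucleotides, `KNOWLEDGE_BASE` is a hypothetical in-memory substitute for GTEx/Ensembl retrieval holding placeholder values, and `build_prompt` only assembles the text that a language model would receive.

```python
# Hypothetical stand-in for records retrieved from GTEx/Ensembl.
# The gene name and values are placeholders, not real annotations.
KNOWLEDGE_BASE = {
    "GENE_A": {
        "function": "placeholder gene description",
        "top_tissue": "placeholder tissue",
    },
}


def classify_site(window: str) -> str:
    """Placeholder classifier (NOT the paper's CNN+NdLinear model).

    Looks only at the canonical dinucleotides around the window center:
    GT starting at the center suggests a donor site, AG ending at the
    center suggests an acceptor site.
    """
    w = window.upper()
    mid = len(w) // 2
    if w[mid:mid + 2] == "GT":
        return "donor"
    if w[mid - 2:mid] == "AG":
        return "acceptor"
    return "neither"


def retrieve_context(gene: str) -> dict:
    """Return the stored record for a gene, or an empty dict if unknown."""
    return KNOWLEDGE_BASE.get(gene, {})


def build_prompt(gene: str, window: str, label: str, context: dict) -> str:
    """Assemble a grounded prompt: each retrieved fact is listed with its
    field name so an explanation can be traced back to its evidence."""
    evidence = "\n".join(f"- {key}: {value}" for key, value in sorted(context.items()))
    if not evidence:
        evidence = "- (no retrieved evidence)"
    return (
        f"Gene: {gene}\n"
        f"Sequence window: {window}\n"
        f"Predicted label: {label}\n"
        f"Retrieved evidence:\n{evidence}\n"
        "Explain the prediction using only the evidence listed above."
    )


def explain(gene: str, window: str) -> str:
    """Run the three stages in order and return the prompt for the LLM."""
    label = classify_site(window)
    context = retrieve_context(gene)
    return build_prompt(gene, window, label, context)
```

Calling `explain("GENE_A", "CCCCCGTCCC")` yields a prompt whose predicted label is `donor` and whose evidence lines come from the placeholder record. Keeping retrieval separate from prompt assembly is one way to make each statement in the generated explanation traceable to a retrieved field.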
Submission Number: 73