RASRAG: A DOMAIN-SPECIFIC RAG FRAMEWORK AND BENCHMARK FOR ROBOTIC-ASSISTED SURGERY

ICLR 2026 Conference Submission21253 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Retrieval Augmented Generation, Robotic Assisted Surgery, Benchmark
TL;DR: RASRAG, an agentic Tree-RAG built from a hierarchical robotic-assisted surgery textbook and using RankLLaMA to jointly perform exploration and reranking, matches or outperforms state-of-the-art LLM/RAG baselines on precision and relevance
Abstract: Robot-assisted surgery (RAS) has significantly improved patient outcomes by reducing blood loss, shortening hospital stays, and accelerating recovery. Despite these benefits, the widespread adoption of RAS has been slowed by a shortage of trained robotic surgeons and limited access to robotic systems. A major limitation is access to academic materials and expertise in this domain, which are mostly confined to private company programs or a few textbooks. Meanwhile, foundation models and large language models (LLMs) have been shown to excel in both information retrieval and knowledge synthesis, yet none have been specifically adapted to the complexities of the RAS domain. To address this gap, we introduce RASRAG, a RankLLaMA-based Tree Retrieval-Augmented Generation framework that leverages a hierarchical structure derived from the source textbook. Our contributions are: (1) a novel tree-based RAG architecture in which RankLLaMA jointly performs agentic exploration and reranking along the hierarchy ("forest of knowledge"), yielding more relevant retrieval than embedding-only baselines, fine-tuned models, and alternative RAG methods; (2) a publicly available, first-of-its-kind question–answer benchmark curated by seven surgeons and two physicians, reflecting real-world RAS clinical inquiries; and (3) a clinically grounded evaluation protocol, including blind grading of both model and human answers by surgeons and RAG-specific measures of retrieval and answer quality. RASRAG, despite using significantly smaller models, matches or outperforms state-of-the-art LLMs, fine-tuned LLMs, and existing RAG architectures in precision and relevance on domain-specific tasks.
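The abstract's core mechanism — descending a textbook hierarchy while a reranker jointly scores which branches to explore and which retrieved passages to keep — can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` structure, the beam width, and the word-overlap `score` function (a stand-in for RankLLaMA's relevance scoring) are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node in the textbook hierarchy (book -> chapter -> section -> passage)."""
    text: str
    children: list = field(default_factory=list)

def score(query: str, passage: str) -> float:
    # Stand-in relevance scorer using word overlap; RASRAG would use
    # RankLLaMA here (assumed interface: query x passage -> scalar score).
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / (len(q) or 1)

def tree_retrieve(query: str, root: Node, beam: int = 2) -> list:
    """Descend the tree level by level, keeping the top-`beam` children of
    each expanded node; return reached leaf passages reranked by the same
    scorer, so exploration and reranking share one relevance model."""
    frontier, leaves = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:
                leaves.append(node)
            else:
                ranked = sorted(node.children,
                                key=lambda c: score(query, c.text),
                                reverse=True)
                next_frontier.extend(ranked[:beam])
        frontier = next_frontier
    return sorted(leaves, key=lambda n: score(query, n.text), reverse=True)
```

A usage example: building a two-chapter toy tree and querying "how to dock the patient cart" would surface the docking passage first, since the same scorer that pruned branches also orders the final leaves.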
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21253