TroubleRAG: Evaluating Retrieval Pipelines for Real-World Chemistry Troubleshooting

Published: 24 Sept 2025, Last Modified: 15 Oct 2025
Venue: NeurIPS 2025 AI4Science Poster
License: CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Retrieval-Augmented Generation (RAG), Chemistry, Benchmarking
Abstract: Troubleshooting complex laboratory instruments, such as those used in chromatography and mass spectrometry, presents a significant information retrieval challenge due to highly specific, technical documentation. Existing chemistry RAG benchmarks primarily target short, general-purpose Q/A tasks, whereas chemists need tools that can address open-ended laboratory troubleshooting questions in the context of scientific research. To bridge this gap, we introduce TroubleRAG, a comprehensive benchmark for evaluating Retrieval-Augmented Generation (RAG) pipelines in this domain. We first constructed a novel dataset of 113 high-quality troubleshooting scenarios, curated from synthetic data using LLM-based scoring and validation by expert chemists. Using TroubleRAG, we conduct an empirical analysis of key retrieval design choices, including sparse, dense, and hybrid fusion retrieval; HyDE query expansion; and advanced chunking strategies. Our key finding is that widely recommended “best-practice” RAG configurations do not transfer: they underperform on specialized troubleshooting tasks. Guided by this analysis, we introduce a domain-tailored retrieval recipe that yields significant improvements, boosting both Recall@5 and nDCG@5 by 8%. We also outline two extensions: (i) multimodal retrieval over the tables and figures that routinely appear in instrument manuals, and (ii) multi-turn, interactive systems that request clarifying details to better reflect human-in-the-loop workflows. TroubleRAG is designed to advance robust, domain-aware RAG methodologies for practical laboratory support.
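To make the evaluated design choices concrete, below is a minimal Python sketch of reciprocal rank fusion, one common way to combine sparse and dense rankings (the paper's exact fusion scheme may differ), together with standard binary-relevance definitions of the Recall@5 and nDCG@5 metrics reported in the abstract. All function and variable names are illustrative assumptions, not taken from the TroubleRAG code.

```python
import math

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked doc-id lists (e.g., BM25 and dense) via RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, relevant, k=5):
    """Binary-relevance nDCG: DCG of the top-k over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Example: fuse a sparse and a dense ranking, then score against gold labels.
sparse = ["d3", "d1", "d7", "d2", "d9"]
dense = ["d1", "d4", "d3", "d8", "d2"]
fused = reciprocal_rank_fusion([sparse, dense])
gold = {"d1", "d4"}
print(recall_at_k(fused, gold), ndcg_at_k(fused, gold))
```

Under these assumed definitions, a hybrid pipeline is scored exactly like a single retriever: fuse the candidate rankings first, then cut off at k=5 and compare against the gold passages.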
Submission Number: 397