Keywords: RAG, Law, Hallucination
Abstract: Retrieval-augmented generation (RAG) promises to bridge complex legal statutes and public understanding, yet hallucination remains a critical barrier in real-world use. Because statutes evolve and provisions frequently cross-reference, maintaining *temporal currency* and *citation awareness* is essential, favoring up-to-date sources over static parametric memory.
To study these issues, we focus on the under-examined domain of South Korean fire safety regulation—a complex web of fragmented legislation, dense cross-references, and vague decrees. We introduce **SearchFireSafety**, the first RAG-oriented question-answering (QA) resource for this domain. It includes: (i) 941 real-world, open-ended QA pairs from public inquiries (2023–2025); (ii) a corpus of 4,437 legal documents from 117 statutes with a citation graph; and (iii) synthetic single-hop (Yes/No) and multi-hop (MCQA) benchmarks targeting legal reasoning and uncertainty.
Experiments with five Korean-capable LLMs show that: (1) multilingual dense retrievers excel due to the domain's mix of Korean, English loanwords, and Sino-Korean terms (i.e., Chinese characters); (2) grounding LLMs with **SearchFireSafety** substantially improves factual accuracy; but (3) multi-hop reasoning still fails to resolve conflicting provisions or recognize informational gaps. Additionally, we find that (4) domain adaptation via continued pre-training improves accuracy but significantly degrades uncertainty awareness when evidence is insufficient. Our results affirm that RAG is necessary but not yet sufficient for legal QA, and we offer **SearchFireSafety** as a rigorous testbed to drive progress in Legal AI.
All data resources are available at: https://anonymous.4open.science/r/SearchFireSafety-C2AB/.
Primary Area: datasets and benchmarks
Submission Number: 19077