Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness

ACL ARR 2025 May Submission 8033 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: While large language models (LLMs) demonstrate impressive capabilities, their reliance on parametric knowledge often leads to factual inaccuracies. Retrieval-Augmented Generation (RAG) mitigates this by leveraging external documents, yet existing approaches treat retrieved passages as isolated chunks, ignoring valuable document structure that could enhance knowledge acquisition and utilization. Motivated by this gap, we propose Retrieve-DocumentRoute-Read (RDR²), a novel framework that explicitly incorporates document structure throughout the RAG process. RDR² employs an LLM-based router to dynamically navigate document structure trees, jointly evaluating content relevance and hierarchical relationships to assemble optimal evidence. Our key innovation lies in formulating document routing as a trainable task, with automatic behavior curation and structure-aware passage selection inspired by human reading strategies. Through comprehensive evaluation on three challenging datasets, RDR² achieves state-of-the-art performance, demonstrating that explicit structural awareness significantly enhances RAG systems' ability to acquire and utilize knowledge, particularly in complex scenarios requiring multi-document synthesis.
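To make the routing idea concrete, the Python sketch below shows one way a Retrieve-DocumentRoute-Read loop over a document structure tree could be organized. Everything here is an illustrative assumption rather than the paper's implementation: the DocNode class, the route and read helpers, and the keyword-overlap toy_score that stands in for the trained LLM router.

```python
# Hypothetical sketch of a structure-aware RAG routing loop. All names and
# the routing policy are illustrative assumptions, not the authors' method.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DocNode:
    """One node of a document structure tree (a section or a leaf passage)."""
    title: str
    text: str = ""
    children: List["DocNode"] = field(default_factory=list)


def route(query: str, root: DocNode,
          score: Callable[[str, DocNode], float],
          threshold: float = 0.5, budget: int = 4) -> List[DocNode]:
    """Walk the structure tree breadth-first, keeping nodes judged relevant.

    `score` stands in for the LLM-based router, which would jointly judge
    content relevance and the node's position in the hierarchy.
    """
    evidence: List[DocNode] = []
    frontier = [root]
    while frontier and len(evidence) < budget:
        node = frontier.pop(0)
        if score(query, node) >= threshold:
            if node.text:                   # leaf passage: select as evidence
                evidence.append(node)
            frontier.extend(node.children)  # relevant section: descend into it
    return evidence


def read(query: str, evidence: List[DocNode]) -> str:
    """Assemble a reader prompt from the routed evidence (LLM call omitted)."""
    context = "\n\n".join(f"[{n.title}]\n{n.text}" for n in evidence)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    doc = DocNode("Document structure in RAG", children=[
        DocNode("Background", text="RAG augments LLMs with retrieved text."),
        DocNode("Structure-aware methods", children=[
            DocNode("Flat chunking",
                    text="Most systems split documents into isolated chunks."),
            DocNode("Structure-aware routing",
                    text="Routing over section trees preserves hierarchy."),
        ]),
    ])

    # Toy scorer: keyword overlap in place of an LLM relevance judgment.
    def toy_score(q: str, n: DocNode) -> float:
        haystack = (n.title + " " + n.text).lower()
        return float(any(w in haystack for w in q.lower().split()))

    query = "structure routing"
    print(read(query, route(query, doc, toy_score)))
```

In the paper's setting the scorer would presumably be the trained router judging both a node's content and its place in the hierarchy; the budgeted breadth-first expansion is one way to mirror the human reading strategy the abstract alludes to, skimming a table of contents before committing to specific sections.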
Paper Type: Long
Research Area: Generation
Research Area Keywords: retrieval-augmented generation; open-domain QA; text-to-text generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8033