ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents

ACL ARR 2026 January Submission 2606 Authors

03 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Document Understanding, Visual Language Model, Reinforcement Learning
Abstract: While vision–language models (VLMs) interpret text-rich images effectively, they struggle to reason across long, multi-page documents. We present $\textbf{A}$ctive $\textbf{L}$ong-$\textbf{D}$ocum$\textbf{E}$nt $\textbf{N}$avigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs to act as interactive agents that actively navigate long, visually rich documents rather than as passive readers. ALDEN features a novel $\texttt{fetch}$ action that allows direct page indexing, complementing the classic $\texttt{search}$ action and better exploiting document structure. To ensure training efficiency and stability, we introduce a rule-based cross-level reward for dense supervision and a visual-semantic anchoring mechanism that uses dual-path KL-divergence constraints. We train ALDEN on a curated corpus built from open-source datasets, in which trivial samples are filtered out and queries are rewritten to incentivize multi-turn navigation and $\texttt{fetch}$ usage. Empirically, ALDEN achieves state-of-the-art results on five long-document benchmarks, offering a more accurate and efficient path to long-document understanding.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: NLP Applications; Question Answering; Information Retrieval and Text Mining
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 2606
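
Illustrative sketch (not from the submission): the abstract contrasts a fetch action that indexes a page directly with a search action that retrieves pages by query. The toy Python below shows only that distinction. Everything in it is an assumption made for illustration: the `Document` class, the function signatures, the 1-based page numbering, and the word-overlap ranking are hypothetical, and pages are plain strings rather than page images. It does not reproduce ALDEN's retriever, reward, or training procedure.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical stand-in: one text string per page (real pages would be images).
    pages: list[str]


def fetch(doc: Document, page_number: int) -> str:
    """Direct page indexing: return the page with the given 1-based number."""
    if not 1 <= page_number <= len(doc.pages):
        raise IndexError(f"page {page_number} is out of range")
    return doc.pages[page_number - 1]


def search(doc: Document, query: str, top_k: int = 2) -> list[int]:
    """Query-based retrieval: rank pages by how many query words they contain.

    Returns up to top_k 1-based page numbers, best match first.
    Pages that share no word with the query are dropped.
    """
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(page.lower().split())), number)
        for number, page in enumerate(doc.pages, start=1)
    ]
    # Higher overlap first; ties broken by lower page number.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [number for score, number in ranked[:top_k] if score > 0]


doc = Document(pages=[
    "table of contents results on page 3",
    "introduction to the method",
    "results accuracy table",
])

hits = search(doc, "accuracy results")  # content-based lookup
page = fetch(doc, 3)                    # structure-based lookup, e.g. after reading the contents page
```

In this toy setting, search helps when the agent only knows what it is looking for, while fetch helps when the document itself (a table of contents, a "see page 3" cross-reference) says where to look.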