ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

Published: 27 Oct 2025, Last Modified: 27 Oct 2025
Venue: NeurIPS Lock-LLM Workshop 2025 Poster
License: CC BY 4.0
Keywords: Multimodal Large Language Models, Document Visual Question Answering, Answer Localization, Spatial Grounding
Abstract: Document Visual Question Answering requires models to understand text layouts and ground answers to specific document regions. However, existing systems prioritize textual accuracy while neglecting spatial grounding, which limits interpretability in high-stakes applications. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework in which an LLM planning agent orchestrates specialized components for OCR, retrieval-augmented generation, and spatial grounding. ARIAL decomposes Document VQA into structured tool calls: TrOCR-based text extraction, semantic retrieval over OCR segments, LLM-based answer generation, and precise bounding-box localization. This modular design yields transparent reasoning traces for auditability. We evaluate on four benchmarks (DocVQA, FUNSD, CORD, SROIE) using a text-similarity metric (ANLS) and a spatial metric (mAP@IoU). ARIAL achieves new state-of-the-art results, including 88.7 ANLS and 50.1 mAP on DocVQA, surpassing DLaVA by +2.8 ANLS and +3.9 mAP points. ARIAL focuses on spatially grounded language models, demonstrating how LLMs can be constrained through modular tool orchestration, where each answer is locked to specific pixel coordinates and is traceable through an interpretable reasoning chain. The implementation is available at: https://github.com/ahmad-shirazi/ARIAL
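To make the tool orchestration concrete, below is a minimal, self-contained sketch of the kind of pipeline the abstract describes. It is an illustration only: Segment, ocr_extract, retrieve_segments, and llm_generate are hypothetical placeholders with toy implementations, not the ARIAL codebase (which uses TrOCR for extraction and an LLM planning agent). The sketch shows the key idea of locking each answer to the pixel bounding box of its source segment while logging an auditable trace of tool calls.

```python
# Minimal sketch of ARIAL-style tool orchestration (illustration only).
# All functions below are hypothetical stand-ins, not the ARIAL API.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Segment:
    text: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates

@dataclass
class GroundedAnswer:
    text: str
    bbox: Tuple[int, int, int, int]
    trace: List[str] = field(default_factory=list)  # auditable tool-call log

def ocr_extract(image) -> List[Segment]:
    """Hypothetical OCR tool; a real system would call TrOCR here."""
    return [
        Segment("Invoice No: 4821", (40, 30, 260, 55)),
        Segment("Total Due: $312.50", (40, 400, 280, 425)),
    ]

def retrieve_segments(question: str, segments: List[Segment],
                      top_k: int = 1) -> List[Segment]:
    """Toy semantic retrieval: rank segments by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(segments,
                    key=lambda s: len(q_words & set(s.text.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def llm_generate(question: str, context: List[Segment]) -> str:
    """Hypothetical LLM call; here we simply read off the best segment's value."""
    return context[0].text.split(": ", 1)[-1]

def answer_question(image, question: str) -> GroundedAnswer:
    trace = []
    segments = ocr_extract(image)                      # tool 1: OCR
    trace.append(f"ocr -> {len(segments)} segments")
    relevant = retrieve_segments(question, segments)   # tool 2: retrieval
    trace.append(f"retrieve -> {relevant[0].text!r}")
    answer = llm_generate(question, relevant)          # tool 3: generation
    trace.append(f"generate -> {answer!r}")
    bbox = relevant[0].bbox                            # tool 4: localization
    trace.append(f"localize -> {bbox}")
    return GroundedAnswer(answer, bbox, trace)

if __name__ == "__main__":
    result = answer_question(image=None, question="What is the total due?")
    print(result.text, result.bbox)  # answer locked to pixel coordinates
    for step in result.trace:
        print(" ", step)
```

Because every answer is produced by the final localization step, a failed grounding is detectable rather than silently hallucinated, which is the auditability property the abstract emphasizes.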
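For reference, ANLS (Average Normalized Levenshtein Similarity), the text metric cited above, scores each prediction by one minus the length-normalized edit distance to the closest ground-truth answer, zeroing any score whose distance reaches the threshold (0.5 in the standard DocVQA protocol). A small implementation of that standard formulation:

```python
# Self-contained ANLS (Average Normalized Levenshtein Similarity),
# following the standard DocVQA protocol with threshold tau = 0.5.
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions: List[str], ground_truths: List[List[str]],
         tau: float = 0.5) -> float:
    """Average over questions of the best per-answer similarity score."""
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, g = pred.strip().lower(), ans.strip().lower()
            denom = max(len(p), len(g)) or 1
            nl = levenshtein(p, g) / denom  # normalized edit distance
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)

# Example: one exact match, one near miss, one rejected by the threshold.
print(anls(["$312.50", "312.5", "invoice"],
           [["$312.50"], ["312.50"], ["4821"]]))  # ~0.61
```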
Submission Number: 57