RegionDoc-R1: Reinforcing Semantic Layout-Aware Learning for Document Understanding

20 Sept 2025 (modified: 12 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Reinforcement Learning, Document Understanding
Abstract: Eliciting reasoning abilities in Multimodal Large Language Models (MLLMs) through rule-based Reinforcement Learning (RL) has recently proven promising. In this work, we introduce RegionDoc-R1, a novel framework for document understanding that enhances MLLMs’ reasoning with step-wise feedback. Directly applying RL training with the Group Relative Policy Optimization (GRPO) algorithm to document reasoning presents two primary challenges: (i) a lack of layout modeling for document understanding, and (ii) the scarcity of high-quality document-reasoning data. To address these issues, we first propose Region-Aware Group Relative Policy Optimization (RA-GRPO), which encourages models to use region-level spatial information in documents for reasoning. Instead of the OCR-based text positions used in previous work, we incorporate high-quality semantic reasoning layouts, linking visual regions directly to question-answer semantics. Correspondingly, we construct a hybrid training corpus, named SR-Doc, containing Semantic Reasoning (SR) examples enriched with cross-page and region-level reasoning layout annotations. We also introduce an Adaptive Chain-of-Thought (Ada-CoT) strategy, which dynamically adjusts the reasoning process to the task at hand, enabling more efficient and flexible step-wise document understanding. Experiments on several document reasoning benchmarks demonstrate that RegionDoc-R1 achieves state-of-the-art performance across tasks such as form understanding, table-based QA, and layout-sensitive information extraction.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23365
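Illustrative sketch (not part of the submission): the abstract builds on GRPO, whose core step normalizes each sampled response's reward against the other responses drawn for the same prompt. The snippet below shows that standard group-relative normalization. The reward is an assumption made only for illustration: `combined_reward`, its `region_weight` parameter, and the IoU-based region term are hypothetical and are not the paper's RA-GRPO reward, which the abstract does not specify.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def combined_reward(answer_correct, pred_box, gold_box, region_weight=0.5):
    """Hypothetical reward: answer correctness plus a weighted region-overlap bonus.

    This is an assumed stand-in, not the reward defined by the paper.
    """
    return float(answer_correct) + region_weight * iou(pred_box, gold_box)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); the
    population standard deviation is used here as a simple choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question; the gold evidence region is a box.
gold = (10, 10, 50, 30)
samples = [
    (True, (10, 10, 50, 30)),   # correct answer, exact region
    (True, (30, 10, 70, 30)),   # correct answer, partial overlap
    (False, (10, 10, 50, 30)),  # wrong answer, right region
    (False, (60, 60, 80, 80)),  # wrong answer, wrong region
]
rewards = [combined_reward(ok, box, gold) for ok, box in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, responses that both answer correctly and point at the right region receive the largest positive advantage, while the advantages within a group always sum to approximately zero.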