Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

ACL ARR 2026 January Submission976 Authors

26 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Document Understanding, Multimodal Reasoning, Multimodal Large Language Model
Abstract: Document understanding aims to perform question answering and information extraction over document images, where the visual content is highly information-dense and most task queries depend on only a few relevant regions. However, existing methods either adopt a one-pass strategy that implicitly assumes all regions are equally important, or focus excessively on small regions at the cost of losing critical layout information, leading to unfaithful responses. Following the human reading pattern, we introduce Doc-CoB (Chain-of-Box), a simple-yet-effective mechanism that integrates coarse-to-fine visual reasoning into MLLM without modifying its architecture. Our method allows the model to autonomously select the set of layouts most relevant to the query, and then focus on them for further understanding. To support this paradigm, we design two enabling tasks that improve box identification and box–query reasoning, facilitating layout-aware document understanding. We also design an automatic pipeline, integrating a commercial MLLM with a layout analyzer, to generate 249k training samples with intermediate visual reasoning supervision. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability. All code, data, and models will be released.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Document Understanding, Multimodal Reasoning, Multimodal Large Language Model
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 976