GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

Siqi Li; Yufan Shen; Xiangnan Chen; Jiayi Chen; Hengwei Ju; Haodong Duan; Song Mao; Hongbin Zhou; Bo Zhang; Bin Fu; Pinlong Cai; Licheng Wen; Botian Shi; Xinyu Cai; Yong Liu; Yu Qiao

GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

Siqi Li, Yufan Shen, Xiangnan Chen, Jiayi Chen, Hengwei Ju, Haodong Duan, Song Mao, Hongbin Zhou, Bo Zhang, Bin Fu, Pinlong Cai, Licheng Wen, Botian Shi, Xinyu Cai, Yong Liu, Yu Qiao

19 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Document Understanding, Benchmark, MLLM

TL;DR: We introduce GDI-Bench, a document-domain benchmark with a difficulty grading system that decouples visual and reasoning complexity for systematic model evaluation and optimization.

Abstract: The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 2.3k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate various open-source and closed-source models on GDI-Bench, conducting decoupled analyses in the visual and reasoning domains, revealing their strengths and weaknesses. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI-Model that mitigates catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy, thereby reinforcing the inherent weaknesses of the base model. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and models are or will be open-sourced on \url{https://huggingface.co/GDIBench}.

Primary Area: datasets and benchmarks

Submission Number: 15500

Loading