Keywords: Multimodal Large Language Models, Attention Analysis, Multi-element Document Understanding
Abstract: Document understanding remains a significant challenge for multimodal large language models (MLLMs). While previous research has primarily focused on locating evidence pages through precise multimodal queries, our work investigates a fundamental yet overlooked aspect: how input format influences document comprehension performance. Through systematic analysis, we discover that plain multi-element text extracted from PDFs often impairs rather than improves MLLMs' performance, a counterintuitive finding that we attribute to attention dispersion and loss of structure. To further substantiate our hypothesis, we propose using the LaTeX paradigm as a tool for encoding document elements, preserving the hierarchical organization and spatial relationships critical for comprehension. Our attention analysis reveals that structured multi-element text induces structured attention patterns over both textual and visual content, directing models to focus on semantically meaningful regions while reducing attention waste. Specifically, we find that structured text significantly enhances MLLMs' document question-answering performance across diverse document types without requiring architectural modifications or additional training.
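As a minimal illustration of the encoding the abstract describes (our own sketch; the exact markup used in the paper may differ), a table extracted from a PDF could be passed to the model as LaTeX source rather than flattened text, so that row and column boundaries survive:

% Hypothetical example: a document table encoded as LaTeX source.
% Plain-text extraction would collapse this into one undifferentiated
% line; the markup keeps the header/body hierarchy and cell alignment.
% Cell contents are placeholders, not results from the paper.
\begin{tabular}{lr}
  \toprule
  Header A & Header B \\
  \midrule
  Row 1, Cell 1 & Row 1, Cell 2 \\
  Row 2, Cell 1 & Row 2, Cell 2 \\
  \bottomrule
\end{tabular}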
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22974