Keywords: Multimodal Large Language Models, Attention Analysis, Multi-element Document Understanding
Abstract: Document understanding remains a significant challenge for multimodal large language models (MLLMs). While previous research has primarily focused on locating evidence pages through precise multimodal queries, our work investigates a fundamental yet overlooked aspect: how input format influences document comprehension performance. Through systematic analysis, we discover that plain multi-element text extracted from PDFs often impairs rather than improves MLLMs' performance, a counterintuitive finding that we attribute to attention dispersion and loss of structure. To further substantiate our hypothesis, we propose using the LaTeX paradigm as a tool for encoding document elements, preserving the hierarchical organization and spatial relationships critical for comprehension. Our attention analysis reveals that structured multi-element text induces structured attention patterns over both textual and visual content, directing models to focus on semantically meaningful regions while reducing attention waste. Specifically, we find that structured text significantly enhances MLLMs' document question-answering performance across diverse document types without requiring architectural modifications or additional training.
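As a minimal illustration of the encoding the abstract describes (our own sketch; the exact markup used in the paper may differ), a table extracted from a PDF could be passed to the model as LaTeX source rather than flattened text, so that row and column boundaries survive:

% Hypothetical example: a document table encoded as LaTeX source.
% Plain-text extraction would collapse this into one undifferentiated
% line; the markup keeps the header/body hierarchy and cell alignment.
% Cell contents are placeholders, not results from the paper.
\begin{tabular}{lr}
  \toprule
  Header A & Header B \\
  \midrule
  Row 1, Cell 1 & Row 1, Cell 2 \\
  Row 2, Cell 1 & Row 2, Cell 2 \\
  \bottomrule
\end{tabular}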
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22974