A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
Keywords: Visually Rich Document, Multimodal Large Language Model, Visual Question Answering, Key Information Extraction
Abstract: Visually Rich Document Understanding (VRDU) has become a pivotal area of research, driven by the need to automatically interpret documents that contain intricate visual, textual, and structural elements. Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant promise in this domain, including both OCR-based and OCR-free approaches for information extraction from document images. This survey reviews recent advances in MLLM-based VRDU, highlighting emerging trends and promising research directions with a focus on two key aspects: (1) techniques for representing and integrating textual, visual, and layout features; (2) training paradigms, spanning pretraining, instruction tuning, and the associated training strategies. Moreover, we address challenges such as data scarcity, handling multi-page and multilingual documents, and integrating emerging trends such as Retrieval-Augmented Generation and agentic frameworks. Our analysis offers a roadmap for advancing MLLM-based VRDU toward more scalable, reliable, and adaptable systems.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: multimodal applications, document understanding
Contribution Types: Surveys
Languages Studied: English
Submission Number: 4638