Keywords: Multimodal Large Language Model, Visual Question Answering, Robustness
TL;DR: A novel feature-level restoration method for MLLMs in low-quality document image scenarios, with a large-scale VQA dataset for restoration training.
Abstract: Document images are primary carriers of knowledge and information, yet their effective understanding is often hindered by degradations such as noise, blur, and low resolution. In this paper, we address the challenge of robust document understanding under such low-quality conditions by proposing the DocRobust-Module (DRM), an efficient feature restoration module that, when integrated with a multimodal large language model (MLLM), enables the recovery of lost visual and semantic information with minimal parameter modifications. Our method is supported by a novel two-stage training strategy that incrementally guides the model to restore critical information from both visual and semantic perspectives. To support the fine-tuning of MLLMs with DRM, we construct DocRobust-VQA, a large-scale visual question answering dataset containing extensive low-quality document images along with their high-quality counterparts and QA annotations. With over 189K clear-blurry image pairs annotated with 417K QA pairs, DocRobust-VQA provides sufficient fine-tuning data for enhancing the robustness of MLLMs under real-world degradations. Extensive experiments demonstrate that our method consistently improves performance on low-quality document images, offering new insights and a scalable solution for robust document understanding.
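As a purely illustrative aid (not part of the submission), the sketch below shows one way a lightweight feature-level restoration adapter could operate on the visual-token features of a frozen backbone, trained against features from the clean counterpart image. The class name, dimensions, and the feature-alignment objective are all assumptions for illustration, not the actual DRM architecture or training recipe.

```python
# Hypothetical sketch of a feature-level restoration adapter.
# Every name, shape, and loss choice here is an assumption; the abstract does
# not specify the DRM design.
import torch
import torch.nn as nn


class FeatureRestorationAdapter(nn.Module):
    """Lightweight residual block that nudges degraded visual-token features
    toward their clean counterparts (illustrative stand-in for DRM)."""

    def __init__(self, dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.restore = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Residual correction: the frozen backbone's features pass through
        # unchanged, and only a small "restoration" offset is learned.
        return visual_tokens + self.restore(self.norm(visual_tokens))


if __name__ == "__main__":
    adapter = FeatureRestorationAdapter(dim=1024)
    degraded = torch.randn(2, 196, 1024)   # (batch, visual tokens, feature dim)
    clean_ref = torch.randn(2, 196, 1024)  # features from the clear image

    # Assumed stage-one-style objective: align restored features of the
    # degraded image with features extracted from its high-quality pair.
    loss = nn.functional.mse_loss(adapter(degraded), clean_ref)
    loss.backward()
    print(f"feature-alignment loss: {loss.item():.4f}")
```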
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17969