Abstract: Document Visual Question Answering (DVQA) involves answering queries based on the contents of document images. Existing works are confined to locating information within a single page and do not support cross-page question answering. Furthermore, the token-length limit on model inputs can cause answer-relevant segments to be truncated. In this study, we present CREAM, a methodology that combines high-performance retrieval with the integration of relevant multimodal document information to address these issues. To overcome the limitations of current text-embedding similarity methods, we first employ a coarse-to-fine retrieval and ranking approach: the coarse phase computes the similarity between the query and text-chunk embeddings, while the fine phase uses multiple rounds of grouping and ordering with a large language model to identify the text chunks most relevant to the query. We then integrate an attention pooling mechanism for multi-page document images into the vision encoder, merging the visual information of multiple pages so that the multimodal large language model (MLLM) can process both single-page and multi-page documents. Finally, we apply several parameter-efficient tuning methods to further improve document visual question-answering performance. Experiments demonstrate that our approach achieves state-of-the-art results across various document datasets.
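The sketch below illustrates the coarse-to-fine retrieval described above, assuming hypothetical `embed` and `llm_rank_group` callables as stand-ins for the actual embedding model and LLM ranking prompt; the group size, number of rounds, and cut-off values are illustrative, not the paper's exact settings.

```python
# Hypothetical sketch of coarse-to-fine retrieval: embedding similarity for the
# coarse phase, then rounds of LLM-based grouping and ordering for the fine phase.
from typing import Callable, List, Sequence
import numpy as np

def cosine_sim(q: np.ndarray, chunks: np.ndarray) -> np.ndarray:
    q = q / (np.linalg.norm(q) + 1e-8)
    chunks = chunks / (np.linalg.norm(chunks, axis=1, keepdims=True) + 1e-8)
    return chunks @ q

def coarse_to_fine_retrieve(
    query: str,
    chunks: List[str],
    embed: Callable[[Sequence[str]], np.ndarray],           # assumed: returns (n, d) embeddings
    llm_rank_group: Callable[[str, List[str]], List[int]],  # assumed: returns group indices ordered by relevance
    coarse_k: int = 20,
    group_size: int = 5,
    rounds: int = 2,
    final_k: int = 3,
) -> List[str]:
    # Coarse phase: similarity between the query embedding and every chunk embedding.
    q_emb = embed([query])[0]
    c_emb = embed(chunks)
    top = np.argsort(-cosine_sim(q_emb, c_emb))[:coarse_k].tolist()
    candidates = [chunks[i] for i in top]

    # Fine phase: several rounds of grouping; the LLM orders each group and
    # only the top portion of every group survives to the next round.
    for _ in range(rounds):
        survivors: List[str] = []
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            order = llm_rank_group(query, group)
            survivors.extend(group[i] for i in order[: max(1, len(group) // 2)])
        candidates = survivors
        if len(candidates) <= final_k:
            break
    return candidates[:final_k]
```

The retained chunks (and their source pages) would then be passed to the multimodal model; the halving-per-round policy above is one simple way to realize "multiple rounds of grouping and ordering" and is an assumption of this sketch.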
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: Our approach focuses on addressing multi-page document visual question-answering tasks.
1. We introduce a coarse-to-fine retrieval algorithm for extracting relevant text chunks and their corresponding document images from extensive documents.
2. We introduce a multimodal large language model tailored to multi-page document images, together with a dedicated multi-page document image encoder (see the pooling sketch after this list).
3. We enhance an existing multimodal large language model and incorporate retrieval-augmented generation, applying both to the new setting of multi-page document visual question answering.
4. Our approach achieves state-of-the-art results on four of the five document datasets, compared with existing methods.
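The following is a minimal sketch of attention pooling over multi-page visual features, referenced in contribution 2 and in the abstract; the hidden size, number of learnable query tokens, and module names are illustrative assumptions rather than the paper's exact architecture.

```python
# Hypothetical attention-pooling module: merges vision-encoder features from
# several pages into a fixed-length token sequence for the MLLM.
import torch
import torch.nn as nn

class MultiPageAttentionPool(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens attend over all page features.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, page_feats: torch.Tensor) -> torch.Tensor:
        # page_feats: (batch, num_pages * tokens_per_page, dim), i.e. the
        # vision-encoder outputs of the retrieved pages concatenated along
        # the sequence axis.
        b = page_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, page_feats, page_feats)
        return self.norm(pooled)  # (batch, num_queries, dim)

# Usage: concatenate encoder outputs of N pages, then pool to a fixed budget.
# feats = torch.randn(1, 3 * 256, 1024)        # e.g. 3 pages, 256 tokens each
# tokens = MultiPageAttentionPool()(feats)     # -> (1, 64, 1024)
```

Pooling to a fixed number of tokens keeps the visual context within the model's input budget regardless of how many pages are retrieved, which is one way to let the same MLLM handle both single-page and multi-page documents.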
Supplementary Material: zip
Submission Number: 1757