RECON: Multimodal GraphRAG for Visually Rich Documents with Intra-Page Reflection and Inter-Page Connection

Submitted: 13 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Retrieval augmented generation, knowledge graph-based augmented generation, multimodal knowledge graph, multimodal large language model, large language model
Abstract: Multimodal large language models (MLLMs) are widely applied to visual question answering (VQA) over visual documents. However, their ability to comprehend long documents is still constrained by the limited context window. Although recent multimodal retrieval-augmented generation (MMRAG) methods can retrieve relevant pages to address this challenge, they struggle with questions that require holistic comprehension of the entire document. To cope with this, a knowledge graph (KG) that summarizes the global knowledge of a document offers an effective way to enhance QA performance. However, most existing LLM-based KG-construction methods handle the language modality only; automatically constructing multimodal KGs (MMKGs) for visually rich documents remains largely underexplored. To tackle this issue, we introduce a multimodal GraphRAG approach, RECON, which constructs MMKGs in two stages: (1) Intra-page REflection, which iteratively extracts and reflects on both textual and visual entity relations within each page, adapting to the complexity of the page content; and (2) Inter-page CONnection, which links multimodal relations across pages to form a coherent global graph. The lack of annotated cross-page global VQA datasets, specifically for query-focused visual document summaries (QFVDS), also hinders effective model evaluation. We further build a QFVDS dataset with annotated answers and corresponding supporting facts to enable effective evaluation. Experimental results show that RECON outperforms existing MMRAG approaches on various VQA datasets as well as QFVDS.
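To make the two-stage construction concrete, the following is a minimal Python sketch of how an intra-page reflection loop and an inter-page connection step could be organized. It is not the paper's implementation; all names (`Relation`, `mllm_extract_fn`, `max_reflection_rounds`, `build_page_graph`, `connect_pages`) are illustrative assumptions, and the MLLM extraction call is left as a caller-supplied function.

```python
# Hypothetical sketch of a two-stage multimodal KG construction pipeline,
# loosely following the RECON abstract. Names and signatures are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Relation:
    """A (head, relation, tail) triple with its source page and modality."""
    head: str
    rel: str
    tail: str
    page: int
    modality: str  # "text" or "visual"


def build_page_graph(
    page_content: dict,
    mllm_extract_fn: Callable[[dict, List[Relation]], List[Relation]],
    max_reflection_rounds: int = 3,
) -> List[Relation]:
    """Stage 1 (intra-page reflection, sketched): repeatedly ask an MLLM to
    extract triples from the page, feeding back what was already found, until
    no new triples appear or the round budget is exhausted. More complex pages
    naturally consume more rounds."""
    relations: List[Relation] = []
    for _ in range(max_reflection_rounds):
        new = [r for r in mllm_extract_fn(page_content, relations) if r not in relations]
        if not new:  # reflection converged: nothing new found on this page
            break
        relations.extend(new)
    return relations


def connect_pages(page_relations: List[List[Relation]]) -> Dict[str, Set[Tuple[str, str]]]:
    """Stage 2 (inter-page connection, sketched): merge per-page triples into
    one global adjacency map by unifying entities that share a normalized
    surface form, so relations from different pages attach to the same node."""
    graph: Dict[str, Set[Tuple[str, str]]] = {}
    for relations in page_relations:
        for r in relations:
            head, tail = r.head.strip().lower(), r.tail.strip().lower()
            graph.setdefault(head, set()).add((r.rel, tail))
    return graph
```

In this sketch, entity unification is done by simple string normalization; an actual system would likely use embedding- or LLM-based entity resolution before linking relations across pages.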
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4710