Keywords: Multimodal Retrieval-Augmented Generation; Retrieval-Augmented Generation; Coherence Understanding
Abstract: While Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in enhancing large language models, existing approaches struggle to accurately capture the complex structural relationships between textual and visual elements. This paper introduces R$^2$AG (Rethinking Retrieval-Augmented Generation), a novel framework that extends MRAG to multimodal multi-level property graphs, significantly improving multimodal coherence understanding. Our approach represents multimodal content as interconnected nodes and edges in a property graph, capturing semantically rich relationships beyond conventional embedding distances. To address the exponential growth of graph complexity with additional hops, we propose an Implicit Chain-of-Thought (Implicit-CoT) technique that efficiently partitions and analyzes local subgraphs while deriving comprehensive features from both node semantics and structural properties. Additionally, we develop an improved graph matching algorithm that not only considers feature consistency but also recognizes semantic approximations and prioritizes rare entities, enhancing matching accuracy and robustness. Extensive experiments on public datasets demonstrate that R$^2$AG outperforms state-of-the-art methods on multiple tasks requiring deep multimodal coherence understanding. Our code is available at \url{https://anonymous.4open.science/r/R2AG-4F58/}.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3544