Keywords: Multimodal Retrieval-Augmented Generation; Retrieval-Augmented Generation; Coherence Understanding
Abstract: While Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in enhancing large language models, existing approaches struggle to accurately capture the complex structural relationships between textual and visual elements. This paper introduces R$^2$AG (Rethinking Retrieval-Augmented Generation), a novel framework that extends MRAG to multimodal multi-level property graphs, significantly improving multimodal coherence understanding. Our approach represents multimodal content as interconnected nodes and edges in a property graph, capturing semantically rich relationships beyond conventional embedding distances. To address the exponential growth of graph complexity with additional hops, we propose an Implicit Chain-of-Thought (Implicit-CoT) technique that efficiently partitions and analyzes local subgraphs while deriving comprehensive features from both node semantics and structural properties. Additionally, we develop an improved graph matching algorithm that not only considers feature consistency but also recognizes semantic approximations and prioritizes rare entities, enhancing matching accuracy and robustness. Extensive experiments on public datasets demonstrate that R$^2$AG outperforms state-of-the-art methods on multiple tasks requiring deep multimodal coherence understanding. Our code is available at \url{https://anonymous.4open.science/r/R2AG-4F58/}.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3544