Keywords: Retrieval-Augmented Generation, Large Language Model, Vision-Language Model, Multimodal Learning, Video Reasoning, Localization, Dataset, Benchmark
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to improve large language models (LLMs) by grounding their outputs in external knowledge. However, progress in the multimodal domain remains limited, largely due to the lack of suitable benchmarks. Existing multimodal corpora are often built by merging unimodal datasets, which rarely support queries requiring multi-hop reasoning and thus reduce most tasks to single-modality, one-hop retrieval. To address this gap, we introduce $\textit{EgoFact}$, the first benchmark explicitly designed for multi-hop reasoning across visual and textual corpora. Success on $\textit{EgoFact}$ requires models to retrieve and integrate evidence spanning multiple modalities. We systematically evaluate existing RAG systems and uncover fundamental limitations in multimodal evidence integration and reasoning. Motivated by these findings, we propose a localization-first framework for cross-modal video reasoning that enables more precise evidence grounding and substantially improves reasoning accuracy. Extensive experiments demonstrate the effectiveness of our approach, establishing new state-of-the-art results on multimodal RAG tasks. Together, the benchmark and framework lay a foundation for advancing research in this emerging area and for building more reliable multimodal reasoning systems.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5270