Search-MM: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains

Published: 24 Sept 2025, Last Modified: 24 Sept 2025, NeurIPS 2025 LLM Evaluation Workshop Poster, CC BY 4.0
Keywords: Agentic RAG, Multimodal RAG, Agentic Reasoning, Agent Evaluation
Abstract: Retrieval-Augmented Generation (RAG) has become a key paradigm for grounding multimodal large language models (MLLMs) in external evidence. Current MM-RAG benchmarks, however, emphasize simplified QA tasks with shallow reasoning depth, falling short in evaluating agentic RAG behaviors such as iterative planning and retrieval. We present Search-MM, a benchmark with golden, hop-wise reasoning chains that specify sub-questions, retrieval modalities, supporting facts, and intermediate answers, enabling fine-grained analysis of retrieval planning and reasoning accuracy. To ensure fidelity, we propose HAVE, a hop-wise verification procedure that filters hallucinated or redundant steps. Search-MM covers five representative reasoning structures and consists of 3,333 high-quality examples. We further develop an agentic MM-RAG pipeline and introduce three chain-level metrics to jointly assess answer accuracy and intermediate retrieval fidelity. Experiments benchmark MLLMs under this framework, revealing key challenges in modality-aware planning and the trade-off between retrieval effectiveness and efficiency.
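As a rough illustration of the hop-wise chain annotation the abstract describes, each example can be thought of as a chain of hops, where every hop records a sub-question, a retrieval modality, supporting facts, and an intermediate answer. The sketch below is a minimal, hypothetical schema: the names (`Hop`, `ReasoningChain`, `sub_question`, `modality`, and the example content) are illustrative placeholders, not the benchmark's actual data format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one hop in a Search-MM-style reasoning chain.
# The paper states each hop specifies a sub-question, retrieval modality,
# supporting facts, and an intermediate answer; this exact schema is assumed.


@dataclass
class Hop:
    sub_question: str                # decomposed query for this hop
    modality: str                    # retrieval modality, e.g. "text" or "image"
    supporting_facts: list[str] = field(default_factory=list)
    intermediate_answer: str = ""


@dataclass
class ReasoningChain:
    question: str                    # the original multi-hop question
    hops: list[Hop]                  # golden, hop-wise reasoning steps
    final_answer: str


# Illustrative two-hop chain mixing image and text retrieval.
chain = ReasoningChain(
    question="Which city hosts the museum shown in the photo?",
    hops=[
        Hop("What museum is shown in the photo?", "image",
            ["The photo matches the Louvre pyramid."], "The Louvre"),
        Hop("Which city is the Louvre located in?", "text",
            ["The Louvre is located in Paris, France."], "Paris"),
    ],
    final_answer="Paris",
)
print(chain.final_answer)  # -> Paris
```

Structuring the gold annotation this way is what would let a hop-wise verifier such as HAVE check each step independently, and what makes chain-level metrics over intermediate retrieval fidelity possible.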
Submission Number: 186