Search-MM: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains

Published: 24 Sept 2025, Last Modified: 24 Sept 2025, NeurIPS 2025 LLM Evaluation Workshop Poster, CC BY 4.0
Keywords: Agentic RAG, Multimodal RAG, Agentic Reasoning, Agent Evaluation
Abstract: Retrieval-Augmented Generation (RAG) has become a key paradigm for grounding multimodal large language models (MLLMs) in external evidence. Current MM-RAG benchmarks, however, emphasize simplified QA tasks with shallow reasoning depth, falling short in evaluating agentic RAG behaviors such as iterative planning and retrieval. We present Search-MM, a benchmark with golden, hop-wise reasoning chains that specify sub-questions, retrieval modalities, supporting facts, and intermediate answers, enabling fine-grained analysis of retrieval planning and reasoning accuracy. To ensure fidelity, we propose HAVE, a hop-wise verification procedure that filters hallucinated or redundant steps. Search-MM covers five representative reasoning structures and consists of 3,333 high-quality examples. We further develop an agentic MM-RAG pipeline and introduce three chain-level metrics to jointly assess answer accuracy and intermediate retrieval fidelity. Experiments benchmark MLLMs under this framework, revealing key challenges in modality-aware planning and the trade-off between retrieval effectiveness and efficiency.
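As a rough illustration of the hop-wise chain annotation the abstract describes, each example can be thought of as a chain of hops, where every hop records a sub-question, a retrieval modality, supporting facts, and an intermediate answer. The sketch below is a minimal, hypothetical schema: the names (`Hop`, `ReasoningChain`, `sub_question`, `modality`, and the example content) are illustrative placeholders, not the benchmark's actual data format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one hop in a Search-MM-style reasoning chain.
# The paper states each hop specifies a sub-question, retrieval modality,
# supporting facts, and an intermediate answer; this exact schema is assumed.


@dataclass
class Hop:
    sub_question: str                # decomposed query for this hop
    modality: str                    # retrieval modality, e.g. "text" or "image"
    supporting_facts: list[str] = field(default_factory=list)
    intermediate_answer: str = ""


@dataclass
class ReasoningChain:
    question: str                    # the original multi-hop question
    hops: list[Hop]                  # golden, hop-wise reasoning steps
    final_answer: str


# Illustrative two-hop chain mixing image and text retrieval.
chain = ReasoningChain(
    question="Which city hosts the museum shown in the photo?",
    hops=[
        Hop("What museum is shown in the photo?", "image",
            ["The photo matches the Louvre pyramid."], "The Louvre"),
        Hop("Which city is the Louvre located in?", "text",
            ["The Louvre is located in Paris, France."], "Paris"),
    ],
    final_answer="Paris",
)
print(chain.final_answer)  # -> Paris
```

Structuring the gold annotation this way is what would let a hop-wise verifier such as HAVE check each step independently, and what makes chain-level metrics over intermediate retrieval fidelity possible.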
Submission Number: 186