MFCL Vision: Benchmarking Tool Use in Multimodal Large Language Models for Visual Reasoning Tasks

Published: 08 Nov 2025, Last Modified: 30 Nov 2025, NeurIPS 2025 Workshop NORA Oral, CC BY 4.0
Keywords: Function Calling Evaluation, Tool use, Large Language Models, Multimodal Agents, Vision Language Models
Abstract: As multimodal large language models become tool-using agents, the field still lacks a standardized way to measure how reliably they translate visual inputs into correct tool invocations. We introduce MFCL Vision, the first large-scale benchmark for vision-based function calling, comprising 250 expert-verified tasks across five image domains (Places, Events, Media, Sports, Shopping) and five query types (Locate, Temporal, Select, Identify, Quantify). Each task consists of (1) a textual user query, (2) an accompanying image, (3) a ground-truth answer obtained from the web, and (4) a human-produced reasoning trace for comparative error analysis. To constrain the task, we expose a single web-search tool to each model. To examine the robustness of multimodal LLMs' perception-to-tool-use pipeline, we introduce controlled visual perturbations, including crops, resizes, and color channel removal. Our automatic grader computes exact-match scores on models' final answers, removing dependence on brittle and potentially biased LLM judges. We evaluate leading models and present a taxonomy of failure modes, including errors in visual reasoning, assumption bias, keyword selection, and tool avoidance. By releasing MFCL Vision's dataset, taxonomy, and diagnostics, we aim to accelerate progress toward versatile multimodal agents capable of intelligent tool use in complex visual contexts.
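
The abstract describes controlled visual perturbations (crops, resizes, color channel removal) and an exact-match grader over final answers. The sketch below is a minimal illustration of how such perturbations and grading might be implemented; the function names, parameter defaults, and normalization rules are assumptions for demonstration and are not taken from the MFCL Vision release.

```python
# Illustrative sketch only: the perturbation parameters and answer
# normalization below are assumptions, not the benchmark's exact procedure.
import re
from PIL import Image


def center_crop(img: Image.Image, fraction: float = 0.8) -> Image.Image:
    """Keep the central `fraction` of the image in each dimension."""
    w, h = img.size
    dw, dh = int(w * (1 - fraction) / 2), int(h * (1 - fraction) / 2)
    return img.crop((dw, dh, w - dw, h - dh))


def downscale(img: Image.Image, scale: float = 0.5) -> Image.Image:
    """Resize the image by a constant scale factor."""
    w, h = img.size
    return img.resize((max(1, int(w * scale)), max(1, int(h * scale))))


def drop_channel(img: Image.Image, channel: int = 0) -> Image.Image:
    """Zero out one color channel (0=R, 1=G, 2=B)."""
    bands = list(img.convert("RGB").split())
    bands[channel] = bands[channel].point(lambda _: 0)
    return Image.merge("RGB", bands)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()


def exact_match(prediction: str, ground_truth: str) -> bool:
    """Binary exact-match score on the model's final answer."""
    return normalize(prediction) == normalize(ground_truth)
```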
Submission Number: 3