Keywords: Function Calling Evaluation, Tool use, Large Language Models, Multimodal, Audio, Vision
Abstract: Large language models are evolving into multi-modal agents that call tools directly from raw speech or images, yet we still lack a principled metric for how well they convert perception into accurate function calls. We introduce \textbf{MFCL}, the first large-scale benchmark for \emph{Multi-modal Function Calling}, comprising \textbf{8.2K} expert-verified tasks across three suites—\textbf{True Audio}, \textbf{Text Audio}, and \textbf{Vision}. Each example pairs a multi-modal user query with a ground-truth tool-call trace. To examine different capabilities of the LLM's perception-to-action pipeline, we introduce controlled perturbations: for audio, accents, contractions, simplified forms, casual pronouns, slang, disfluencies (fillers, hesitations, repetitions), and background noise; for images, crops and resizes, occlusions, grayscale and other color shifts, and related transformations. Our automatic grader computes exact-match scores for both function names and their arguments, removing dependence on brittle LLM judges and isolating errors in perception, reasoning, and formatting. We evaluate leading models and present a taxonomy of failure modes: named-entity ASR errors, conversational drift, and tool avoidance. By releasing MFCL's dataset, taxonomy, and diagnostics, we hope to accelerate research on multi-modal agents that can effectively invoke tools.
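The exact-match grading described in the abstract can be illustrated with a short sketch. The snippet below is only an assumed illustration of how a grader over function names and arguments might be implemented; the names (ToolCall, normalize, call_matches, grade_trace) are hypothetical and are not taken from the MFCL release.

```python
# Illustrative sketch of an exact-match grader for function-call traces.
# All names and structures here are assumptions, not MFCL's released code.
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    name: str                  # function name, e.g. "get_weather"
    arguments: dict[str, Any]  # parsed argument dictionary


def normalize(value: Any) -> Any:
    """Canonicalize argument values so trivially equivalent forms compare equal."""
    if isinstance(value, str):
        return value.strip().lower()
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def call_matches(pred: ToolCall, gold: ToolCall) -> bool:
    """Exact match on the function name and on every argument key/value."""
    return (pred.name == gold.name
            and normalize(pred.arguments) == normalize(gold.arguments))


def grade_trace(pred_calls: list[ToolCall], gold_calls: list[ToolCall]) -> dict[str, float]:
    """Score a predicted tool-call trace against the ground-truth trace."""
    name_hits = sum(p.name == g.name for p, g in zip(pred_calls, gold_calls))
    full_hits = sum(call_matches(p, g) for p, g in zip(pred_calls, gold_calls))
    total = max(len(gold_calls), 1)
    return {
        "name_accuracy": name_hits / total,
        "exact_match": float(full_hits == len(gold_calls)
                             and len(pred_calls) == len(gold_calls)),
    }
```

Because the comparison is purely structural, a mismatch can be attributed directly to the predicted call itself (wrong name, wrong argument, or malformed output) rather than to the variance of an LLM judge.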
Primary Area: datasets and benchmarks
Submission Number: 16308