Abstract:
Evaluating instruction-following capabilities in multimodal, multi-turn dialogue presents significant challenges, particularly when multiple instructions are distributed throughout the conversation. Current evaluation approaches rely on either time-intensive human ratings or LLM-based judges, which we show exhibit a systematic bias toward responses from their own model family. We address these challenges by introducing MMMT-IF, a benchmark that augments image-based question answering with global answer-format instructions distributed across conversation turns. All instructions are verifiable through code execution, enabling objective evaluation. To measure performance, we introduce the Programmatic Instruction Following (PIF) metric, which quantifies the fraction of instructions correctly followed during reasoning tasks. This metric shows a 60% correlation with human ratings, validating its reliability. Evaluation of leading models (Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) reveals substantial performance degradation as conversations progress, with average PIF scores dropping from 0.81 at turn 1 to 0.64 at turn 20. Performance deteriorates further under a consistency test: when generating four responses per turn, GPT-4o and Gemini follow all instructions only 11% of the time. Notably, when instructions are appended to the end of the conversation rather than distributed throughout it, PIF scores improve by 22.3 points on average, indicating that the main challenge is retrieving multiple instructions from different parts of the input context rather than instruction following itself. The MMMT-IF dataset and metric computation code will be open-sourced.
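To make the metric concrete, below is a minimal, hypothetical sketch of how a PIF-style score could be computed for a single turn: each instruction is a code-verifiable predicate over the model's response, and the score is the fraction of active instructions that pass. The instruction names and checker functions here are illustrative assumptions, not the benchmark's actual checkers (which, per the abstract, will be open-sourced).

```python
# Hypothetical sketch of a PIF-style (Programmatic Instruction Following) score.
# Each instruction is a verifiable predicate over the response text; the score
# for a turn is the fraction of active instructions the response satisfies.
from typing import Callable, Dict, List

# Illustrative placeholder instructions, not the actual MMMT-IF checkers.
INSTRUCTION_CHECKS: Dict[str, Callable[[str], bool]] = {
    "answer_in_uppercase": lambda text: text == text.upper(),
    "end_with_period": lambda text: text.rstrip().endswith("."),
    "at_most_20_words": lambda text: len(text.split()) <= 20,
}

def pif_score(response: str, active_instructions: List[str]) -> float:
    """Return the fraction of active instructions the response follows."""
    if not active_instructions:
        return 1.0
    passed = sum(INSTRUCTION_CHECKS[name](response) for name in active_instructions)
    return passed / len(active_instructions)

# Example: instructions accumulate across turns, so later turns are checked
# against every instruction issued so far in the conversation.
response = "THE IMAGE SHOWS TWO DOGS PLAYING IN A PARK."
print(pif_score(response, ["answer_in_uppercase", "end_with_period"]))  # 1.0
```

Averaging this per-turn score over conversations and turn indices would yield the kind of turn-level curve reported in the abstract (e.g., 0.81 at turn 1 versus 0.64 at turn 20).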
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: multimodal QA, benchmarking, evaluation methodologies, metrics, retrieval
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 5939