Abstract:
Evaluating instruction-following capabilities in multimodal, multi-turn dialogue presents significant challenges, particularly when multiple instructions are distributed throughout the conversation. Current evaluation approaches rely on either time-intensive human ratings or LLM-based judges, which we show exhibit a systematic bias toward responses from their own model family. We address these challenges by introducing MMMT-IF, a benchmark that augments image-based question answering with global answer-format instructions distributed across conversation turns. All instructions are verifiable through code execution, enabling objective evaluation. To measure performance, we introduce the Programmatic Instruction Following (PIF) metric, which quantifies the fraction of instructions correctly followed during reasoning tasks. This metric shows a 60% correlation with human ratings, validating its reliability. Evaluation of leading models (Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) reveals substantial performance degradation as conversations progress, with average PIF scores dropping from 0.81 at turn 1 to 0.64 at turn 20. Performance deteriorates further under a consistency test: when generating four responses per turn, GPT-4o and Gemini follow all instructions only 11% of the time. Notably, when instructions are appended to the end of the conversation rather than distributed throughout it, PIF scores improve by 22.3 points on average, indicating that the main challenge is retrieving multiple instructions from different parts of the input context rather than instruction following itself. The MMMT-IF dataset and metric computation code will be open-sourced.
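To make the metric concrete, below is a minimal, hypothetical sketch of how a PIF-style score could be computed for a single turn: each instruction is a code-verifiable predicate over the model's response, and the score is the fraction of active instructions that pass. The instruction names and checker functions here are illustrative assumptions, not the benchmark's actual checkers (which, per the abstract, will be open-sourced).

```python
# Hypothetical sketch of a PIF-style (Programmatic Instruction Following) score.
# Each instruction is a verifiable predicate over the response text; the score
# for a turn is the fraction of active instructions the response satisfies.
from typing import Callable, Dict, List

# Illustrative placeholder instructions, not the actual MMMT-IF checkers.
INSTRUCTION_CHECKS: Dict[str, Callable[[str], bool]] = {
    "answer_in_uppercase": lambda text: text == text.upper(),
    "end_with_period": lambda text: text.rstrip().endswith("."),
    "at_most_20_words": lambda text: len(text.split()) <= 20,
}

def pif_score(response: str, active_instructions: List[str]) -> float:
    """Return the fraction of active instructions the response follows."""
    if not active_instructions:
        return 1.0
    passed = sum(INSTRUCTION_CHECKS[name](response) for name in active_instructions)
    return passed / len(active_instructions)

# Example: instructions accumulate across turns, so later turns are checked
# against every instruction issued so far in the conversation.
response = "THE IMAGE SHOWS TWO DOGS PLAYING IN A PARK."
print(pif_score(response, ["answer_in_uppercase", "end_with_period"]))  # 1.0
```

Averaging this per-turn score over conversations and turn indices would yield the kind of turn-level curve reported in the abstract (e.g., 0.81 at turn 1 versus 0.64 at turn 20).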
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: multimodal QA, benchmarking, evaluation methodologies, metrics, retrieval
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 5939