MMPCBench: Benchmarking Multimodal Large Language Models on Proactive Critique of Flawed Inputs

ACL ARR 2026 January Submission 9959 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: benchmarking, evaluation methodologies, evaluation
Abstract: As Multimodal Large Language Models (MLLMs) evolve into sophisticated interactive assistants, their reliability depends not only on following instructions but also on validating them. We term this capability \textit{Proactive Critique}: the ability of a model to autonomously detect, diagnose, and resolve erroneous user inputs without explicit prompting. However, current evaluations primarily assess performance under ideal conditions or focus on simple refusal behaviors, largely overlooking the complexity of active error handling and the consistency of model reasoning. To bridge this gap, we introduce \textbf{MMPCBench}, a holistic framework designed to evaluate the proactive reliability of MLLMs. MMPCBench features a fine-grained taxonomy of 12 error subcategories, ranging from cross-modal contradictions to missing visual premises, constructed through a rigorous multi-stage filtration pipeline. Beyond standard accuracy, we propose a hierarchical evaluation protocol that measures error detection, diagnostic precision, and strategic effectiveness. Crucially, we introduce novel alignment-aware metrics to quantify the consistency between a model's internal reasoning and its final response. Our extensive evaluation of 14 MLLMs reveals that current models struggle significantly with proactive critique, particularly with subtle visual anomalies. Notably, we uncover a pervasive ``consistency gap'': models often correctly identify errors during internal reasoning yet suppress these insights in their final outputs to maintain compliance.
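
Note: the abstract does not spell out the alignment-aware metric formulas. Purely as an illustration, the sketch below shows one plausible way a "consistency gap" could be scored, assuming each evaluated sample has already been annotated with two binary labels (whether the reasoning trace flags the injected error, and whether the final response surfaces it). All names here are hypothetical and are not the authors' implementation.

```python
# Hypothetical sketch of a consistency-gap score: the fraction of samples
# where the model's reasoning trace catches the flawed input but the final
# response suppresses it. Labels are assumed to come from an upstream
# annotation step not shown here.
from dataclasses import dataclass

@dataclass
class Sample:
    error_in_reasoning: bool  # reasoning trace identifies the flawed input
    error_in_response: bool   # final answer surfaces the error to the user

def consistency_gap(samples: list[Sample]) -> float:
    """Among samples whose reasoning detects the error, return the fraction
    whose final response fails to mention it."""
    detected = [s for s in samples if s.error_in_reasoning]
    if not detected:
        return 0.0
    suppressed = sum(1 for s in detected if not s.error_in_response)
    return suppressed / len(detected)

if __name__ == "__main__":
    demo = [Sample(True, True), Sample(True, False), Sample(False, False)]
    print(f"consistency gap: {consistency_gap(demo):.2f}")  # prints 0.50
```

A higher value under this (assumed) definition would indicate the suppression behavior the abstract describes: errors recognized internally but withheld from the user to maintain compliance.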
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, Large Multimodal Models (LMMs)
Contribution Types: Data resources
Languages Studied: English
Submission Number: 9959