Keywords: reasoning, multimodal large language models, proactivity
TL;DR: We propose a benchmark to evaluate multimodal large language models' proactiveness.
Abstract: How do multimodal large language models (MLLMs) handle images where the object of interest is partially or fully occluded? While a human would naturally ask follow-up questions or seek additional visual cues before answering, do MLLMs exhibit similar “proactive” behavior by prompting the user for more information? Despite their growing use in collaborative settings, no benchmark currently evaluates the proactiveness of MLLMs. To fill this gap, we introduce ProactiveBench, a benchmark built from seven repurposed datasets to evaluate proactiveness across tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 21 MLLMs on ProactiveBench and find that they generally lack proactiveness. Model capacity shows no clear correlation with proactiveness, and adding “hints” to the query to elicit proactive suggestions yields only marginal gains. Surprisingly, conversation histories and in-context learning introduce negative biases that hinder performance. Overall, our results highlight the challenge of instilling proactiveness in MLLMs, with ProactiveBench serving as a first step toward building more proactive models.
Primary Area: datasets and benchmarks
Submission Number: 19025