ProactiveBench: Benchmarking Proactive Reasoning in Multimodal LLMs

10 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY-NC 4.0
Keywords: reasoning, multimodal large language models, proactivity
TL;DR: We propose a benchmark to evaluate multimodal large language models' proactiveness.
Abstract: How do multimodal large language models (MLLMs) respond when the object of interest in an image is partially or fully occluded? While a human would naturally ask follow-up questions or seek additional visual cues before arriving at the correct answer, do MLLMs exhibit similar “proactive” behavior by prompting the user for more information? Despite their growing use in human-machine collaborative settings, no existing benchmark systematically evaluates the proactiveness of MLLMs. To address this gap, we introduce ProactiveBench, a benchmark constructed from seven repurposed datasets tailored to this task. Since proactiveness can manifest in several forms, our benchmark spans tasks such as recognizing occluded objects and individuals, enhancing image quality, and interpreting coarsely drawn sketches, among others. We evaluated 14 open-weight MLLMs on ProactiveBench and found that MLLMs generally lack proactiveness. Further analyses reveal no clear correlation between model capacity and proactiveness. Adding “hints” to the query to encourage proactive suggestions yields only marginal performance improvements. Surprisingly, including conversation histories introduces negative biases in proposing actions. Overall, the experimental results show that instilling proactiveness in MLLMs is indeed challenging, and we hope that ProactiveBench will positively contribute to building more proactive models. Code and benchmark are available at: https://anonymous.4open.science/r/ProactiveBench.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/submission1331/ProactiveBench
Code URL: https://anonymous.4open.science/r/ProactiveBench/README.md
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1331