Keywords: reasoning, multimodal large language models, proactivity
TL;DR: We propose a benchmark to evaluate multimodal large language models' proactiveness.
Abstract: How do multimodal large language models (MLLMs) handle images where the object of interest is partially or fully occluded? While a human would naturally ask follow-up questions or seek additional visual cues before answering, do MLLMs exhibit similar “proactive” behavior by prompting the user for more information? Despite their growing use in collaborative settings, no benchmark currently evaluates the proactiveness of MLLMs. To fill this gap, we introduce ProactiveBench, a benchmark built from seven repurposed datasets to evaluate proactiveness across tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 21 MLLMs on ProactiveBench and find that they generally lack proactiveness. Model capacity shows no clear correlation with proactiveness, and adding “hints” to the query to elicit proactive suggestions yields only marginal gains. Surprisingly, conversation histories and in-context learning introduce negative biases that hinder performance. Overall, our results highlight the challenge of instilling proactiveness in MLLMs, with ProactiveBench serving as a first step toward building more proactive models.
Primary Area: datasets and benchmarks
Submission Number: 19025