VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

04 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: vision language model, multi-agent, strategic reasoning, decision-making
TL;DR: We introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark to evaluate VLMs for strategic reasoning and decision-making in multi-agent environments.
Abstract: Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting under rich visual and language observations, posing challenges in both multimodal perception and strategic interaction. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLM agents for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to infer other agents' future moves and optimize long-term objectives. We consider two complementary evaluation dimensions: offline evaluation of strategic reasoning via next-action prediction accuracy, and online evaluation of decision-making via normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best model achieving 45.8% average prediction accuracy and 26.3% average normalized return. We further conduct in-depth analyses of multimodal input, social dilemma behaviors, and failure cases of VLM agents. By highlighting the limitations of existing models, we envision our work as a foundation for future explorations in strategic multimodal agents. Code and data are available at https://sites.google.com/view/vs-bench-nips.
Croissant File: json
Dataset URL: https://kaggle.com/datasets/1adb7686abdd9cb20d9b6f51d66fcce8195af96ce9adb62b86a23d497ddd84ec
Code URL: https://anonymous.4open.science/r/VS-Bench-0515
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 487