VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

04 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: vision language model, multi-agent, strategic reasoning, decision-making
TL;DR: We introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark to evaluate VLMs for strategic reasoning and decision-making in multi-agent environments.
Abstract: Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting under rich visual and language observations, posing challenges in both multimodal perception and strategic interaction. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLM agents for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to infer other agents' future moves and optimize long-term objectives. We consider two complementary evaluation dimensions: offline evaluation of strategic reasoning via next-action prediction accuracy, and online evaluation of decision-making via normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best model achieving 45.8% average prediction accuracy and 26.3% average normalized return. We further conduct in-depth analyses of multimodal input, social dilemma behaviors, and failure cases of VLM agents. By highlighting the limitations of existing models, we envision our work as a foundation for future explorations in strategic multimodal agents. Code and data are available at https://sites.google.com/view/vs-bench-nips.
Croissant File: json
Dataset URL: https://kaggle.com/datasets/1adb7686abdd9cb20d9b6f51d66fcce8195af96ce9adb62b86a23d497ddd84ec
Code URL: https://anonymous.4open.science/r/VS-Bench-0515
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 487