<h1 align="center">VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments</h1>

## 📝 Overview

![overview](images/Overview.jpg)

We introduce **VS-Bench**, a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. **VS-Bench** comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: **perception** measured by element recognition accuracy; **strategic reasoning** measured by next-action prediction accuracy; and **decision-making** measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return.
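This README does not spell out how the three metrics are computed. As a rough illustration only, the sketch below assumes that accuracy is the fraction of exact matches and that episode returns are rescaled so that a random baseline maps to 0 and an optimal reference maps to 100. Both the formulas and the function names are assumptions for illustration; the benchmark's own code may differ.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def normalized_return(episode_return, random_return, optimal_return):
    """Rescale a return so random play scores 0 and the optimal reference scores 100.

    The choice of reference points is an assumption, not taken from this README.
    """
    span = optimal_return - random_return
    if span == 0:
        raise ValueError("optimal_return and random_return must differ")
    return 100.0 * (episode_return - random_return) / span
```

Under these assumptions, a model that recovers half of the gap between random and optimal play would score 50.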



## 📦 Installation
Set up a Conda environment:
```bash
conda create -n vs-bench python=3.10 -y
conda activate vs-bench
pip install -r requirements.txt
```

## ⚡ Quickstart

To run a minimal example, first set the `OPENAI_API_KEY` environment variable using your own OpenAI API key:

```bash
export OPENAI_API_KEY=<your_api_key>
```

Next, you can run the following command to evaluate the decision-making ability of GPT-4.1 in the Tic-Tac-Toe environment:

```bash
python main.py --eval decision-making --exp tic_tac_toe
```

The results of this experiment, including the episode returns, images of each step in the match, and GPT-4.1's responses, will be saved in the `./results/decision-making` directory.


## 🚀 Experiments

Our evaluation considers three dimensions: perception, strategic reasoning, and decision-making.

### Perception

We provide 400 samples for each environment to evaluate the perception capability of VLMs. You can download the **VS-Bench** dataset from [Hugging Face](https://huggingface.co/datasets/VS-Bench/VS-Bench) and place it in the `./data/` directory. Note that the `perception` folder is specifically used for testing perception.


Next, run the following command to evaluate perception:
```bash
python main.py --eval perception --exp <exp_name>
```
Replace `<exp_name>` with one of the environment names provided in the `./configs/env_configs` directory.


### Strategic Reasoning

We provide 400 samples for each environment to evaluate the strategic reasoning capability of VLMs. You can download the **VS-Bench** dataset from [Hugging Face](https://huggingface.co/datasets/VS-Bench/VS-Bench) and place it in the `./data/` directory. Note that the `reasoning` and `text_reasoning` (without visual information) folders are specifically used for testing strategic reasoning.

Next, run the following command to evaluate strategic reasoning:
```bash
python main.py --eval strategic-reasoning --exp <exp_name>
```
Replace `<exp_name>` with one of the environment names provided in the `./configs/env_configs` directory.
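To run an evaluation over every environment in one go, a small helper along these lines may be convenient. It is a minimal sketch, assuming each environment in `./configs/env_configs` is a `.yaml` file whose stem is the environment name; `build_commands` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def build_commands(eval_mode, config_dir="./configs/env_configs"):
    """Build one `main.py` command per environment config found in config_dir.

    Assumes (not confirmed by this README) that environments are stored as
    `<exp_name>.yaml` files directly inside config_dir.
    """
    names = sorted(p.stem for p in Path(config_dir).glob("*.yaml"))
    return [
        ["python", "main.py", "--eval", eval_mode, "--exp", name]
        for name in names
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, for example with `eval_mode` set to `perception` or `strategic-reasoning`.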

### Decision-Making

To evaluate decision-making ability, run the following command:
```bash
python main.py --eval decision-making --exp <exp_name>
```
Replace `<exp_name>` with one of the experiment names provided in the `./configs/exp_configs` directory.

The default configuration file for each `<exp_name>` is located at `./configs/exp_configs/<exp_name>.yaml`. Below is the configuration file for Tic-Tac-Toe:

```yaml
experiment:
  name: default
  seed: 0
  async_mode: true
  num_episodes: 10
  results_dir: results

environment: tic_tac_toe

agents:
  - type: prompt_agent
    params:
      model: gpt-4.1
      visual_obs: true

  - type: mcts_agent
```
By default, the VLM is set to GPT-4.1. To use a different VLM, change the `model` parameter in the configuration file. All available VLMs can be found in the `./configs/model_configs/` directory.

We offer two different VLM agent types:
- `prompt_agent` (let the VLM only output the action)
- `cot_agent` (let the VLM think step by step)

Additionally, to compare VLM performance with traditional algorithms, we provide three baseline agents:

- `random_agent`
- `mcts_agent` (for board games)
- `cfr_agent` (for card games)
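When comparing several agent combinations, it can help to derive config variants programmatically instead of editing the YAML by hand. The sketch below works on a plain dictionary that mirrors the Tic-Tac-Toe configuration shown above (for example, as a YAML parser would load it). `with_agents` is a hypothetical helper, and it assumes a two-agent layout with the VLM agent listed first; whether a given baseline is meaningful for a given environment is not checked.

```python
import copy

# Agent type names as listed in this README.
VLM_AGENTS = {"prompt_agent", "cot_agent"}
BASELINE_AGENTS = {"random_agent", "mcts_agent", "cfr_agent"}


def with_agents(config, vlm_type, model, baseline_type):
    """Return a copy of a two-agent config with agent types and model replaced."""
    if vlm_type not in VLM_AGENTS:
        raise ValueError(f"unknown VLM agent type: {vlm_type}")
    if baseline_type not in BASELINE_AGENTS:
        raise ValueError(f"unknown baseline agent type: {baseline_type}")
    new_config = copy.deepcopy(config)  # leave the caller's dict untouched
    vlm, baseline = new_config["agents"]  # assumes exactly two agents, VLM first
    vlm["type"] = vlm_type
    vlm.setdefault("params", {})["model"] = model
    baseline["type"] = baseline_type
    return new_config


# Dictionary form of the Tic-Tac-Toe configuration.
tic_tac_toe = {
    "experiment": {
        "name": "default",
        "seed": 0,
        "async_mode": True,
        "num_episodes": 10,
        "results_dir": "results",
    },
    "environment": "tic_tac_toe",
    "agents": [
        {"type": "prompt_agent", "params": {"model": "gpt-4.1", "visual_obs": True}},
        {"type": "mcts_agent"},
    ],
}

# Chain-of-thought VLM agent against the random baseline.
variant = with_agents(tic_tac_toe, "cot_agent", "gpt-4.1", "random_agent")
```

The helper copies the input, so the original configuration can be reused to generate further variants.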


## Human Evaluation

We provide complete scripts for evaluating human-level performance by allowing human players to directly participate in the game.  
For single-player experiments, the game can be launched on a single computer. For multi-player settings (two or more players), we recommend using the same number of computers as players. All computers should be connected to a shared directory, with one machine acting as the **host** and the others as **clients**.  
In addition to running its own client process, the host must launch the main process (`main.py`), which transmits game information to all clients.


### YAML Configuration

First, set the `user_terminal_path` to the shared directory where each player will read the latest game state and related information.  
Next, configure the corresponding game YAML file to use human agents and synchronous mode. Specifically, set `async_mode` to `false` and specify `human_agent` as the agent type. For example:


```yaml
experiment:
  name: default
  async_mode: false
  results_dir: results_human

user_terminal_path: /YOUR/SHARE/DIRECTORY

environment:
  - simple_push:
      num_episodes: 5
      seed: 1
      agents:
        - type: "human_agent:0"        
        - type: "builtin_agent"
```
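Before starting a session, the settings described above can be sanity-checked. The following is a minimal sketch that operates on a dictionary mirroring the example YAML; `check_human_config` is a hypothetical helper, and the rules it enforces are only the ones stated in this section (synchronous mode, a shared directory, and at least one `human_agent`).

```python
def check_human_config(config):
    """Return a list of problems found in a human-evaluation config dict."""
    problems = []
    # The README asks for async_mode to be set to false explicitly.
    if config.get("experiment", {}).get("async_mode") is not False:
        problems.append("experiment.async_mode must be false")
    if not config.get("user_terminal_path"):
        problems.append("user_terminal_path must point to the shared directory")
    # Environments are a list of single-key mappings, as in the example above.
    has_human = False
    for entry in config.get("environment", []):
        for settings in entry.values():
            for agent in settings.get("agents", []):
                if str(agent.get("type", "")).startswith("human_agent"):
                    has_human = True
    if not has_human:
        problems.append("no agent of type human_agent found")
    return problems
```

An empty list means the config satisfies these checks; it does not guarantee that the shared directory is reachable from every machine.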


### Multiplayer Setup

Assume there are two players: `player0` and `player1`.

On the **host machine** (here, `player0`'s), open two terminal windows:

In the **first terminal**, start the main process:
```bash
python main.py --eval human-hci --exp human
```

In the **second terminal**, run:
```bash
python user.py --player 0
```

On the **other player's machine**, run:
```bash
python user.py --player 1
```
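The same pattern can be written down for any number of players. The sketch below only assembles the commands shown above into a per-machine plan; `launch_plan` is a hypothetical helper, and extending the two-player layout to more players by incrementing `--player` is an assumption rather than something this README states.

```python
def launch_plan(num_players):
    """Map each machine to the commands it should run, with player 0 on the host."""
    if num_players < 1:
        raise ValueError("num_players must be at least 1")
    plan = {
        # The host runs the main process plus its own client.
        "host": [
            "python main.py --eval human-hci --exp human",
            "python user.py --player 0",
        ]
    }
    # Every other machine runs a single client process.
    for i in range(1, num_players):
        plan[f"client{i}"] = [f"python user.py --player {i}"]
    return plan
```

For two players this reproduces the setup above: two commands on the host and `python user.py --player 1` on the second machine.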
