# COMMA: A Communicative Multimodal Multi-Agent Benchmark

![Local Image](./assets/agents.png)

COMMA is a novel benchmark designed to evaluate the collaborative performance of multimodal
multi-agent systems through language communication.

We assess multimodal multi-agent systems using a series of carefully designed collaborative puzzle
games. These scenarios typically involve two-player setups in which the agents have access to different,
complementary information.

Our benchmark features 10 customizable puzzles with thousands of solutions. We evaluated both AI-AI and AI-Human settings, testing popular closed-source (GPT-4V, GPT-4o, GPT-4o1) and open-source (Qwen-VL, InternVL) multimodal models. Notably, the GPT models did not surpass a basic random baseline in the AI-AI setting, indicating room for improvement.

## 🚀Quickstart

### Installation

```
$ conda create -n comma python=3.10
$ conda activate comma
$ pip install -r requirements.txt
```

To evaluate models on COMMA, specify the Solver and Expert agents in `./config/experiment_config.json`:

```
[
    {
        "Expert": {
            "file_path": "agents/gpt4o_agent.py",
            "class_name": "GPT4oAgent"
        }
    },
    {
        "Solver": {
            "file_path": "agents/humanAgent.py",
            "class_name": "HumanAgent"
        }
    }
]
```
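Each entry in the config pairs a role with a source file and a class name. As a rough illustration of how such a file could be resolved into Python classes, here is a minimal sketch using the standard library. The function name `load_agent_classes` and the `repo_root` parameter are assumptions made for this example; the benchmark's own loader may work differently.

```python
import importlib.util
import json
from pathlib import Path


def load_agent_classes(config_path, repo_root="."):
    """Map each role ("Expert", "Solver") to the class named in the config.

    Hypothetical helper: it only assumes the JSON layout shown above.
    """
    entries = json.loads(Path(config_path).read_text(encoding="utf-8"))
    classes = {}
    for entry in entries:
        # Each list entry maps one role name to a file_path/class_name pair.
        for role, spec in entry.items():
            module_path = Path(repo_root) / spec["file_path"]
            module_spec = importlib.util.spec_from_file_location(
                module_path.stem, str(module_path)
            )
            module = importlib.util.module_from_spec(module_spec)
            module_spec.loader.exec_module(module)  # runs the agent file
            classes[role] = getattr(module, spec["class_name"])
    return classes
```

Calling `load_agent_classes("config/experiment_config.json")` from the repository root would then return a dictionary with one class per role, which is a convenient way to check that both `file_path` and `class_name` are spelled correctly before starting a run.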

Then run the script:

```
$ cd scripts/
$ python run.py
$ # add --gui to run with the GUI
```

### Test on Remote Server

This section explains how to run the experiments on a remote server, especially one without a GUI.

1. **Install Docker**. Follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine. 
2. **Enter a Docker Container**.
    ```
    docker run -it --rm -p5900:5900 ubuntu:20.04
    ```
3. **Install the X components**.
    ```
    apt update
    apt install -y xserver-xorg xvfb x11vnc
    ```
4. **Run the script with a virtual screen**. For instance:
    ```
    xvfb-run python main.py
    ```
5. (Optional) Use a VNC server to see the screen.
   1) Open a new terminal and run `ps -ef |grep auth` to find the location of the Auth file:
    ```
    root@13785a282294:/# ps -ef |grep auth
    root        7417    7408  1 11:47 pts/0    00:00:00 Xvfb :99 -screen 0 1280x1024x24 -nolisten tcp -auth /tmp/xvfb-run.RCwemo/Xauthority
    root        7449    5837  0 11:47 pts/1    00:00:00 grep --color=auto auth
    ```
    `/tmp/xvfb-run.RCwemo/Xauthority` is the path of the Auth file, which is randomly generated on each run.

    `:99` is the display number of the virtual screen; `xvfb-run` uses 99 by default.

   2) **Start the VNC server**.
   ```
   x11vnc -display :99 -auth /tmp/xvfb-run.RCwemo/Xauthority  # replace the path with your Auth file
   ```
   x11vnc listens on port `5900` by default.

   3) **Use a VNC client (e.g. `TightVNC` or `RealVNC Viewer`) to view the screen**.
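Because the Auth file path changes on every run, copying it by hand can get tedious. The following is a minimal sketch of how the display number and Auth path could be pulled out of the `ps -ef` output and turned into the `x11vnc` command from the steps above. The helper name `x11vnc_command` is an assumption made for this example and is not part of the repository.

```python
import re
import shlex


def x11vnc_command(ps_output):
    """Build the x11vnc argument list from `ps -ef` output.

    Looks for an Xvfb process line such as
    "Xvfb :99 ... -auth /tmp/xvfb-run.XXXXXX/Xauthority".
    Returns None if no such line is found.
    """
    for line in ps_output.splitlines():
        match = re.search(r"\bXvfb\s+(:\d+)\b.*?-auth\s+(\S+)", line)
        if match:
            display, auth_file = match.groups()
            return ["x11vnc", "-display", display, "-auth", auth_file]
    return None


# Example with the process line shown in the steps above.
sample = (
    "root 7417 7408 1 11:47 pts/0 00:00:00 Xvfb :99 -screen 0 1280x1024x24 "
    "-nolisten tcp -auth /tmp/xvfb-run.RCwemo/Xauthority"
)
print(shlex.join(x11vnc_command(sample)))
```

In practice the input string would come from running `ps -ef` inside the container, and the returned list could be passed to `subprocess.run` to start the VNC server.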

### Deploy Your Models
Use `agents/template.py` as a generic agent template to test your own models on COMMA.
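To give a feel for what a custom agent might involve, here is a purely illustrative sketch. The class name `EchoAgent`, the `respond` method, and its `image_path` parameter are all hypothetical and are not taken from the template; consult the template file itself for the interface the benchmark actually expects.

```python
class EchoAgent:
    """Hypothetical agent stub; the real interface lives in the template file."""

    def __init__(self, role):
        self.role = role    # assumed to be "Solver" or "Expert"
        self.history = []   # conversation so far, as (speaker, text) pairs

    def respond(self, message, image_path=None):
        """Assumed entry point: take the partner's message, return a reply."""
        self.history.append(("partner", message))
        # Replace this stub with a call to your own multimodal model,
        # passing the message and (if used) the image at image_path.
        reply = f"[{self.role}] received: {message}"
        self.history.append((self.role, reply))
        return reply
```

A class like this would then be referenced from the experiment config through its file path and class name.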

### Summarize Results After Experiments
By default, running the experiments saves the conversations between agents to a folder called `outputs`. To summarize the results
from the conversations in an output folder, run:

```
python summarize_results.py --result_folder <path_to_your_folder_containing_agent_conversations>
```

## 🥇Leaderboard

| Rank  |      Solver/Expert      | Success Rate % (↑) |
| :---: | :---------------------: | :----------------: |
|   1   |     Random/InternVL     |         56         |
|   2   |       GPT4V/GPT4V       |         53         |
|   3   |       GPT4o/GPT4o       |         50         |
|   4   |     InternVL/GPT4o      |         46         |
|   5   |    QwenVL2b/QwenVL2b    |         38         |
|   6   |   InternVL8b/QwenVL7b   |         36         |
|   7   |     QwenVL7b/GPT4o      |         36         |
|   8   | InternVL26b/InternVL26b |         35         |
|   9   |   QwenVL7b/InternVL8b   |         33         |
|  10   |    QwenVL7b/QwenVL7b    |         32         |
|  11   |  InternVL8b/InternVL8b  |         30         |