# Multi-Agent Game 🎮 Generation and Evaluation via Audio-Visual Recordings 📹

This repo contains the code for the paper "Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings".

This work presents an evaluation metric (AVR-Eval) and a multi-agent framework (AVR-Agent) for multimedia content generation (video games and animations).

See `/experiments_simple` for a minimal example of how to run AVR-Agent and AVR-Eval. The commands and the framework are described in more detail below.

<p align="center">
  <img src="images/before_after.png" alt="Before-After"  style="width: 90%;">
</p>

## Table of Contents

- [Features](#features)
- [AVR-Agent](#avr-agent)
- [AVR-Eval](#avr-eval)
- [Requirements](#requirements)
- [Experiments](#experiments)
- [Quick Start](#quick-start)
- [Usage Examples](#usage-examples)
  - [AVR-Eval: Relative Content Evaluation](#avr-eval-relative-content-evaluation) 
  - [AVR-Agent: Content Generation](#avr-agent-content-generation)
- [Key Parameters](#key-parameters)
- [Output Files](#output-files)
- [Citation](#citation)
  
## Features

- **AVR-Eval**: Relative evaluation metric that compares video+audio A with video+audio B to determine which content (A or B) is better.
- **AVR-Agent**: Multi-agent framework that leverages multimedia assets and Audio-Visual Recordings (AVR) to generate JavaScript web games and animations.

## AVR-Agent

The framework uses two agents to create and improve multimedia content based on Audio-Visual Recordings (AVR).

<p align="center">
  <img src="images/VGAgent2.png" alt="AVR-Agent"  style="width: 75%;">
</p>

**Workflow:**
1. **Initial Generation**: Coding agent creates initial content based on description and assets
2. **Recording**: System records gameplay/animation video with audio and console logs
3. **Evaluation**: Evaluator agent analyzes the recording and provides feedback
4. **Improvement**: Coding agent iteratively improves content based on feedback
5. **Selection**: Best-of-k mechanism selects optimal candidates (when enabled)
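
The workflow above can be sketched as a simple loop. This is a minimal illustration, not the repository's implementation: `generate`, `record`, `evaluate`, and `improve` are hypothetical callables standing in for the coding agent, the recorder, and the evaluator agent, and the stopping rule (stop once `min_iterations` improvements are done and the console logs show no errors) is an assumption based on the parameter descriptions in this README.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Recording:
    """Hypothetical container for one audio-visual recording."""
    video_path: str
    audio_path: str
    console_errors: List[str] = field(default_factory=list)


def run_avr_loop(
    description: str,
    generate: Callable[[str], str],
    record: Callable[[str], Recording],
    evaluate: Callable[[str, Recording], str],
    improve: Callable[[str, str], str],
    min_iterations: int = 3,
    max_iterations: int = 5,
) -> str:
    """Generate content, then alternate recording, evaluation, and improvement."""
    content = generate(description)  # step 1: initial generation
    for improvements_done in range(max_iterations):
        recording = record(content)  # step 2: record video, audio, console logs
        # Assumed stopping rule: enough iterations and a clean console.
        if improvements_done >= min_iterations and not recording.console_errors:
            break
        feedback = evaluate(description, recording)  # step 3: evaluator feedback
        content = improve(content, feedback)  # step 4: coding agent revises
    return content
```

Best-of-k selection (step 5) is left out of this sketch to keep the loop readable.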

## AVR-Eval

A relative evaluation is done by comparing the AVRs of two pieces of content.

<p align="center">
  <img src="images/VGAgent1.png" alt="AVR-Eval"  style="width: 55%;">
</p>
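
As a rough illustration of a relative comparison, the sketch below asks a judge which of two recordings is better. The `judge` callable is hypothetical (it stands in for the omni-modal evaluator), and querying it in both presentation orders is a common way to reduce position bias that is assumed here; it is not confirmed to be what AVR-Eval does internally.

```python
from typing import Callable

# Hypothetical judge: receives (description, first_recording, second_recording)
# and answers "first" or "second".
Judge = Callable[[str, str, str], str]


def compare_recordings(description: str, avr_a: str, avr_b: str, judge: Judge) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(description, avr_a, avr_b)   # A is shown first
    backward = judge(description, avr_b, avr_a)  # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # the two orders disagree
```

A judge that always answers "first" yields a tie, which makes position bias visible instead of silently favoring one side.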



## Requirements

- Python 3.10 (or similar)
- CUDA 12.6.0 (or similar)
- Linux (could work with Windows with modifications)

Python requirements:
```bash
pip install --upgrade pip wheel setuptools
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install transformers==4.52.3 vllm # the only transformers version that handles Qwen2.5-Omni
pip install selenium
pip install pygltflib librosa soundfile pandas
pip install --upgrade openai
pip install mistral-common --upgrade
```

### Chromium Browser Setup
```bash
cd ${HOME}
mkdir -p chromium
cd chromium
wget https://storage.googleapis.com/chrome-for-testing-public/136.0.7103.94/linux64/chrome-linux64.zip
wget https://storage.googleapis.com/chrome-for-testing-public/136.0.7103.94/linux64/chromedriver-linux64.zip
unzip chrome-linux64.zip
unzip chromedriver-linux64.zip
chmod +x chromedriver-linux64/chromedriver
```

### OpenRouter

Using [OpenRouter](https://openrouter.ai/) is recommended if you do not have the hardware to run large coding models. You can buy credits on their website; then, with your API key, you can query any model, and requests are routed automatically to the cheapest provider. The best coding models in the experiments are Kimi-K2 (1T parameters) and Qwen3-Coder (480B parameters). Since they are open-source, they are much cheaper than closed-source models. If you do have the hardware, the code can also run models locally.

### Linux System Dependencies

**With sudo rights:**
```bash
sudo apt-get install alsa-utils pulseaudio xvfb ffmpeg 
```

**Without sudo rights:** If you are unlucky like me, you have to install ALSA, PulseAudio, and FFmpeg manually. The commands I used are in `manual_installation.sh`; they are written for my setup on the Mila cluster and might not work for you.
Once everything is installed, I highly recommend testing the audio-video recording functionality before using the code. The recording system involves complex interactions between multiple components (Xvfb, PulseAudio, FFmpeg, Chromium), and system configurations can vary significantly.

Run the test script to verify your setup:
```bash
output_folder="my_games" # your folder containing the output games and animations (use an absolute path, since the script changes directory)
avr_folder="AVR_Eval_Agent" # your folder containing the GitHub code (use an absolute path)
chromium_path=${HOME}/chromium # your path to chromium
display_num=100

# Load local server
cd ${output_folder}
python3 -m http.server &

# Load PulseAudio
OUTPUT=$(${avr_folder}/setup_pulseaudio_sink.sh ${display_num})
MONITOR_SOURCE=$(echo "$OUTPUT" | grep MONITOR_SOURCE | cut -d'=' -f2)

# Test
python ${avr_folder}/test_video_recording.py --duration 5 --with_audio --chromium_path ${chromium_path} --display_num ${display_num} --monitor_source $MONITOR_SOURCE --server_root ${output_folder}
```
If the test fails, check the output for specific error messages and ensure all dependencies are properly installed.

#### How it works

Audio-Visual Recordings are tricky on a headless Linux setup (a server with no screen or speakers). Chromium is the browser, driven from Python through Selenium. To render audio without speakers, you need ALSA and PulseAudio. To render video without a screen, you need the X virtual framebuffer (Xvfb). With those, you can use FFmpeg to record video+audio from a browser without a screen, speakers, or GPU. Note that the code currently records the video first, then the audio, as two separate recordings. I found this necessary because I had syncing issues otherwise. It's easy to merge audio and video when needed.

Note that if you do not have Linux or you have a different setup, you might need to modify `video_utils.py` accordingly.
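
To make the recording pipeline more concrete, here is a minimal sketch that builds the three FFmpeg commands involved: capture the Xvfb display, capture the PulseAudio monitor source, and merge the two files. The function names and the default `1280x720` screen size are assumptions for illustration; the flags are standard FFmpeg options, but these are not necessarily the exact commands used in `video_utils.py`.

```python
from typing import List


def video_command(display_num: int, duration: int, fps: int, out_path: str,
                  size: str = "1280x720") -> List[str]:
    """Grab the virtual X display (Xvfb) for `duration` seconds."""
    return [
        "ffmpeg", "-y",
        "-f", "x11grab",          # X11 screen capture input
        "-video_size", size,      # assumed to match the Xvfb screen size
        "-framerate", str(fps),
        "-i", f":{display_num}",  # e.g. ":100"
        "-t", str(duration),
        out_path,
    ]


def audio_command(monitor_source: str, duration: int, out_path: str) -> List[str]:
    """Record what is played into the PulseAudio sink via its monitor source."""
    return [
        "ffmpeg", "-y",
        "-f", "pulse",
        "-i", monitor_source,
        "-t", str(duration),
        out_path,
    ]


def merge_command(video_path: str, audio_path: str, out_path: str) -> List[str]:
    """Mux the separately recorded video and audio into one file."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",  # keep the video stream as recorded
        "-c:a", "aac",
        "-shortest",     # stop at the shorter of the two streams
        out_path,
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once Xvfb, PulseAudio, and Chromium are running.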

### Assets

You can use any assets that you want. Just make sure that every folder inside `./assets` is an asset pack. Since there is no RAG, the names of the asset packs and of the assets themselves should be somewhat meaningful (good: pony.png, cat.png, dog.png; bad: tile1.png, tile2.png, tile3.png).

The links to the assets used in the paper can be found in assets.txt. All assets have a permissive license, but they do not always allow redistribution, so I cannot share them directly. Out of respect for the artists, you have to manually go to each link and click to download them one by one. Then, you have to extract them into your assets folder. 

Credits go to [domi.wav](https://domiwav.itch.io/) (Dominic Sandefur), [David KBD](https://davidkbd.itch.io/), [TomMusic](https://tommusic.itch.io/) (Thomas Devlin), [Yogi](https://yogi-tronimal.itch.io/) (Tronimal), [OmegaPixelArt](https://omegapixelart.itch.io/), [doranarasi](https://doranarasi.itch.io/), and [Kenney](https://kenney.nl/assets) for their high-quality assets. The following licenses are used: David KBD assets have "CC By 4.0", Kenney assets have "CC0", TomMusic assets mention "No resale, redistribution", doranarasi assets mention "No resale, redistribution, NFT", and the remaining ones have no license.

To ensure that the folder and file names work well with Linux, I recommend replacing all spaces with "-". You can use my PowerShell script powershell_script_convert_assets_into_linux_compatible.ps (or any simple bash script) to convert spaces and other symbols into "-".
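
If you prefer Python over PowerShell or bash, the renaming can be sketched as follows. This is an illustrative alternative, not the provided script: `linux_safe_name` and `rename_tree` are hypothetical helpers, and the rule (every run of characters other than ASCII letters, digits, `.`, `_`, and `-` becomes a single "-") is an assumption about what "other symbols" should cover.

```python
import os
import re
from typing import List, Tuple


def linux_safe_name(name: str) -> str:
    """Replace runs of spaces and other symbols with a single "-"."""
    return re.sub(r"[^A-Za-z0-9._-]+", "-", name).strip("-")


def rename_tree(root: str) -> List[Tuple[str, str]]:
    """Rename files and folders under `root`, deepest entries first."""
    renamed = []
    # topdown=False visits children before their parent folder is renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            safe = linux_safe_name(name)
            if not safe or safe == name:
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, safe)
            if os.path.exists(dst):
                continue  # never overwrite an existing file or folder
            os.rename(src, dst)
            renamed.append((src, dst))
    return renamed
```

Note that this rule also replaces non-ASCII letters, so check the returned list before relying on the new names.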

## Experiments

The code for getting results with DeepSeek-V3, Gemini, Kimi-K2, Grok3-mini, and Qwen3-Coder is in `/experiments_paper`. Note that in the paper, we also trained on more models.

We provide simple code in `/experiments_simple` to run AVR-Agent with Kimi-K2 and Qwen3-Coder on games and then evaluate the results with AVR-Eval.

## Quick Start

### 1. Start Model Servers

You can load a model as a separate server to support multiple agents and parallel processing (your own HuggingFace models or APIs like OpenRouter). 

We need to load the text/coding agent and the omni-modal agent.

**For 2 GPUs:**
```bash
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-32B \
  --dtype bfloat16 \
  --api-key token-1 \
  --max-model-len 32768 \
  --gpu_memory_utilization 0.9 \
  --tensor_parallel_size 1 \
  --port 8001 &

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen2.5-Omni-7B \
  --dtype bfloat16 \
  --api-key token-2 \
  --max-model-len 32768 \
  --gpu_memory_utilization 0.9 \
  --tensor_parallel_size 1 \
  --trust-remote-code \
  --port 8002 &
```
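
Before launching a generation run, it can help to check that both servers are up. The sketch below queries the OpenAI-compatible `/v1/models` endpoint that vLLM serves, using only the standard library; the helper names are hypothetical, and the URLs and tokens are the ones from the commands above.

```python
import json
import urllib.request
from typing import List


def models_request(server_url: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated request for the server's model list."""
    url = server_url.rstrip("/") + "/v1/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def list_models(server_url: str, api_key: str, timeout: float = 10.0) -> List[str]:
    """Return the model ids reported by the server (raises if it is unreachable)."""
    request = models_request(server_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [model["id"] for model in payload.get("data", [])]
```

For example, `list_models("http://localhost:8001", "token-1")` should include `Qwen/Qwen3-32B` once the first server has finished loading.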

### 2. Set up your folders and load PulseAudio

```bash
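output_folder="my_games" # your folder containing the output games and animations (use an absolute path)
avr_folder="AVR_Eval_Agent" # your folder containing the GitHub code (use an absolute path)
api_key="YOUR-OpenRouter-API" # your OpenRouter API key
chromium_path=${HOME}/chromium # your path to chromium
display_num=100 # virtual display number used by the PulseAudio setup and the recorder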

# Start local server
cd ${output_folder}
python3 -m http.server &

OUTPUT=$(${avr_folder}/setup_pulseaudio_sink.sh ${display_num})
MODULE_ID=$(echo "$OUTPUT" | grep MODULE_ID | cut -d'=' -f2)
MONITOR_SOURCE=$(echo "$OUTPUT" | grep MONITOR_SOURCE | cut -d'=' -f2)
```
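
The `grep`/`cut` lines above assume that `setup_pulseaudio_sink.sh` prints `KEY=VALUE` lines. If you drive the setup from Python instead, the same parsing can be sketched as below; `parse_key_values` is a hypothetical helper, and the output format is inferred from the shell commands rather than from the script itself.

```python
from typing import Dict


def parse_key_values(output: str) -> Dict[str, str]:
    """Collect KEY=VALUE lines from a command's output, ignoring other lines."""
    values = {}
    for line in output.splitlines():
        key, separator, value = line.partition("=")
        if separator:  # skip lines that contain no "="
            values[key.strip()] = value.strip()
    return values
```

With the text captured from `subprocess.run(..., capture_output=True, text=True).stdout`, `parse_key_values(text)["MONITOR_SOURCE"]` gives the value to pass to `--monitor_source`.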

### 3. Generate Your First Game

```bash
# make a directory for your game
current_dir=${output_folder}/game1
mkdir $current_dir
cd $current_dir

python ${avr_folder}/video_game_builder.py \
  --use_vllm_server \
  --model_path Qwen/Qwen3-32B \
  --vllm_server_url http://localhost:8001 \
  --api_key token-1 \
  --use_separate_evaluator \
  --evaluator_model_path Qwen/Qwen2.5-Omni-7B \
  --evaluator_vllm_server_url http://localhost:8002 \
  --evaluator_api_key token-2 \
  --content_description "A simple Pong game with two paddles and a bouncing ball" \
  --min_iterations 3 \
  --max_iterations 5 \
  --video_duration 10 \
  --video_fps 2 \
  --enable_audio \
  --output_dir . \
  --display_num ${display_num} \
  --monitor_source ${MONITOR_SOURCE} \
  --chromium_path ${chromium_path} \
  --server_root ${output_folder}
```

## Usage Examples

### AVR-Eval: Relative Content Evaluation
```bash
# make a directory for the comparison results
current_dir=${output_folder}/compare_game1_vs_game2
mkdir $current_dir
cd $current_dir

python ${avr_folder}/evaluate_content.py \
  --folders ${output_folder}/game1 \
  --folders_paired ${output_folder}/game2 \
  --use_vllm_server \
  --model_path Qwen/Qwen3-32B \
  --vllm_server_url http://localhost:8001 \
  --api_key token-1 \
  --use_separate_evaluator \
  --evaluator_model_path Qwen/Qwen2.5-Omni-7B \
  --evaluator_vllm_server_url http://localhost:8002 \
  --evaluator_api_key token-2 \
  --content_description "A space shooter game" \
  --output_dir . \
  --enable_audio --relative --multiround --coding_evaluation
```

#### Extracting the second description from a csv and saving the evaluation to the csv
```bash
python ${avr_folder}/evaluate_content.py \
--row_index 2 \
--dataset YOUR-FOLDER/data/video_games_short.csv \
...
```

### AVR-Agent: Content Generation

#### Animation Generation
```bash
python ${avr_folder}/video_game_builder.py \
  --content_type animation \
  --content_description "A bouncing ball animation with realistic physics and colorful trails" \
  ...
```

#### Extracting the second description from a csv
```bash
python ${avr_folder}/video_game_builder.py \
--row_index 2 \
--dataset YOUR-FOLDER/data/video_games_short.csv \
...
```

#### Using External Assets
```bash
python ${avr_folder}/video_game_builder.py \
  --asset_dir ./my_assets \
  --select_assets \
  --max_sample_packs 5 \
  --assets_selection individual \
  --max_assets 50 \
  ...
```

#### Using OpenRouter API
```bash
python ${avr_folder}/video_game_builder.py \
  --use_vllm_server \
  --model_path google/gemini-2.5-flash \
  --vllm_server_url https://openrouter.ai/api \
  --api_key ${api_key} \
  ...
```

#### Best-of-K Generation
Generate multiple candidates and select the best one:
```bash
python ${avr_folder}/video_game_builder.py \
  --best_of_k 1 \
  --initial_best_of_k 5 \
  ...
```
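
One way to think about best-of-k is as a knockout between candidates using a relative comparison. The sketch below is an assumption for illustration, not a description of how `video_game_builder.py` selects candidates: `prefers_second` is a hypothetical callable that returns `True` when the second candidate is judged better than the first.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_k(candidates: Sequence[T], prefers_second: Callable[[T, T], bool]) -> T:
    """Keep a running champion and replace it whenever a challenger wins."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefers_second(best, challenger):
            best = challenger
    return best
```

This needs k - 1 comparisons for k candidates, so the cost of selection grows linearly with k.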

#### Memory System
Enable the system to remember past improvements:
```bash
python ${avr_folder}/video_game_builder.py \
  --use_memory \
  --memory_len 5 \
  ...
```
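
Conceptually, a memory of length `--memory_len` behaves like a bounded queue of past feedback. The class below is a hypothetical sketch of that idea (the actual format stored in `memory_state.json` is not documented here); it keeps only the most recent entries and renders them as text for the next prompt.

```python
from collections import deque


class FeedbackMemory:
    """Keep the `memory_len` most recent (iteration, feedback) pairs."""

    def __init__(self, memory_len: int = 3) -> None:
        # A deque with maxlen drops the oldest entry automatically.
        self._items = deque(maxlen=memory_len)

    def add(self, iteration: int, feedback: str) -> None:
        self._items.append((iteration, feedback))

    def as_prompt(self) -> str:
        """Render the remembered feedback, oldest first."""
        return "\n".join(f"Iteration {i}: {text}" for i, text in self._items)
```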

#### Search and Replace Mode
Allow targeted code modifications:
```bash
python ${avr_folder}/video_game_builder.py \
  --search_replace \
  ...
```
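
The idea behind search and replace is that the coding agent returns small targeted edits instead of rewriting the whole file. A minimal sketch of applying such edits follows; the `(search, replace)` pair format and the rule that each search text must match exactly once are assumptions, not the repository's actual edit format.

```python
from typing import Iterable, Tuple


def apply_search_replace(code: str, edits: Iterable[Tuple[str, str]]) -> str:
    """Apply each (search, replace) pair, requiring exactly one match."""
    for search, replace in edits:
        matches = code.count(search)
        if matches != 1:
            # Ambiguous or missing targets are rejected rather than guessed.
            raise ValueError(f"expected 1 match for {search!r}, found {matches}")
        code = code.replace(search, replace, 1)
    return code
```

Rejecting ambiguous edits keeps a bad model response from silently corrupting the current content.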

#### Auto-Resume Functionality
Automatically resume interrupted sessions:
```bash
python ${avr_folder}/video_game_builder.py \
  --auto_resume \
  --output_dir ./my_project \
  ...
```
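
Resuming generally means persisting enough state after each iteration to pick up where the run stopped. The sketch below shows that pattern with a `resume_state.json` file; the stored fields (such as `iteration`) are hypothetical, since the real file layout is not described in this README.

```python
import json
import os
from pathlib import Path


def save_state(output_dir: str, state: dict) -> None:
    """Write the state atomically so an interruption cannot leave a half-written file."""
    path = Path(output_dir) / "resume_state.json"
    temporary = path.with_suffix(".json.tmp")
    temporary.write_text(json.dumps(state))
    os.replace(temporary, path)  # atomic rename on the same filesystem


def load_state(output_dir: str) -> dict:
    """Return the saved state, or a fresh one when nothing was saved yet."""
    path = Path(output_dir) / "resume_state.json"
    if not path.exists():
        return {"iteration": 0}
    return json.loads(path.read_text())
```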

## Key Parameters

### Content Generation (`video_game_builder.py`)

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--content_type` | Type of content (video-game, animation, website) | video-game |
| `--content_description` | Description of content to create | Required |
| `--min_iterations` | Minimum number of improvement iterations before the run may stop when the console logs contain no errors | 10 |
| `--max_iterations` | Maximum improvement iterations | 100 |
| `--initial_best_of_k` | Best-of-k for initial generation only | 1 |
| `--use_memory` | Enable memory system | False |
| `--memory_len` | Number of past memories to keep | 3 |
| `--search_replace` | Enable search/replace code modifications | False |
| `--early_exit` | Allow model to exit early if satisfied | False |
| `--auto_resume` | Resume from previous progress | False |
| `--enable_audio` | Enable audio recording and processing | False |
| `--video_duration` | Recording duration in seconds | 20 |
| `--video_fps` | Recording frames per second | 1 |

### Content Evaluation (`evaluate_content.py`)

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--folders` | Folders to evaluate | Required |
| `--folders_paired` | Folders to compare against | None |
| `--relative` | Use relative evaluation | True |
| `--multiround` | Use multiround evaluation | False |
| `--coding_evaluation` | Use coding agent for evaluation review | False |
| `--description_feedback` | Include description feedback | False |

## Output Files

After running the system, you'll find these files in your output directory:

- `final_content.html` - The final generated game/animation
- `final_content.mp4` - Video recording of the final content
- `final_content.wav` - Audio recording (if audio enabled)
- `final_content_console_logs.txt` - Browser console logs
- `initial_content.html` - The initial generated content
- `temp_content_N.html` - Content at each iteration N
- `temp_content_N.mp4` - Video at each iteration N
- `evaluation_results_*.txt` - Evaluation feedback
- `memory_state.json` - Memory state (if memory enabled)
- `resume_state.json` - Resume state (if auto-resume enabled)
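
To inspect how the content evolved, the per-iteration files can be collected in numeric order. This is a small helper sketch (the function name is hypothetical) that relies only on the `temp_content_N.html` naming listed above.

```python
import re
from pathlib import Path
from typing import List, Tuple

ITERATION_FILE = re.compile(r"temp_content_(\d+)\.html")


def iteration_files(output_dir: str) -> List[Tuple[int, Path]]:
    """Return (N, path) pairs for every temp_content_N.html, sorted by N."""
    found = []
    for path in Path(output_dir).iterdir():
        match = ITERATION_FILE.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sort numerically so that iteration 10 comes after iteration 2.
    return sorted(found, key=lambda pair: pair[0])
```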

