# WorldTest Protocol
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

An implementation of AutumnBench in the WorldTest Protocol. The protocol is designed as an interface for testing world-model learning agents in a variety of environments.

## Table of Contents
- Installation
- Running AutumnBench's agent and environment

## Installation
### Requirements
- Python >= 3.12
- `protoc` >= 3.0  
- macOS/Linux (Windows instructions TBD)

To set up these dependencies and generate the corresponding Python stubs and library for implementing environments and agents, run the following script:

```bash
sh scripts/generate_python.sh     # generates gRPC stubs into ./generated
```
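Before running the script, it can help to confirm that the requirements listed above are met. The following is a minimal sketch, not part of the repository: `parse_protoc_version` and `check_requirements` are hypothetical helper names, and the check assumes `protoc --version` prints a line such as `libprotoc 3.21.12`.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 12)  # from the Requirements list
MIN_PROTOC = (3, 0)   # from the Requirements list


def parse_protoc_version(output: str) -> tuple[int, ...]:
    """Extract the numeric version from `protoc --version` output."""
    match = re.search(r"(\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError(f"could not find a version number in {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def check_requirements() -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.12 is required")
    protoc = shutil.which("protoc")
    if protoc is None:
        problems.append("protoc was not found on PATH")
        return problems
    result = subprocess.run(
        [protoc, "--version"], capture_output=True, text=True, check=False
    )
    try:
        # Pad with a zero so a bare major version still compares correctly.
        version = (parse_protoc_version(result.stdout) + (0,))[:2]
        if version < MIN_PROTOC:
            problems.append("protoc >= 3.0 is required")
    except ValueError as exc:
        problems.append(str(exc))
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(f"[requirements] {problem}")
```

Running the file prints one line per unmet requirement and nothing when everything is in place.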

### Obtaining the interpreter
```sh
python3 -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt
pip install -r autumn.wasm/requirements.txt

# Install cmake (run only the command that matches your platform)
brew install cmake       # macOS (Homebrew)
sudo apt install cmake   # Debian/Ubuntu

cd autumn.wasm && mkdir -p build && cd build
cmake ..
make -j12
cd ../../
cp autumn.wasm/build/*.so python_examples/autumnbench/
```


### Setting up environment variables
Next, create a `.env` file based on `.env_sample`. This file is mostly used to provide credentials for LLM agents.
The supported LLM agents currently include Ollama, OpenAI, Claude, MLX, and Gemini; more may be added as needed. We also support OpenRouter, which makes it easy to switch between different LLM providers.

You can also set the API key directly, for example:
```bash
export OPENROUTER_API_KEY="YOUR_API_KEY_HERE"
```

For a quick start, the table below links to each provider's documentation.

| **Provider**       | **Docs / Quick-start**                                                                         |
| ------------------ | ---------------------------------------------------------------------------------------------- |
| Ollama             | [REST API reference](https://ollama.readthedocs.io/en/api/)                                    |
| OpenAI             | [API key setup](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety) |
| Claude (Anthropic) | [Anthropic Developer Docs](https://docs.anthropic.com/)                                        |
| Apple MLX          | [MLX GitHub](https://github.com/ml-explore/mlx)                                                |
| Google Gemini      | [Gemini API docs](https://ai.google.dev/gemini-api/docs)                                       |
| OpenRouter         | [OpenRouter Quickstart](https://openrouter.ai/docs/quickstart)                                 |

Drop these variables into your local `.env`; the Protocol tooling will load them automatically at runtime.
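To check that a credential is visible before starting a run, a small script can look in both the process environment and the `.env` file. The sketch below is illustrative only and is not the loader the Protocol tooling uses: `parse_env_text` and `missing_keys` are hypothetical helpers, and the parser assumes simple `KEY=VALUE` lines. `OPENROUTER_API_KEY` is the only variable name taken from this README.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(required: list[str], env_text: str = "") -> list[str]:
    """Return names set neither in the process environment nor in env_text."""
    from_file = parse_env_text(env_text)
    return [
        name
        for name in required
        if not os.environ.get(name) and not from_file.get(name)
    ]


if __name__ == "__main__":
    env_file = Path(".env")
    text = env_file.read_text() if env_file.exists() else ""
    print("missing:", missing_keys(["OPENROUTER_API_KEY"], text))
```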

## Running AutumnBench's agent and environment
The [AutumnBench environments](./python_examples/autumnbench/) implement three task types: Masked Frame Prediction (MFP), Change Detection (CD), and Planning. Three types of agents are provided: a random agent, an LLM-based agent, and an agent built with AutumnSynth in mind.

### Inspecting dataset
We provide the downloaded version of our dataset at `python_examples/autumnbench/example_benchmark/`. The dataset contains the tasks, programs, and prompts needed for the benchmark.

### AutumnBench Baselines
We also provide an example benchmark for Autumn that consists of a single program. It is stored in [Example Benchmark](./python_examples/autumnbench/example_benchmark/).

Once this is done, you can run any of the agents listed below. Note that the protobuf definitions were originally intended to provide a language-agnostic gRPC interface; for simplicity, we use them locally instead.


### Running Agents
```bash
python -m python_examples.autumnbench.run_no_server +experiment=debug data_dir=$(pwd)/python_examples/autumnbench/example_benchmark
```

More configurable parameters can be found in `python_examples/autumnbench/conf/config.yaml`. You can either add a new experiment config in `conf/experiments/` or override individual parameters at runtime. For example, to change the render mode, run the following command:

```bash
python -m python_examples.autumnbench.run_no_server +experiment=debug data_dir=$(pwd)/python_examples/autumnbench/example_benchmark render_mode=image
```
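When launching several runs from a script, the same command can be assembled programmatically. The sketch below only mirrors the commands shown above: `build_command` is a hypothetical helper, not part of the repository, and it assumes it is run from the repository root so that the module path resolves.

```python
import subprocess
import sys
from pathlib import Path

MODULE = "python_examples.autumnbench.run_no_server"


def build_command(
    data_dir: Path, experiment: str = "debug", **overrides: str
) -> list[str]:
    """Assemble the launcher command, appending key=value overrides in order."""
    command = [
        sys.executable,
        "-m",
        MODULE,
        f"+experiment={experiment}",
        f"data_dir={data_dir}",
    ]
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


if __name__ == "__main__":
    data_dir = Path.cwd() / "python_examples" / "autumnbench" / "example_benchmark"
    command = build_command(data_dir, render_mode="image")
    print(" ".join(command))
    # Uncomment to launch the run from the repository root:
    # subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues when `data_dir` contains spaces.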

If you would like to run with another model (say Claude 4 Opus) on OpenRouter, you can do the following:

```bash
# Replace <model_key> with the model parameter name defined in conf/config.yaml
python -m python_examples.autumnbench.run_no_server +experiment=debug data_dir=$(pwd)/python_examples/autumnbench/example_benchmark <model_key>="anthropic/claude-opus-4"
```

You can also configure which environments to run on by changing the list in the `envs` parameter. The supported values for the `task_name` parameter are `mfp`, `cd`, and `planning` (Masked Frame Prediction, Change Detection, and Planning, respectively). You can also specify `all` to run every task type.
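If you generate `task_name` values from a script, validating them up front avoids starting a run with a misspelled task. The following is a small sketch under that assumption: `resolve_task_names` is a hypothetical helper and does not reflect how the runner itself handles `all`.

```python
# Task names and their meanings, as listed in this README.
TASK_NAMES = {
    "mfp": "Masked Frame Prediction",
    "cd": "Change Detection",
    "planning": "Planning",
}


def resolve_task_names(task_name: str) -> list[str]:
    """Expand 'all' to every task name; otherwise validate a single name."""
    if task_name == "all":
        return list(TASK_NAMES)
    if task_name not in TASK_NAMES:
        valid = ", ".join([*TASK_NAMES, "all"])
        raise ValueError(
            f"unknown task_name {task_name!r}; expected one of: {valid}"
        )
    return [task_name]


if __name__ == "__main__":
    for name in resolve_task_names("all"):
        print(f"task_name={name}  # {TASK_NAMES[name]}")
```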

The main agent is the `UnifiedReactAgent` defined in [`llm_agent.py`](./python_examples/autumnbench/llm_agent.py), with some of the prompts defined in [`prompts.py`](./python_examples/autumnbench/prompts.py). The task types themselves are defined in [`concrete_envs.py`](./python_examples/autumnbench/concrete_envs.py). To add a new environment, add it to the `AutumnBenchmark` repo directly.

We currently provide the following three agents:
- `autumn_llm_unified_interactive_agent_v1`: LLM-based agent
- `autumn_random_interactive_agent_v1`: Random agent
- `autumn_simple_wm_agent`: Oracle AutumnSynth agent

You can select the default agent in [`config.yaml`](./python_examples/autumnbench/conf/config.yaml).

Results are written to the `./experiments` folder by default. You can change this by specifying the `output_dir` parameter in the config file or at runtime.
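To see what a run produced, you can list the files under the output folder, newest first. This is a generic sketch that makes no assumption about the file names or layout inside the folder; `list_result_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def list_result_files(output_dir: Path) -> list[Path]:
    """Return every file under output_dir, most recently modified first."""
    if not output_dir.is_dir():
        return []
    files = [path for path in output_dir.rglob("*") if path.is_file()]
    return sorted(files, key=lambda path: path.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Default output location named in this README; adjust if output_dir is overridden.
    for path in list_result_files(Path("./experiments")):
        print(path)
```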
