# README

#### Running the experiments
The primary experiment script is `experiments/run_llm_benchmarks.py`. To run it with an OpenAI GPT model (for example `gpt-4.1-mini`), run the following from the `experiments` directory:

```
python3 run_llm_benchmarks.py --model "gpt-4.1-mini" --openai  --simplify_method llm --timeout 1 --num_workers 4
```

To run with a Qwen model served by vLLM, pass the server port with `--port` instead of `--openai`:

```
python3 run_llm_benchmarks.py --model "Qwen/Qwen3-32B" --port 11433  --simplify_method llm --timeout 1 --num_workers 3
```

To run without the agent, simply add the `--no-agent` flag to the script.
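
As a rough sketch of how the agent and no-agent runs could be launched side by side, the helper below assembles the command line from the flags shown in this README. The function `build_command` and its parameter names are hypothetical and not part of the repository; only the flags themselves come from the commands above.

```python
import shlex


def build_command(model, *, openai=False, port=None, no_agent=False,
                  simplify_method="llm", timeout=1, num_workers=4):
    """Assemble the argv for run_llm_benchmarks.py (hypothetical helper).

    Only flags that appear in this README are used.
    """
    cmd = ["python3", "run_llm_benchmarks.py", "--model", model]
    if openai:
        cmd.append("--openai")
    if port is not None:
        cmd += ["--port", str(port)]
    cmd += [
        "--simplify_method", simplify_method,
        "--timeout", str(timeout),
        "--num_workers", str(num_workers),
    ]
    if no_agent:
        cmd.append("--no-agent")
    return cmd


# Print one command with the agent and one without, for the same model.
for no_agent in (False, True):
    print(shlex.join(build_command("gpt-4.1-mini", openai=True, no_agent=no_agent)))
```

The returned list can be passed to `subprocess.run(cmd, cwd="experiments")`, assuming the script is meant to be started from the `experiments` directory.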

The Qwen experiments were run using vLLM. Before launching the benchmark, start the server in a separate terminal, using the same port that is passed to `--port`:
```
VLLM_USE_V1=0 vllm serve "Qwen/Qwen3-32B" --tensor-parallel-size=4 --port 11433 --max_model_len=20000 --tool-call-parser hermes
```
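
Loading a 32B model can take a while, so it may help to wait until the server answers before starting the benchmark. The snippet below is a minimal sketch of such a check, not part of the repository: `wait_for_server` is a hypothetical helper, and it assumes the vLLM server exposes the OpenAI-compatible `/v1/models` endpoint on the port given above.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=600.0, interval=5.0):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds have passed.

    Returns True if the server answered in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not reachable yet (e.g. connection refused); retry.
            pass
        time.sleep(interval)
    return False
```

For example, `wait_for_server("http://localhost:11433/v1/models")` would block until the server started above is reachable, or return `False` after ten minutes.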

#### Recreating the tables
Data from previous runs is in the `experiments` directory and can be analyzed with `analyze_data.ipynb`. The notebook cells can be run as-is to recreate the tables and figures.