# meta-rg-s2b

Minimal standalone reproduction of **Section 4.4 / Table 4** from:

> *Meta-Referential Games to Learn Compositional Learning Behaviours*

The experiment tests whether LLMs (Mixtral-8x7B-Instruct, GPT-4o-mini) act as
capable **listeners** in Meta-Referential Games, paired with a fixed rule-based
Positionally-Disentangled Speaker.  Both models perform **below 50% chance-level**,
establishing S2B as an open challenge.

---

## What this repo does

| Script | Purpose |
|---|---|
| `run_eval.py` | Zero-shot LLM listener evaluation → reproduces Table 4 |
| `run_grpo.py` | GRPO + LoRA fine-tuning (RLVR future work) |

---

## Dependencies

**Required** (already installed in the project venv):

> **Install the bundled S2B environment first** — this repo ships the S2B-LM
> extension under `SymbolicBehaviourBenchmark/`.  Install it in editable mode
> before anything else:
> ```bash
> pip install -e SymbolicBehaviourBenchmark/
> ```

- `torch`, `transformers`, `peft`, `accelerate`, `gym`, `numpy`, `tqdm`, `pyyaml`, `wandb`

**Optional** (install the backend you need):
```bash
pip install openai                # OpenAI API  (GPT-4o-mini etc.)
pip install vllm                  # vLLM local inference
pip install llama-cpp-python      # llama.cpp CPU inference
pip install bitsandbytes          # 4-bit quantisation for HF models
```

Activate the project venv:
```bash
source ../p311TheRockLM_venv/bin/activate
```

---

## Quick start

### Reproduce Table 4 — GPT-4o-mini
```bash
export OPENAI_API_KEY=sk-...
python run_eval.py --config configs/eval/gpt4o_mini.yaml --table4 \
    --n_seeds 5 --n_episodes 64
```

### Reproduce Table 4 — Mixtral via vLLM (local)
```bash
python run_eval.py --config configs/eval/mixtral_vllm.yaml --table4 \
    --n_seeds 5 --n_episodes 64
```

### Single condition (quick test)
```bash
python run_eval.py --config configs/eval/smollm_llamacpp.yaml \
    --o 1 --shots 1 --n_seeds 1 --n_episodes 4
```

### Expected results (Table 4)

|  | O=1, S=1 | O=1, S=2 | O=4, S=1 | O=4, S=2 |
|---|---|---|---|---|
| Mixtral | 45.6 ± 10.5 | 49.3 ± 13.7 | 48.6 ± 17.6 | 49.9 ± 7.9 |
| GPT-4o-mini | 33.1 ± 11.4 | 36.8 ± 12.7 | 39.8 ± 11.6 | 42.9 ± 3.8 |

Chance level = 50%.  Both models perform below chance, showing the benchmark
is a genuine challenge even for frontier LLMs.

---

## Backends

| Config | Backend | Notes |
|---|---|---|
| `configs/eval/gpt4o_mini.yaml` | `openai` | OpenAI API |
| `configs/eval/mixtral_hf.yaml` | `hf` | HuggingFace Transformers, 4-bit |
| `configs/eval/mixtral_vllm.yaml` | `vllm` | vLLM offline or server mode |
| `configs/eval/smollm_llamacpp.yaml` | `llamacpp` | llama.cpp, CPU-friendly |

To use a **running vLLM server** (e.g., `vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1`),
set `mode: server` and `base_url: http://localhost:8000/v1` in the config.

### Adding a new backend
1. Create `meta_rg/backends/my_backend.py` subclassing `BaseBackend`
2. Implement `generate(self, prompt_text: str) -> str`
3. Register in `meta_rg/backends/__init__.py`

---

## GRPO / RLVR (future work)

Fine-tune a small LLM as the listener using Group Relative Policy Optimisation + LoRA:

```bash
python run_grpo.py --config configs/grpo/smollm.yaml
```

Evaluate a saved checkpoint:
```bash
python run_grpo.py --config configs/grpo/smollm.yaml \
    --eval_only --checkpoint outputs/grpo/smollm_listener/step_500
```

---

## Scripts

All runnable shell scripts live in `scripts/`. They are wrappers around `run_eval.py` and `run_grpo.py` with pre-set defaults; any `--key value` flag accepted by the underlying Python entry point can be passed as an override.

### Batch evaluation — all prover models (categorical domain)

```bash
bash scripts/eval_all_prover_models_categorical_o1-s1-PV16-L512-Seed1-Ep4-FewShotDiscCot-N10-InductiveVerb.sh
```

Runs every prover model config in sequence (`DeepSeek-Prover V1/V1.5/V2`, `Goedel-Prover SFT/DPO/V2`, `Kimina-Prover`) on the categorical domain with `few_shot_discussion_cot` + inductive verbaliser.  Uses the `p311EvoX2_venv` virtual environment.

### Single-model categorical evaluation scripts

Each `eval_<model>_categorical_*.sh` script targets one model with fixed defaults.  They all accept `--key value` overrides, e.g.:

```bash
# Run with defaults
bash scripts/eval_deepseek_prover_v2_hf_api_categorical_o1-s2-PV16-L512-Seed3-Ep4.sh

# Override n_episodes and prompt strategy
bash scripts/eval_goedel_prover_v2_8b_hf_categorical_o1-s2-PV16-L512-Seed3-Ep4.sh \
    --n_episodes 16 --prompt_strategy zero_shot_cot
```

**Filename encoding** — the suffix in each script name encodes its default settings:

| Token | Meaning |
|---|---|
| `o1` / `o4` | `--o` (number of objects) |
| `s1` / `s2` | `--shots` (few-shot examples) |
| `PV16` | `--vocab_size 16` |
| `L512` | `--max_new_tokens 512` |
| `Seed3` | `--n_seeds 3` |
| `Ep4` | `--n_episodes 4` |
| `FewShotDiscCot` | `--prompt_strategy few_shot_discussion_cot` |
| `N10` | `--n_few_shot_games 10` |
| `InductiveVerb` | `--inductive_verbaliser` flag |

Available single-model scripts:

| Script | Model | Domain | Notes |
|---|---|---|---|
| `eval_deepseek_prover_v2_hf_api_categorical_o1-s2-PV16-L512-Seed3-Ep4.sh` | DeepSeek-Prover-V2 (HF API) | categorical | `discussion_cot` default |
| `eval_deepseek_prover_v2_7b_local_hf_scs_o1-s1-PV16-L512-Seed1-Ep4-FewShotDiscCot-N10-InductiveVerb.sh` | DeepSeek-Prover-V2-7B (local HF) | SCS | `few_shot_discussion_cot` + inductive verbaliser |
| `eval_goedel_prover_sft_hf_categorical_o1-s2-PV16-L512-Seed3-Ep4.sh` | Goedel-Prover-SFT (HF) | categorical | |
| `eval_goedel_prover_sft_llamacpp_categorical_o1-s2-PV16-L512-Seed3-Ep4.sh` | Goedel-Prover-SFT (llama.cpp) | categorical | CPU-friendly |
| `eval_goedel_prover_v2_32b_hf_categorical_o1-s2-PV16-L512-Seed3-Ep4.sh` | Goedel-Prover-V2-32B (HF API) | categorical | |
| `eval_goedel_prover_v2_8b_hf_categorical_o1-s2-PV16-L512-Seed1-Ep2-FewShotDiscCot-N10.sh` | Goedel-Prover-V2-8B (HF) | categorical | quick 2-episode run |
| `eval_goedel_prover_v2_8b_hf_categorical_o1-s2-PV16-L512-Seed3-Ep4.sh` | Goedel-Prover-V2-8B (HF) | categorical | |
| `eval_kimi_k2_hf_api_categorical_o1-s2-PV16-L512-Seed3-Ep4.sh` | Kimi-K2 (HF API) | categorical | |
| `eval_llama31_8b_hf_api_categorical_o1-s2-PV16-L512-Seed1-Ep4.sh` | Llama-3.1-8B (HF API) | categorical | |

### GRPO training

```bash
bash scripts/train_grpo_smollm.sh
```

Fine-tunes SmolLM2-1.7B-Instruct as a listener via GRPO + LoRA (see `configs/grpo/smollm.yaml`).  To evaluate a saved checkpoint afterwards:

```bash
python run_grpo.py --config configs/grpo/smollm.yaml \
    --eval_only --checkpoint outputs/grpo/smollm_listener/step_500
```

---

## Repository layout

```
s2b_lm/
├── run_eval.py                  ← main evaluation entry point
├── run_grpo.py                  ← GRPO training entry point
├── configs/
│   ├── base.yaml                ← shared S2B defaults
│   ├── eval/                    ← one yaml per LLM backend
│   └── grpo/                    ← GRPO training configs
├── meta_rg/
│   ├── s2b_import.py            ← pybullet mock + S2B gym registration
│   ├── env_utils.py             ← env factory, action helpers
│   ├── game_loop.py             ← single-game and episode-level loops
│   ├── metrics.py               ← ZSCT accuracy aggregation
│   ├── agents/
│   │   └── rule_based.py        ← Posdis-Speaker wrapper
│   ├── backends/
│   │   ├── openai_backend.py
│   │   ├── hf_backend.py
│   │   ├── vllm_backend.py
│   │   └── llamacpp_backend.py
│   └── training/
│       └── grpo_trainer.py      ← GRPOTrainer class
├── scripts/                     ← convenience shell scripts (see Scripts section)
├── tests/                       ← unit tests + experimental scripts
└── SymbolicBehaviourBenchmark/  ← vendored S2B environment library
```

---

## Citation

```bibtex
@article{anonymous2024meta,
  title={Meta-Referential Games to Learn Compositional Learning Behaviours},
  author={Anonymous},
  year={2024}
}
```

The S2B environment:
```bibtex
@software{SymbolicBehaviourBenchmark,
  author = {Anonymous},
  title  = {Symbolic Behaviour Benchmark},
}
```
