# Judgment Operators: Code for Reproducibility

This repository contains code to reproduce the experiments from the paper "Judgment Operators for Multi-Agent Action Spaces".

## Overview

Judgment Operators (JO) provide execution-time governance for multi-agent LLM systems through:
- **Constraint checking**: Verify agent actions against policy specifications
- **Executable repair**: Generate corrective actions when violations are detected
- **Precedent learning**: Accumulate reusable repair knowledge across episodes
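
The three mechanisms above can be pictured as a single check, repair, remember loop. The following is a minimal sketch of that idea, not the repository's implementation: `Decision`, `ToyOperator`, the string-valued states and actions, and the single rule are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Decision:
    outcome: str  # "ALLOW" or "EDIT"
    action: str   # the action that should actually be executed


@dataclass
class ToyOperator:
    """Hypothetical stand-in for a judgment operator (illustration only)."""
    is_violation: Callable[[str, str], bool]  # (state, action) -> violates policy?
    repair: Callable[[str, str], str]         # (state, action) -> corrective action
    precedents: Dict[Tuple[str, str], str] = field(default_factory=dict)
    repair_calls: int = 0

    def project(self, state: str, action: str) -> Decision:
        # Constraint checking: compliant actions pass through unchanged.
        if not self.is_violation(state, action):
            return Decision("ALLOW", action)
        key = (state, action)
        # Precedent learning: reuse a stored repair if this case was seen before.
        if key not in self.precedents:
            # Executable repair: compute a corrective action and remember it.
            self.repair_calls += 1
            self.precedents[key] = self.repair(state, action)
        return Decision("EDIT", self.precedents[key])


op = ToyOperator(
    is_violation=lambda state, action: state == "idle_twice" and action == "STAY",
    repair=lambda state, action: "MOVE",
)
first = op.project("idle_twice", "STAY")   # violation, repaired
second = op.project("idle_twice", "STAY")  # same violation, served from precedents
allowed = op.project("cooking", "STAY")    # no violation
```

In this toy version the second identical violation is answered from the precedent table, so the repair function runs only once.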

## Repository Structure

```
jo_code/
├── README.md
├── requirements.txt
├── overcooked/                  # Overcooked domain (semantic constraints)
│   ├── adapter.py              # Environment wrapper
│   ├── constraints.py          # Constraint definitions (T4, H3)
│   ├── experiment.py           # Experiment runner
│   ├── judgment_operator.py    # Core JO implementation
│   └── llm_agent.py            # LLM agent + baselines
├── src/jo/                      # Protocol/Wikipedia domain
│   ├── judgment_operator.py    # JO for format/protocol constraints
│   ├── protocol_operator.py    # Protocol-specific operator
│   ├── protocol_evaluator.py   # Protocol task evaluation
│   ├── precedent_store.py      # Precedent storage and retrieval
│   └── agents.py               # Multi-agent coordination
├── code_style/                  # Code style domain
│   ├── run_experiment.py
│   └── src/
│       ├── jo_style.py
│       ├── style_checker.py
│       └── ast_equivalence.py
├── configs/
├── artifacts/
│   └── expert_gpt52_precedents.json  # Pre-trained teacher precedents
└── scripts/                     # Reproduction scripts
    ├── run_overcooked.py       # Tables 3, 4, 7
    ├── run_protocol.py         # Tables 3, 4
    ├── run_wikipedia.py        # Tables 3, 5
    └── run_transfer.py         # Table 6
```

## Setup

### 1. Install dependencies

```bash
pip install -r requirements.txt
playwright install chromium  # For WebArena
```

### 2. Set API keys

```bash
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"      # For heterogeneous experiments
export TOGETHER_API_KEY="your-key"       # For Llama/Qwen models
export MOONSHOT_API_KEY="your-key"       # For Kimi model
```
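
Before launching a long run, it can help to confirm which keys are visible to Python. The helper below is a hypothetical preflight check, not part of the repository; the purpose strings mirror the comments in the export block above, and the purpose given for `OPENAI_API_KEY` is an assumption.

```python
import os
from typing import Dict, List, Mapping

# Key names taken from the export block above; descriptions are informational.
API_KEYS: Dict[str, str] = {
    "OPENAI_API_KEY": "OpenAI models (assumed purpose)",
    "ANTHROPIC_API_KEY": "heterogeneous experiments",
    "TOGETHER_API_KEY": "Llama/Qwen models",
    "MOONSHOT_API_KEY": "Kimi model",
}


def missing_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of keys that are unset or empty in `env`."""
    return [name for name in API_KEYS if not env.get(name)]


if __name__ == "__main__":
    for name in missing_keys():
        print(f"{name} is not set (needed for: {API_KEYS[name]})")
```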

### 3. Install Overcooked

```bash
pip install overcooked-ai
```

### 4. Set up WebArena-Wikipedia (Docker)

The Wikipedia experiments require a local Wikipedia mirror running on `http://localhost:9999/`.

For full setup instructions, see: https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md

Quick start:
```bash
# Pull and run the Wikipedia Docker container
docker run -d -p 9999:80 --name wikipedia \
    ghcr.io/web-arena-x/wikipedia-offline:latest

# Verify it's running
curl http://localhost:9999/wikipedia_en_all_maxi_2022-05/A/Main_Page
```

The container serves an offline Wikipedia snapshot. Tasks navigate this mirror via Playwright browser automation.
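
The same reachability check can be done from Python before starting an experiment. This is a small sketch using only the standard library; `mirror_is_up` is a hypothetical helper, not a function from this repository, and the default URL is the address stated above.

```python
import urllib.error
import urllib.request

MIRROR_URL = "http://localhost:9999/"  # address of the local mirror given above


def mirror_is_up(url: str = MIRROR_URL, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


if __name__ == "__main__":
    print("Wikipedia mirror reachable:", mirror_is_up())
```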

## Reproducing Paper Results

### Table 3: Constraint Enforcement Across Domains

```bash
# Overcooked (semantic constraints)
python scripts/run_overcooked.py --experiment table3

# Protocol (JSON schema constraints)
python scripts/run_protocol.py --experiment table3

# Wikipedia (format constraints)
python scripts/run_wikipedia.py --experiment table3
```

### Table 4: Composition Invariance

```bash
# Overcooked: Agent count scaling (K1 → K2)
python scripts/run_overcooked.py --experiment table4

# Protocol: Model heterogeneity (H1 → H3)
python scripts/run_protocol.py --experiment table4
```

### Table 5: Recurrent Pattern Learning

```bash
python scripts/run_wikipedia.py --experiment table5
```

### Table 6: Cross-Model Transfer

```bash
python scripts/run_transfer.py
```

This script tests zero-shot transfer of precedents from the GPT-5.2 teacher to five student model families.

### Table 7: External/Internal Baselines

```bash
python scripts/run_overcooked.py --experiment table7
```

This experiment compares JO against the following external and internal baselines on Overcooked with the T4 and H3 constraints:
- Reflexion, CRITIC, LlamaGuard (self-correction methods)
- RBR (rule-based repair)
- Centralized Prompt (constraints embedded in agent prompt)

### Code Style Experiment

```bash
cd code_style && python run_experiment.py
```

## Key Components

### JudgmentOperator

The core governance mechanism:

```python
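# OperatorOutcome is used below; importing it from the same module is an
# assumption, so adjust the path if it is defined elsewhere.
from overcooked.judgment_operator import JudgmentOperator, NoOperator, OperatorOutcome
from overcooked.constraints import TaskSpec

# Define constraints
task_spec = TaskSpec(
    max_consecutive_stays=1,      # T4: No idle loops
    enforce_plate_timing=True,    # H3: Timing constraint
)

# Create operator
jo = JudgmentOperator(task_spec, mode="dynamic")

# Project agent action through operator.
# current_state, proposed_action, and execute() are placeholders supplied
# by the surrounding experiment loop.
decision = jo.project(
    state=current_state,
    action=proposed_action,
    agent_id=0,
)

if decision.outcome == OperatorOutcome.ALLOW:
    execute(proposed_action)            # action satisfies all constraints
elif decision.outcome == OperatorOutcome.EDIT:
    execute(decision.repaired_action)   # run the operator's repair instead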

### Conditions

| Condition | Description |
|-----------|-------------|
| `NO` | No Operator (baseline) |
| `JO_static` | Fixed constraint rules |
| `JO_dynamic` | Full JO with precedent learning |

### Constraints

**Overcooked:**
- `T4`: Max consecutive STAY actions
- `H3`: Plate timing (no early pickup)

**Protocol:**
- Schema validation
- Quote verification
- Role separation

**Wikipedia:**
- Enumeration format
- Citation provenance
- Verbatim quotes
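
To make one of these concrete, the Overcooked `T4` constraint listed above can be read as a bound on the run of `STAY` actions that a new action would extend. The sketch below is one plausible reading, not the code in `overcooked/constraints.py`; the `STAY` string label and both function names are hypothetical.

```python
from typing import Sequence

STAY = "STAY"  # hypothetical action label


def trailing_stays(history: Sequence[str]) -> int:
    """Count how many STAY actions sit at the end of `history`."""
    count = 0
    for action in reversed(history):
        if action != STAY:
            break
        count += 1
    return count


def violates_t4(
    history: Sequence[str],
    proposed: str,
    max_consecutive_stays: int = 1,
) -> bool:
    """True if executing `proposed` would exceed the allowed run of STAY actions."""
    if proposed != STAY:
        return False
    return trailing_stays(history) + 1 > max_consecutive_stays


# With the limit set to 1, a second STAY in a row is a violation.
second_stay = violates_t4(["UP", "STAY"], "STAY")
# A STAY after a non-STAY action is allowed.
first_stay = violates_t4(["STAY", "UP"], "STAY")
```

Only the trailing run matters, so earlier idle periods that were already interrupted by another action do not count against the agent.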

## Experiment Configuration

Default settings match the paper (Appendix F):

| Domain | N | Seeds | Horizon/Steps |
|--------|---|-------|---------------|
| Overcooked | 200 | 10 | 80 steps |
| Protocol | 150 | 5 | 10 steps |
| Wikipedia | 200 | 2 | 10 steps |

Override via command line:
```bash
python scripts/run_overcooked.py --seeds 0,1,2 --experiment table3
```
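
For readers extending the scripts, a comma-separated `--seeds` value like the one above can be parsed with `argparse` as sketched here. This is an illustration of the pattern, not the parser used in `scripts/run_overcooked.py`; `parse_seeds`, `build_parser`, and the default experiment name are assumptions, while the experiment choices are the ones this README lists for Overcooked.

```python
import argparse
from typing import List


def parse_seeds(text: str) -> List[int]:
    """Turn a string such as "0,1,2" into [0, 1, 2]."""
    try:
        return [int(part) for part in text.split(",") if part.strip()]
    except ValueError as exc:
        raise argparse.ArgumentTypeError(f"invalid seed list: {text!r}") from exc


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical experiment CLI")
    parser.add_argument(
        "--experiment",
        choices=["table3", "table4", "table7"],
        default="table3",  # assumed default
    )
    parser.add_argument(
        "--seeds",
        type=parse_seeds,
        default=None,  # None means "use the defaults from the table above"
    )
    return parser


args = build_parser().parse_args(["--seeds", "0,1,2", "--experiment", "table3"])
```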

## Cross-Model Transfer

The transfer experiment loads pre-trained precedents from the GPT-5.2 teacher:
```python
jo = JudgmentOperator(task_spec, mode="dynamic")
jo.load_precedents("artifacts/expert_gpt52_precedents.json")
# Student model benefits from teacher's repair knowledge
```
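
Conceptually, transfer works because precedents are plain data that can be written by one run and read by another. The round trip below illustrates that idea only: the schema of `artifacts/expert_gpt52_precedents.json` is not documented here, so the flat key-to-repair mapping, the key format, and both helper functions are invented for this sketch.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def save_precedents(precedents: Dict[str, str], path: Path) -> None:
    """Write a (hypothetical) violation-key -> repair mapping as JSON."""
    path.write_text(json.dumps(precedents, indent=2, sort_keys=True), encoding="utf-8")


def load_precedents(path: Path) -> Dict[str, str]:
    """Read the mapping back; a student run would consult it before repairing."""
    return json.loads(path.read_text(encoding="utf-8"))


# Invented example entry: constraint id and offending action map to a repair.
teacher_precedents = {"T4|STAY": "MOVE"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "precedents.json"
    save_precedents(teacher_precedents, path)    # teacher side
    student_precedents = load_precedents(path)   # student side
```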

## License

MIT License

## Citation

```bibtex
@inproceedings{judgment-operators-2026,
  title={Judgment Operators for Multi-Agent Action Spaces},
  author={Anonymous},
  booktitle={ICML},
  year={2026}
}
```
