# CausalGame Benchmark

A benchmark for evaluating the causal reasoning capabilities of LLM-based agents through interactive decision-making scenarios.

## Overview

CausalGame presents agents with black-box optimization tasks where success requires discovering hidden causal structures rather than exploiting spurious correlations. The benchmark includes multiple scenario families designed to test different aspects of causal reasoning:

- **Antenna Trap**: A functional antenna emits a signal that causes detection, so the intuitive solution fails
- **Deployment Zone Trap**: Inspired by Farr's Cholera Paradox; a hidden confounder (EMI), not altitude, is the true cause
- **Weather Noise**: Observation noise depends on the environment
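
To illustrate the kind of spurious correlation these scenarios rely on, here is a minimal simulation of a hidden confounder. It is not the benchmark's actual SCM: the variable names (`emi`, `low_altitude`, `lost`) and all probabilities are made up for illustration. Loss depends only on EMI, yet altitude looks harmful until the data is stratified by EMI.

```python
import random

def simulate(n=40_000, seed=0):
    """Return (emi, low_altitude, lost) boolean triples from a toy model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        emi = rng.random() < 0.5                             # hidden confounder (assumed)
        low_altitude = rng.random() < (0.8 if emi else 0.2)  # EMI makes low altitude more likely
        lost = rng.random() < (0.6 if emi else 0.1)          # loss depends on EMI only
        rows.append((emi, low_altitude, lost))
    return rows

def loss_rate(rows):
    """Fraction of rows in which the unit was lost."""
    return sum(lost for _, _, lost in rows) / len(rows)

def altitude_gap(rows):
    """Loss rate at low altitude minus loss rate at high altitude."""
    low = [r for r in rows if r[1]]
    high = [r for r in rows if not r[1]]
    return loss_rate(low) - loss_rate(high)

rows = simulate()
naive_gap = altitude_gap(rows)                                  # ignores EMI
gap_with_emi = altitude_gap([r for r in rows if r[0]])          # EMI present
gap_without_emi = altitude_gap([r for r in rows if not r[0]])   # EMI absent

print(f"naive altitude gap:        {naive_gap:.3f}")
print(f"gap within EMI stratum:    {gap_with_emi:.3f}")
print(f"gap within no-EMI stratum: {gap_without_emi:.3f}")
```

The naive gap is large because low altitude is a proxy for EMI; within each EMI stratum the gap is close to zero.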

## Project Structure

```
├── api/                    # Backend simulation engine (FastAPI)
│   ├── modules/
│   │   ├── agent/          # Session management, action space
│   │   ├── environment/    # SCM registry and implementations
│   │   └── game/           # Judgment logic, combat simulation
│   └── middleware/         # DroneSheet (single source of truth)
├── agent/                  # LLM agent implementations
│   ├── handlers/           # Provider-specific handlers (OpenAI, Anthropic, etc.)
│   ├── tools/              # Tool definitions for hybrid mode
│   └── sandbox/            # Analysis execution sandbox
├── experiments/            # Scenario configurations and prompts
├── experiment_configs/     # Batch experiment configurations
├── config/                 # Runtime configuration files
├── run_agent.py            # Single agent runner
└── run_experiments.py      # Batch experiment runner
```

## Requirements

- Python 3.11+
- API keys for LLM providers (set via environment variables):
  - `OPENAI_API_KEY`
  - `ANTHROPIC_API_KEY`
  - `GEMINI_API_KEY`
  - `XAI_API_KEY` (for Grok)
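
A quick way to see which of these variables are unset before a run is sketched below. The helper `missing_keys` is hypothetical and not part of the repository; you likely only need the key for the provider you actually use.

```python
import os

# Variable names come from the list above; the provider labels are informal.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Gemini": "GEMINI_API_KEY",
    "Grok": "XAI_API_KEY",
}

def missing_keys(env=None):
    """Return the key names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS.values() if not env.get(name)]

print("Missing API keys:", missing_keys())
```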

## Installation

```bash
pip install -r agent_requirements.txt
```

## Usage

### Running a Single Agent

```bash
# Start the backend server
uvicorn api.app:app --port 8000

# Run an agent (in another terminal)
python run_agent.py --model gpt-4o --experiment antenna_trap --mode hybrid
```
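
If you prefer to launch the agent from Python, the two steps above can be sketched as follows. The helpers `backend_is_up`, `agent_command`, and `run_agent` are hypothetical wrappers, not part of the repository. The only flags used are the ones shown in the command above, and the TCP check only confirms that something is listening on the port, not that it is the CausalGame backend.

```python
import socket
import subprocess
import sys

def backend_is_up(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def agent_command(model, experiment, mode="hybrid"):
    """Build the run_agent.py invocation as an argument list."""
    return [
        sys.executable, "run_agent.py",
        "--model", model,
        "--experiment", experiment,
        "--mode", mode,
    ]

def run_agent(model, experiment, mode="hybrid"):
    """Run one agent, refusing to start if the backend port is closed."""
    if not backend_is_up():
        raise RuntimeError("backend not reachable on port 8000; start uvicorn first")
    return subprocess.run(agent_command(model, experiment, mode), check=True)

# Example (requires the backend to be running):
# run_agent("gpt-4o", "antenna_trap")
```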

### Running Batch Experiments

```bash
python run_experiments.py --config experiment_configs/default.json
```

## Execution Modes

| Mode | Description |
|------|-------------|
| **Hybrid (Agentic)** | Structured tool calling with ReAct pattern enforcement |
| **Legacy (Prompting)** | Python code execution with injected client object |

## Scenarios

| Experiment | Victory Threshold | Challenge |
|------------|-------------------|-----------|
| antenna_trap | 75% | Discover that a functional antenna causes detection |
| deployment_zone_trap_categorical | 75% | Identify hidden EMI, not altitude, as the true cause |
| weather_noise | 55% | Handle environment-dependent observation noise |
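
As a rough sketch of how the thresholds in this table might be applied, the snippet below compares an observed success rate against the per-experiment threshold. It assumes the threshold is a minimum success rate and that the comparison is inclusive; both are assumptions, and `meets_threshold` is a hypothetical helper rather than the benchmark's judgment logic.

```python
# Thresholds copied from the table above, expressed as fractions.
VICTORY_THRESHOLDS = {
    "antenna_trap": 0.75,
    "deployment_zone_trap_categorical": 0.75,
    "weather_noise": 0.55,
}

def meets_threshold(experiment, wins, trials):
    """Return True if wins/trials reaches the experiment's threshold (assumed inclusive)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return wins / trials >= VICTORY_THRESHOLDS[experiment]

print(meets_threshold("antenna_trap", 16, 20))
print(meets_threshold("weather_noise", 10, 20))
```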

Each base scenario has variants that test specific causal reasoning challenges:

- `*_no_history`: No historical data (pure exploration)
- `*_no_selection_bias`: Control condition
- `*_simpsons_paradox`: Aggregate trends reverse in subgroups
- `*_local_optima`: Gradient traps
- `*_high_def`: Additional confounding pressure
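
For readers unfamiliar with Simpson's paradox, the reversal named in the `*_simpsons_paradox` variants can be reproduced with a small table of counts. The numbers and the labels (`A`, `B`, `stratum_1`, `stratum_2`) are purely illustrative and are not benchmark data: configuration A wins within every stratum but loses once the strata are pooled.

```python
# (successes, trials) per configuration and stratum; illustrative counts only.
counts = {
    "A": {"stratum_1": (81, 87), "stratum_2": (192, 263)},
    "B": {"stratum_1": (234, 270), "stratum_2": (55, 80)},
}

def stratum_rate(config, stratum):
    """Success rate of one configuration within one stratum."""
    successes, trials = counts[config][stratum]
    return successes / trials

def pooled_rate(config):
    """Success rate of one configuration with all strata merged."""
    successes = sum(s for s, _ in counts[config].values())
    trials = sum(t for _, t in counts[config].values())
    return successes / trials

for stratum in ("stratum_1", "stratum_2"):
    a, b = stratum_rate("A", stratum), stratum_rate("B", stratum)
    print(f"{stratum}: A={a:.3f}  B={b:.3f}")

print(f"pooled:    A={pooled_rate('A'):.3f}  B={pooled_rate('B'):.3f}")
```

The reversal arises because A is mostly tried in the harder stratum and B in the easier one, so the pooled rates mix outcome quality with stratum difficulty.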

## License

This code is released for academic research purposes.
