# Code as a Cognitive Lever: Measuring Information in LLM Reasoning Strategies

Research repository for studying how LLMs encode and process information across different reasoning strategies (natural language, code execution, code simulation). We evaluate 48 algorithmic problem types across 9 models to quantify the relationship between reasoning modality and task performance.

## Overview

This repository implements three categories of experiments:

1. **Performance Experiments** (`src/exps_performance/`) — Evaluates LLM accuracy on algorithmic tasks using three reasoning arms: natural language (NL), code execution, and code simulation. Covers arithmetic, dynamic programming, graph algorithms, and NP-hard problems across varying difficulty levels (2–20 digits).

2. **Control Experiments** (`src/exps_control_again/`) — Tests indistinguishability of translated vs. native NL reasoning via source discrimination (judge classifies traces as "native" or "translated"), embedding-based separability analysis, and linear classifier probes.

3. **Functional Experiments** (`src/exps_functional/`) — Tests translation additivity: whether NL translated from code provides the same task-relevant information as native NL reasoning.

## Installation

### Prerequisites

- Python 3.10+
- [uv](https://github.com/astral-sh/uv) package manager (recommended)

### Setup

```bash
# Install dependencies
uv sync

# Or with pip
pip install -e .
```

### Environment Variables

Create a `.env` file with your API keys:

```bash
OPENAI_API_KEY=your_openai_key
OPENROUTER_API_KEY=your_openrouter_key  # For accessing LLMs via OpenRouter
```
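
Before launching a long run, it can help to confirm that both keys are visible. The snippet below is a minimal sketch, not part of the repository: `parse_env_text` and `missing_keys` are hypothetical helper names, and it assumes the `.env` file sits in the current working directory and uses plain `KEY=value` lines. How the experiment code itself loads the file is not stated here.

```python
import os
from pathlib import Path

# Key names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENROUTER_API_KEY")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and full-line comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment and optional surrounding quotes.
        value = value.split(" #", 1)[0].strip().strip('"').strip("'")
        values[key.strip()] = value
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")  # assumed location: current working directory
    file_values = parse_env_text(env_file.read_text()) if env_file.exists() else {}
    # Variables already exported in the shell take precedence over the file.
    merged = {**file_values, **os.environ}
    absent = missing_keys(merged)
    print("Missing keys:", ", ".join(absent) if absent else "none")
```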

## Reproducing Results

All experiments use the OpenRouter API backend. Results are written to each experiment's `results/` directory. Figures are generated by dedicated scripts after data collection.

### 1. Performance Experiments

Runs all 48 problem types across models with 3 reasoning arms (NL, code execution, code simulation).

```bash
# Run a single model + seed
uv run python src/exps_performance/main.py \
  --root src/exps_performance/ \
  --backend openrouter \
  --model "google/gemini-2.0-flash-001" \
  --n 60 \
  --digits 2 4 6 8 10 12 14 16 18 20 \
  --kinds spp bsp edp gcp gcp_d tsp tsp_d ksp msp clrs30 add sub mul lcs rod knap ilp_assign ilp_partition ilp_prod \
  --temperature 0.1 --top_p 0.90 \
  --exec_code --controlled_sim \
  --batch_size 256 --checkpoint_every 256 \
  --seed 0 --resume --exec_workers 4

# Run all models x seeds (SLURM script)
bash src/exps_performance/scripts/prod_all.sh
```

**Models used in the paper** (each with seeds 0, 1, 2):
- `anthropic/claude-haiku-4.5`
- `google/gemini-2.0-flash-001`
- `google/gemini-2.5-flash`
- `openai/gpt-4o-mini`
- `mistralai/codestral-2508`
- `mistralai/mixtral-8x22b-instruct`
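
The full grid is six models times three seeds, or 18 runs. As a rough illustration of that grid (not a replacement for `prod_all.sh`, whose contents are not shown here), the sketch below builds one command per model and seed. It assumes every run reuses the flags from the single-model example above; `build_command` and `COMMANDS` are hypothetical names.

```python
import shlex
from itertools import product

MODELS = [
    "anthropic/claude-haiku-4.5",
    "google/gemini-2.0-flash-001",
    "google/gemini-2.5-flash",
    "openai/gpt-4o-mini",
    "mistralai/codestral-2508",
    "mistralai/mixtral-8x22b-instruct",
]
SEEDS = [0, 1, 2]
DIGITS = [str(d) for d in range(2, 21, 2)]  # 2, 4, ..., 20
KINDS = (
    "spp bsp edp gcp gcp_d tsp tsp_d ksp msp clrs30 add sub mul "
    "lcs rod knap ilp_assign ilp_partition ilp_prod"
).split()


def build_command(model: str, seed: int) -> list[str]:
    """Build the argument list for one model/seed run (flags copied from the example above)."""
    return [
        "uv", "run", "python", "src/exps_performance/main.py",
        "--root", "src/exps_performance/",
        "--backend", "openrouter",
        "--model", model,
        "--n", "60",
        "--digits", *DIGITS,
        "--kinds", *KINDS,
        "--temperature", "0.1", "--top_p", "0.90",
        "--exec_code", "--controlled_sim",
        "--batch_size", "256", "--checkpoint_every", "256",
        "--seed", str(seed), "--resume", "--exec_workers", "4",
    ]


# One command per (model, seed) pair.
COMMANDS = [build_command(model, seed) for model, seed in product(MODELS, SEEDS)]

if __name__ == "__main__":
    for command in COMMANDS:
        print(shlex.join(command))
```

Printing the commands rather than executing them keeps the sketch safe to run without API keys; the output can be pasted into a shell or a job scheduler.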

Results are saved as JSONL files under `src/exps_performance/results/<model>_seed<N>/`.

#### Generate performance figures

After collecting results:

```bash
# Main accuracy vs. difficulty + statistical analysis
uv run python src/exps_performance/scripts/statistical_analysis.py

# Accuracy vs. hardness plots
uv run python src/exps_performance/scripts/plot_accuracy_vs_hardness.py
```

### 2. Source Discrimination (Control)

Tests whether a judge model can distinguish native NL reasoning from code-to-NL translations.

```bash
# Run discrimination experiment
uv run python src/exps_control_again/run_source_discrimination.py \
  --n_samples 200 --seed 42
```

Requires performance experiment results in `src/exps_performance/results/` as source data.
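
If translated traces are indistinguishable from native ones, the judge's accuracy should sit near chance (50% with balanced classes). The sketch below shows one way to test that with an exact two-sided binomial test, using only the standard library. It is an illustration and not necessarily the analysis the repository performs; the function names are hypothetical, and the labels `"native"` and `"translated"` follow the judge description above.

```python
from math import comb


def judge_accuracy(true_labels: list[str], predicted_labels: list[str]) -> float:
    """Fraction of traces whose source the judge identified correctly."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis of success rate p. Suitable for
    a few hundred trials; much larger counts may overflow float conversion.
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance so ties are counted despite rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    # Made-up counts for illustration: 108 correct out of 200 judged traces.
    print(f"p-value vs. chance: {binomial_two_sided_p(108, 200):.3f}")
```

A large p-value means the judge's accuracy is statistically compatible with guessing, which is the outcome the indistinguishability claim predicts.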

#### Generate discrimination figures

```bash
# Discrimination by model bar plot
uv run python src/exps_control_again/scripts/generate_discrimination_plot.py

# Discrimination by task breakdown
uv run python src/exps_control_again/scripts/generate_discrimination_by_task.py

# Judge discrimination bar plot
uv run python src/exps_control_again/scripts/plot_judge_discrimination.py

# Embedding analysis
uv run python src/exps_control_again/scripts/embedding_analysis_large.py

# Linear classifier probe
uv run python src/exps_control_again/scripts/embedding_linear_classifier.py

# Translator separability
uv run python src/exps_control_again/scripts/translator_separability_experiment.py
```

### 3. Translation Additivity (Functional)

Tests whether translated NL provides the same information boost as native NL.

```bash
# Run additivity experiment
uv run python src/exps_functional/run_translation_additivity.py \
  --model "gemini-2.0-flash" --n_samples 100 --seed 42
```

Requires performance experiment results in `src/exps_performance/results/` as source data.
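
One way to read "same information boost" is to compare each condition's accuracy gain over a shared baseline. The sketch below takes that reading as an assumption; the repository may define additivity differently. The condition names (`baseline`, `native`, `translated`) and the function names are hypothetical, and each input is a list of per-problem correctness flags.

```python
def accuracy(outcomes: list[bool]) -> float:
    """Fraction of correct answers."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def boost_gap(
    baseline: list[bool], native: list[bool], translated: list[bool]
) -> dict[str, float]:
    """Compare the accuracy gain from native NL with the gain from translated NL.

    Each boost is the accuracy with that kind of reasoning trace minus the
    accuracy without any trace. A gap near zero is what "same boost" predicts.
    """
    base = accuracy(baseline)
    native_boost = accuracy(native) - base
    translated_boost = accuracy(translated) - base
    return {
        "native_boost": native_boost,
        "translated_boost": translated_boost,
        "gap": native_boost - translated_boost,
    }


if __name__ == "__main__":
    # Toy flags for illustration only; not experimental data.
    print(boost_gap(
        baseline=[True, False, False, False],
        native=[True, True, True, False],
        translated=[True, True, False, False],
    ))
```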

#### Generate additivity figures

```bash
uv run python src/exps_functional/scripts/plot_translation_additivity.py
```

### 4. Run Tests

```bash
uv run pytest tests/
```

## Project Structure

```
├── src/
│   ├── exps_performance/             # LLM performance benchmark
│   │   ├── main.py                   # Main experiment runner
│   │   ├── arms.py                   # Reasoning strategies (NL, code, sim)
│   │   ├── analysis.py               # Statistical analysis utilities
│   │   ├── logger.py                 # JSONL logging & checkpointing
│   │   ├── problems/                 # Problem generators
│   │   │   ├── finegrained.py        #   Arithmetic (add/sub/mul), DP, ILP
│   │   │   ├── clrs.py               #   CLRS algorithm problems (48 types)
│   │   │   └── nphardeval.py         #   NP-hard problems (TSP, GCP, etc.)
│   │   ├── clrs/                     # CLRS algorithm implementations
│   │   └── scripts/                  # Production & analysis scripts
│   │
│   ├── exps_control_again/           # Source discrimination experiments
│   │   ├── run_source_discrimination.py  # Main discrimination experiment
│   │   ├── prompts/                  # Judge & translator prompts
│   │   │   ├── source_classifier.md  # Discrimination judge prompt
│   │   │   └── translator_native_10shot.md  # Code-to-NL translator prompt
│   │   └── scripts/                  # Embedding & classifier analyses
│   │
│   └── exps_functional/              # Functional property experiments
│       ├── run_translation_additivity.py  # Translation additivity test
│       └── scripts/                  # Plot generation
│
├── tests/
│   ├── unit/                         # Unit tests
│   ├── integration/                  # Integration tests
│   └── logistic/                     # Logistic regression tests
│
├── pyproject.toml                    # Project configuration
└── license                           # MIT License
```

## License

MIT License
