# Tree of Thought with Iterative Self-Correction

Tree-structured reasoning with iterative self-correction across four autonomy levels (L1-L4).

## Overview

- Tree of Thought (ToT): Multi-path reasoning with branching
- Iterative Self-Correction: Error detection and regeneration
- Autonomy Levels: L1 (Oracle), L2 (Binary Feedback), L3 (Full Autonomy), L4 (Historical Context)
- Chain Caching: Shared initial chains across autonomy levels

## Quick Start

### Batch Evaluation (Tree + Self-Correction)

```bash
# L3 (Full Autonomy) on MATH-500
python batch_eval.py \
    --autonomy-level 3 \
    --dataset math500 \
    --model llama8b \
    --gpus 0,1 \
    --tensor-parallel-size 2 \
    --n-problems 100

# L2 (Binary Feedback) on GSM8K
python batch_eval.py \
    --autonomy-level 2 \
    --dataset gsm8k \
    --model qwen7b \
    --gpus 0,1 \
    --tensor-parallel-size 2

# L1 (Oracle) on all AMC23 problems
python batch_eval.py \
    --autonomy-level 1 \
    --dataset amc23 \
    --model llama8b \
    --gpus 0,1 \
    --tensor-parallel-size 2 \
    --n-problems 40

# L3 with Incremental Error Detection (step-by-step verification)
python batch_eval.py \
    --autonomy-level 3 \
    --dataset math500 \
    --model llama8b \
    --gpus 0,1 \
    --tensor-parallel-size 2 \
    --n-problems 100 \
    --error-detection incremental

# L3 without shared prefix (step-level error localization + full regeneration)
# Useful for ablation studies: tests if prefix preservation is key to ToT's success
python batch_eval.py \
    --autonomy-level 3 \
    --dataset aime \
    --model llama8b \
    --gpus 0,1 \
    --tensor-parallel-size 2 \
    --n-problems 100 \
    --no-shared-prefix
```

### Baseline CoT Evaluation

```bash
# Single-shot CoT (no iteration)
python baseline_cot_eval.py \
    --baseline-type single \
    --dataset math500 \
    --model llama8b \
    --gpus 0,1 \
    --tensor-parallel-size 2 \
    --n-problems 100

# Iterative CoT with L3 autonomy
python baseline_cot_eval.py \
    --baseline-type iterative_l3 \
    --dataset gsm8k \
    --model qwen7b \
    --gpus 0,1 \
    --tensor-parallel-size 2

# Majority vote (generate 10 samples, take majority)
python baseline_cot_eval.py \
    --baseline-type majority_vote \
    --dataset math500 \
    --model llama8b \
    --gpus 0,1 \
    --tensor-parallel-size 2 \
    --max-iterations 10
```
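The majority-vote baseline reduces to a frequency count over the sampled answers. A minimal sketch, assuming answers arrive as strings (or `None` when no answer could be extracted); `majority_vote` is a hypothetical helper, not necessarily how `baseline_cot_eval.py` implements it:

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none.

    Ties go to the answer that appeared first among the samples.
    """
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    # Counter.most_common orders equal counts by first occurrence
    return Counter(cleaned).most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_vote(["42", "41", "42", None, " 42 "]))
```

Exact string matching is the simplest choice here; a real comparison of math answers may need normalization (for example, treating `0.5` and `\frac{1}{2}` as equal).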

## Autonomy Levels

### L1: Oracle
- Model sees the **correct answer** during error detection
- Upper bound on performance
- Prompt: "The correct answer should be {ground_truth}"

### L2: Binary Feedback
- Model is told its answer is **wrong** but not what the correct answer is
- Useful when verifiers are available (e.g., code execution)
- Prompt: "Your answer is incorrect. Identify where the error occurred..."

### L3: Full Autonomy
- Model must **self-detect** errors
- No external feedback (ground truth or verifier signal) is provided
- Prompt: "Verify your reasoning. If you find errors, identify the step..."

### L4: Historical Context
- Like L3, but includes **context from previous failed attempts**
- Tests if learning from error history improves correction
- Provides previous chain and error analysis during regeneration
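The four levels differ mainly in what the error-detection prompt reveals. A minimal sketch of that dispatch, using the prompt fragments quoted above (their elided endings are left as-is). The function name, its parameters, and the way L4 history is formatted are assumptions for illustration, not the code in `iterative_self_correction.py`:

```python
def build_detection_prompt(level, ground_truth=None, history=None):
    """Return the error-detection instruction for an autonomy level (1-4)."""
    if level == 1:
        if ground_truth is None:
            raise ValueError("L1 (Oracle) needs the ground-truth answer")
        return f"The correct answer should be {ground_truth}"
    if level == 2:
        return "Your answer is incorrect. Identify where the error occurred..."
    if level in (3, 4):
        prompt = "Verify your reasoning. If you find errors, identify the step..."
        if level == 4 and history:
            # Hypothetical formatting of earlier failed attempts
            past = "\n".join(
                f"Attempt {i}: {note}" for i, note in enumerate(history, start=1)
            )
            prompt = f"Previous failed attempts:\n{past}\n\n{prompt}"
        return prompt
    raise ValueError(f"Unknown autonomy level: {level}")


if __name__ == "__main__":
    print(build_detection_prompt(4, history=["sign error in step 3"]))
```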

## Supported Models

Specify with `--model`:

| Model | Size | Context | HuggingFace ID |
|-------|------|---------|----------------|
| `llama8b` (default) | 8B | 131K | `meta-llama/Llama-3.1-8B-Instruct` |
| `qwen7b` | 7B | 32K | `Qwen/Qwen2.5-7B-Instruct` |
| `llama3b` | 3B | 131K | `meta-llama/Llama-3.2-3B-Instruct` |
| `qwen2b` | 3B | 32K | `Qwen/Qwen2.5-3B-Instruct` |
| `phi4b` | 3.8B | 131K | `microsoft/Phi-3.5-mini-instruct` |

## Supported Datasets

Specify with `--dataset`:

| Dataset | Size | Source | Description |
|---------|------|--------|-------------|
| `math500` (default) | 500 | `HuggingFaceH4/MATH-500` | Competition math, levels 1-5 |
| `gsm8k` | 1,319 test | `gsm8k` | Grade school word problems |
| `amc23` | 40 | `math-ai/amc23` | AMC competition problems |
| `aime` | TBD | TBD | AIME competition problems |

Filter MATH-500 by level: `--level 5` (only level-5 problems)

## Key Arguments

### batch_eval.py
```bash
--autonomy-level {1,2,3,4}     # Required: Autonomy level
--gpus GPUS                    # Required: GPU IDs (e.g., "0,1")
--tensor-parallel-size N       # Required: Number of GPUs
--dataset {math500,gsm8k,amc23,aime}  # Dataset (default: math500)
--level LEVEL                  # MATH-500 difficulty filter (1-5)
--model {llama8b,qwen7b,...}  # Model to use (default: llama8b)
--n-problems N                 # Number of problems (default: 100)
--max-iterations N             # Max correction iterations (default: 10)
--error-detection {batch,incremental}  # Error detection: batch (default) or incremental
--no-shared-prefix             # Regenerate full solution (default: preserve correct prefix)
--experiment-name NAME         # Custom experiment name
```

### baseline_cot_eval.py
```bash
--baseline-type TYPE           # Required: single, iterative_l{1,2,3,4}, or majority_vote
--gpus GPUS                    # Required: GPU IDs
--tensor-parallel-size N       # Required: Number of GPUs
--dataset DATASET              # Same as batch_eval.py
--model MODEL                  # Same as batch_eval.py
--n-problems N                 # Number of problems (default: 100)
--max-iterations N             # For iterative/majority_vote (default: 10)
--temperature T                # Sampling temperature (default: 1.0)
--use-cached-chains            # Use cached ToT chains (default: True)
--regenerate-cot               # Force regenerate CoT chains
```

## Output Structure

Results are saved in `experiments/` with timestamp:

```
experiments/
  eval_L3_Autonomous_20251029_120000/
    ├── config.json              # Experiment configuration
    ├── results.json             # Problem-level results
    ├── summary.json             # Accuracy, iteration stats
    ├── iteration_curves.json    # Success by iteration
    └── results.log              # Execution log
```

### results.json format
```json
{
  "problem_id": "...",
  "correct": true,
  "num_iterations": 3,
  "initial_chain": [...],
  "iterations": [
    {
      "iteration": 1,
      "answer": "42",
      "error_step": 5,
      "chain": [...]
    }
  ]
}
```
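Records in this shape can be aggregated into accuracy and a success-by-iteration curve. A minimal sketch, assuming `results.json` holds a list of such records and that `num_iterations` is the number of correction iterations used before the final answer (0 meaning the initial chain). `summarize` is a hypothetical helper, not the code that writes `summary.json`:

```python
from collections import Counter


def summarize(results):
    """Compute accuracy and cumulative success rate by iteration."""
    n = len(results)
    if n == 0:
        return {"accuracy": 0.0, "cumulative_success": []}
    solved = [r for r in results if r["correct"]]
    # How many problems were solved after exactly k iterations
    solved_at = Counter(r["num_iterations"] for r in solved)
    curve, running = [], 0
    for k in range(max(solved_at, default=0) + 1):
        running += solved_at.get(k, 0)
        curve.append(running / n)
    return {"accuracy": len(solved) / n, "cumulative_success": curve}


if __name__ == "__main__":
    demo = [
        {"correct": True, "num_iterations": 0},
        {"correct": True, "num_iterations": 2},
        {"correct": False, "num_iterations": 3},
    ]
    print(summarize(demo))
```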

## How It Works

### 1. Tree of Thought Generation
- The model generates text until it emits the `</thought>` stop token
- Nodes containing `\boxed{answer}` are marked as complete
- Single-path mode (`n_rollouts=1`) is used for iterative correction
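Detecting a completed node comes down to finding a `\boxed{...}` span, which needs brace matching because answers such as `\frac{1}{2}` contain nested braces. A minimal sketch; `find_boxed` is a hypothetical stand-in and may differ from the repository's `extract_boxed_answer`:

```python
def find_boxed(text):
    r"""Return the contents of the last \boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never closed


if __name__ == "__main__":
    print(find_boxed(r"Therefore the answer is \boxed{\frac{1}{2}}."))
```

Taking the last occurrence is a design choice: intermediate steps sometimes box partial results, and the final box is usually the intended answer.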

### 2. Iterative Self-Correction Loop

```
1. Generate initial chain (Tree of Thought)
2. Extract answer from chain
3. If correct or max iterations reached → stop
4. Error Detection (varies by autonomy level):
   - L1: Model sees correct answer
   - L2: Model told answer is wrong
   - L3: Model must self-verify
   - L4: Model gets historical context
5. Regeneration (configurable with --no-shared-prefix):
   - Default: Backtrack to prefix before error, regenerate from there
   - With --no-shared-prefix: Regenerate entire solution from scratch
6. Repeat from step 2
```
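The loop above can be sketched in Python with the model-dependent parts passed in as callables. All names here are hypothetical: `generate(prefix)` is assumed to return a full chain (a list of steps) that starts with `prefix`, and `detect_error(chain)` is assumed to return the 0-based index of the first wrong step.

```python
def self_correct(generate, extract_answer, is_correct, detect_error,
                 max_iterations=10, shared_prefix=True):
    """Run the correction loop and return the final state."""
    chain = generate([])  # step 1: initial chain
    iteration = 0
    while True:
        answer = extract_answer(chain)  # step 2
        ok = is_correct(answer)
        if ok or iteration >= max_iterations:  # step 3
            return {"correct": ok, "num_iterations": iteration, "chain": chain}
        error_step = detect_error(chain)  # step 4
        # step 5: keep the steps before the error, or start from scratch
        prefix = chain[:error_step] if shared_prefix else []
        chain = generate(prefix)
        iteration += 1  # step 6: back to step 2


if __name__ == "__main__":
    attempts = []

    def toy_generate(prefix):
        # First call produces a flawed chain; later calls fix it
        attempts.append(list(prefix))
        if len(attempts) == 1:
            return ["15 + 27", "= 41", "41"]
        return list(prefix) + ["= 42", "42"]

    print(self_correct(toy_generate, lambda c: c[-1],
                       lambda a: a == "42", lambda c: 1))
```

With `shared_prefix=False` the same loop corresponds to the `--no-shared-prefix` ablation: the error is still localized, but the regenerated chain does not reuse any earlier steps.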

### 3. Error Detection Methods

**Batch (Default)**: Single-pass error detection
- Model receives entire reasoning chain at once
- Identifies which step contains the error in one model call
- Faster but model must analyze all steps simultaneously

**Incremental**: Step-by-step verification
- Model traverses chain from top to bottom
- Verifies each step sequentially in context of previous steps
- Multiple model calls (one per step) until error found
- More explicit granularity but slower
- Useful for studying whether focused attention improves error detection
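The difference between the two modes is mostly in how many verifier calls are made and what each call sees. A minimal sketch with the model call abstracted as a callable; `detect_batch`, `detect_incremental`, and the callable signatures are assumptions for illustration:

```python
def detect_batch(chain, find_error):
    """One call: find_error sees the whole chain, returns an index or None."""
    return find_error(chain)


def detect_incremental(chain, step_ok):
    """Up to len(chain) calls: check each step given the steps before it."""
    for i, step in enumerate(chain):
        if not step_ok(chain[:i], step):
            return i  # stop at the first step judged wrong
    return None  # no error found


if __name__ == "__main__":
    def toy_step_ok(previous_steps, step):
        # Toy verifier: each step is "a+b=c"; check the arithmetic
        lhs, rhs = step.split("=")
        return sum(int(x) for x in lhs.split("+")) == int(rhs)

    print(detect_incremental(["1+1=2", "2+2=5", "5+1=6"], toy_step_ok))
```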

### 4. Chain Caching
- Initial chains cached for efficiency
- Shared across autonomy levels for fair comparison
- Cache key: model, dataset, n_problems, seed, temperature
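Because the autonomy level is not part of the key, every level reads the same initial chains. One way to derive such a key is to hash the listed fields; this is a sketch under that assumption, and `chain_cache_key` is hypothetical rather than the scheme used in `chain_cache.py`:

```python
import hashlib
import json


def chain_cache_key(model, dataset, n_problems, seed, temperature):
    """Derive a stable key from the settings that determine initial chains."""
    payload = json.dumps(
        {
            "model": model,
            "dataset": dataset,
            "n_problems": n_problems,
            "seed": seed,
            "temperature": temperature,
        },
        sort_keys=True,  # fixed field order so equal settings hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    print(chain_cache_key("llama8b", "math500", 100, 0, 1.0))
```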

## Programmatic Usage

```python
from iterative_self_correction import extract_boxed_answer
from tree_of_thought import (
    ToTAgent,
    ToTEnvironment,
    TreeSearch,
    get_completed_paths,
    initialize_model,
)

# Initialize model
manager = initialize_model(
    model_name="llama8b",
    gpu_ids=[0, 1],
    tensor_parallel_size=2
)

# Create tree search components
agent = ToTAgent(manager, temperature=1.0)
env = ToTEnvironment(max_depth=100)
# n_rollouts=1 is the single-path mode used for iterative correction
search = TreeSearch(agent, env, strategy="dfs", n_rollouts=1)

# Generate reasoning tree
root = search.search("What is 15 + 27?", verbose=True)

# Get completed paths; fail loudly if none finished (see Troubleshooting)
paths = get_completed_paths(root)
if not paths:
    raise RuntimeError("No completed paths found")

# Extract the answer from the last node of the first completed path
answer = extract_boxed_answer(paths[0][-1])
```

## Files

### Core Implementation
- `tree_of_thought.py` - Tree generation with agent/environment API
- `iterative_self_correction.py` - Self-correction loop with L1-L4
- `chain_cache.py` - Initial chain caching system

### Evaluation Scripts
- `batch_eval.py` - Main evaluation with tree + self-correction
- `baseline_cot_eval.py` - Baseline CoT comparisons
- `batch_eval_with_cache.py` - Older cached evaluation (L1-L4)

### Utilities
- `dataset_loaders.py` - Unified dataset loading
- `plot_iteration_curves.py` - Plot success rates by iteration

### Documentation
- `tree_of_thought_formalization.md` - Mathematical formalization

## Performance

- **Single GPU**: 8B models need ~15GB VRAM
- **Tensor Parallel**: 2x GPUs for faster inference
- **Generation Speed**: ~2-3 steps/second with vLLM
- **Chain Caching**: First run generates chains, subsequent runs reuse
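The ~15GB figure is roughly what 8B parameters occupy at 16-bit precision. A back-of-the-envelope sketch, assuming 2 bytes per parameter and an even split across tensor-parallel ranks. It covers weights only, so KV cache, activations, and whatever the inference engine pre-allocates come on top:

```python
def weight_memory_gib(n_params, bytes_per_param=2, tensor_parallel_size=1):
    """Approximate per-GPU memory for model weights, in GiB."""
    return n_params * bytes_per_param / tensor_parallel_size / 2**30


if __name__ == "__main__":
    # 8B parameters at 16-bit precision, on one GPU and sharded over two
    print(f"{weight_memory_gib(8e9):.1f} GiB")
    print(f"{weight_memory_gib(8e9, tensor_parallel_size=2):.1f} GiB")
```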

## Troubleshooting

**Out of Memory**: Use a smaller model (`qwen2b`, `llama3b`), or increase `--tensor-parallel-size` (with matching `--gpus`) so the weights are sharded across more GPUs

**No Completed Paths**: Check that the model emits answers in `\boxed{}` format, or increase `max_depth`

**Slow Generation**: Ensure `VLLM_USE_V1=1` is set, check GPU utilization with `nvidia-smi`

See `tree_of_thought_formalization.md` for the formal framework definition.
