# Setup Instructions for ICLR Code Submission

This document provides quick setup instructions for reviewers and researchers to run the entropy-based early stopping experiments.

## Quick Start (5 minutes)

### 1. Install Dependencies
```bash
cd iclr_code
pip install -r requirements.txt
```

### 2. Add Your API Key
In each experiment template you plan to run, replace the placeholder key with your own:
```python
OPENROUTER_API_KEY = "your_openrouter_key"  # Replace with actual key
```
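
If you would rather not hardcode the key in a template, one option is to read it from an environment variable. This is a minimal sketch rather than how the shipped templates work; the helper name `load_api_key` and its error behavior are assumptions made for illustration.

```python
import os

# Placeholder string shown in the templates; treated here as "no key set".
PLACEHOLDER = "your_openrouter_key"


def load_api_key(env_var: str = "OPENROUTER_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing.

    Hypothetical helper: the templates assign the key directly in the file.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set {env_var} to a real key before running an experiment.")
    return key
```

With this in place, you could export the variable in your shell and write `OPENROUTER_API_KEY = load_api_key()` in a template instead of pasting the key into the file.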

### 3. Run a Quick Test
```bash
# Test with small sample (uses built-in sample problems)
python templates/aime24_experiment.py --model gpt-3.5-turbo --problems 3
```

## File Overview

| File | Purpose |
|------|---------|
| `entropy_framework.py` | Core implementation of Shannon entropy calculation and threshold methods |
| `templates/aime24_experiment.py` | AIME'24 mathematical reasoning experiment |
| `templates/aime25_experiment.py` | AIME'25 cross-year validation |
| `templates/gpqa_experiment.py` | GPQA Diamond scientific reasoning |
| `analysis/visualization_toolkit.py` | Publication-quality plotting tools |
| `README.md` | Complete documentation |

## Key Results to Reproduce

The paper reports the following ranges; a successful reproduction should land close to them:

### AIME'24 Results
- **Token Savings**: 25-45% across models
- **Threshold Accuracy**: 88-100% for reasoning-optimized models  
- **Cohen's d**: 1.5-2.0 for mathematical problems
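
To check an effect size by hand, Cohen's d can be computed from two groups of per-answer entropies. The sketch below uses the standard pooled-standard-deviation form and assumes the two groups are the entropies of incorrect and correct answers; the function name and sample values are illustrative and are not taken from `entropy_framework.py`.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (sample variances, n - 1)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Made-up entropies in bits, for illustration only.
incorrect = [1.2, 1.5, 1.1, 1.6]
correct = [0.4, 0.6, 0.5, 0.7]
d = cohens_d(incorrect, correct)
print(f"Cohen's d = {d:.2f}")
```

A positive value here means incorrect answers have higher mean entropy than correct ones, which is the direction the method relies on.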

### GPQA Diamond Results
- **Token Savings**: 35-50% across models
- **Cross-Domain Consistency**: Effect sizes > 0.7 across Physics/Chemistry/Biology
- **Threshold Accuracy**: 92-95% for advanced models

### Key Findings
1. **Emergent Property**: Standard instruction-tuned models (like Llama 3.3 70B) show negligible entropy discrimination (Cohen's d < 0.2)
2. **Reasoning Models**: Models optimized for reasoning during post-training show strong entropy-based confidence (Cohen's d > 0.5)
3. **Universal Thresholds**: The mean entropy of correct answers provides a reliable threshold across domains
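
The threshold rule in the third finding can be sketched as follows. This assumes lower entropy signals higher confidence and that calibration uses a labeled set of answers; the function names are hypothetical and the framework's own interface may differ.

```python
from statistics import mean


def calibrate_threshold(entropies, is_correct):
    """Threshold = mean entropy over the calibration answers marked correct."""
    correct_entropies = [h for h, ok in zip(entropies, is_correct) if ok]
    if not correct_entropies:
        raise ValueError("no correct answers in the calibration set")
    return mean(correct_entropies)


def should_stop_early(entropy, threshold):
    """Stop spending tokens when the answer entropy is at or below the threshold."""
    return entropy <= threshold


# Illustrative calibration data (bits); not results from the paper.
threshold = calibrate_threshold(
    [0.3, 0.5, 1.4, 0.4, 1.8],
    [True, True, False, True, False],
)
```

An answer with entropy below `threshold` would be accepted without further sampling, while a higher-entropy answer would continue to the full reasoning budget.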

## Expected Runtime

- **Sample Problems** (3-5 problems): 2-5 minutes
- **Full AIME Dataset** (30 problems): 15-30 minutes  
- **Full GPQA Diamond Set** (198 problems): 45-90 minutes

Times depend on model speed and API rate limits.

## Troubleshooting

### Common Issues
1. **API Errors**: Check your OpenRouter/OpenAI API key and credits
2. **Import Errors**: Run `pip install -r requirements.txt`
3. **No Logprobs**: Some models do not return logprobs; try GPT-3.5-turbo or GPT-4
4. **Low Effect Sizes**: Expected for non-reasoning-optimized models
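
To tell a missing-logprobs problem apart from other API errors, it can help to inspect one raw response. The sketch below assumes an OpenAI-style chat completion payload that has already been parsed into a dict, with token logprobs under `choices[0]["logprobs"]["content"]`; the helper name is hypothetical, and other providers may lay the payload out differently.

```python
def has_logprobs(response: dict) -> bool:
    """Return True if the first choice carries per-token logprobs.

    Assumes an OpenAI-style chat completion payload parsed to a dict.
    """
    choices = response.get("choices") or []
    if not choices:
        return False
    # Models without logprob support typically return null for this field.
    logprobs = choices[0].get("logprobs") or {}
    return bool(logprobs.get("content"))
```

If this returns `False` for a model, entropy cannot be computed from its output and a different model is needed.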

### Model Recommendations

**Works Well** (Strong entropy discrimination):
- GPT-4, GPT-3.5-turbo
- GPT OSS series (if available)
- Qwen3-30B-A3B-Instruct
- Other reasoning-optimized models

**Limited Results** (Weak entropy discrimination):
- Base pretrained models  
- Standard instruction-tuned models without reasoning optimization
- Very small models (< 7B parameters)

## Directory Structure

```
iclr_code/
├── entropy_framework.py           # Core framework
├── templates/                     # Experiment templates
│   ├── aime24_experiment.py      # Mathematical reasoning
│   ├── aime25_experiment.py      # Cross-year validation  
│   └── gpqa_experiment.py        # Scientific reasoning
├── analysis/                      # Analysis tools
│   └── visualization_toolkit.py  # Plotting utilities
├── data/                         # For your datasets
├── experiments/                  # Experiment outputs
├── requirements.txt              # Dependencies
├── README.md                     # Full documentation
└── SETUP_INSTRUCTIONS.md        # This file
```

## Validation Checklist

To verify the implementation works correctly:

- [ ] Framework loads without errors
- [ ] Sample experiment runs successfully  
- [ ] Entropy calculations produce reasonable values (0.1-2.0 bits typically)
- [ ] Threshold calibration completes
- [ ] Visualizations generate correctly
- [ ] Results match the patterns reported in the paper
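
For the entropy check, a quick independent calculation can confirm that values are in bits and in a plausible range. This is a minimal sketch, assuming the inputs are natural-log probabilities for the top-k candidate tokens and that renormalizing over those k candidates is acceptable; `entropy_framework.py` may handle the truncated tail differently.

```python
import math


def entropy_bits(top_logprobs):
    """Shannon entropy in bits from natural-log probabilities.

    Probabilities are renormalized because top-k logprobs rarely sum to 1.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have positive mass")
    probs = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A uniform distribution over four candidates gives 2 bits and a single certain candidate gives 0 bits, so values far outside the typical range above suggest a unit mismatch (nats versus bits) or a parsing problem.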

## Support

For technical issues with the code:
1. Check the README.md for detailed documentation
2. Verify your environment meets the requirements
3. Test with smaller datasets first
4. Ensure API keys are correctly configured

This implementation provides a complete, reproducible framework for entropy-based early stopping in LLM reasoning tasks as described in our paper.