# Supplementary Code for ICLR 2026 Submission

This repository contains the code for the paper "Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning", submitted to ICLR 2026.

**Note**: This code is provided as supplementary material for anonymous review. All identifying information has been removed to maintain double-blind review requirements.

## Overview

Our framework uses the Shannon entropy of token-level log probabilities (logprobs) as a confidence signal for early stopping in large language model reasoning. In our experiments, stopping once the model is confident saves 25-50% of computation while maintaining task accuracy.

## Key Features

- **Universal Framework**: Works across model families and reasoning domains
- **Four Threshold Methods**: Entropy Mean, Information-Theoretic Optimal, Bayesian Optimal, and Scale-Invariant Universal
- **Cross-Domain Validation**: Tested on mathematical competition problems (AIME) and graduate-level scientific reasoning (GPQA Diamond)
- **Emergent Confidence Calibration**: Demonstrates that entropy-based confidence represents an emergent property of advanced post-training optimization

## Repository Structure

```
iclr_code/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
├── entropy_framework.py               # Core entropy calculation and framework
├── templates/                         # Experiment templates
│   ├── aime24_experiment.py          # AIME'24 experiment template
│   ├── aime25_experiment.py          # AIME'25 experiment template
│   └── gpqa_experiment.py            # GPQA Diamond experiment template
├── analysis/                          # Analysis and visualization tools
│   └── visualization_toolkit.py      # Publication-quality plotting utilities
├── experiments/                       # Directory for experiment outputs
└── data/                             # Directory for datasets (user-provided)
```

## Installation

1. Extract the supplementary code package and change into its directory:
```bash
cd iclr_code
```

2. Install required dependencies:
```bash
pip install -r requirements.txt
```

3. Set up your API keys:
   - Edit the experiment templates to add your OpenRouter API key
   - Replace `"your_openrouter_key"` with your actual API key

## Quick Start

### 1. Basic Framework Usage

```python
from entropy_framework import EarlyStoppingFramework

# Initialize framework
framework = EarlyStoppingFramework()

# Example calibration data
calibration_data = [
    {'logprobs': [[...]], 'correct': True},
    {'logprobs': [[...]], 'correct': False},
    # ... more examples
]

# Calibrate threshold
stats = framework.calibrate(calibration_data, method="entropy_mean")

# Test stopping decision
test_logprobs = [[...]]  # Your token logprobs
result = framework.should_stop_early(test_logprobs, method="entropy_mean")

print(f"Should stop: {result.should_stop}")
print(f"Confidence: {result.confidence:.3f}")
```

### 2. Run AIME'24 Experiment

```bash
python templates/aime24_experiment.py --model gpt-4 --api-key YOUR_API_KEY
```

### 3. Run GPQA Diamond Experiment

```bash
python templates/gpqa_experiment.py --model gpt-4 --num-problems 10
```

### 4. Create Visualizations

```python
from analysis.visualization_toolkit import EntropyVisualization

viz = EntropyVisualization()

# Plot entropy distributions
viz.plot_entropy_distributions(
    correct_entropies=[...], 
    incorrect_entropies=[...],
    save_path="entropy_dist"
)
```

## Experiment Templates

### AIME'24 Template (`templates/aime24_experiment.py`)

Demonstrates the framework on mathematical competition problems.

**Usage:**
```bash
python templates/aime24_experiment.py --model MODEL_NAME [--api-key API_KEY] [--problems N]
```

**Features:**
- 4-step sequential reasoning process
- Automatic answer extraction and evaluation
- Comprehensive threshold analysis
- Token savings calculation
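
The interplay between the sequential steps and the token savings figure can be sketched as follows. This is a minimal illustration for offline analysis of logged runs, not the template's actual code: the function names `first_stop_step` and `token_savings` are hypothetical, and it assumes that reasoning stops at the first step whose entropy is at or below the threshold and that savings are measured against running every step.

```python
from typing import Sequence


def first_stop_step(step_entropies: Sequence[float], threshold: float) -> int:
    """Return the 1-indexed step after which reasoning stops.

    Assumption: we stop at the first step whose entropy is at or below
    the threshold. If no step qualifies, every step is run.
    """
    for step, entropy in enumerate(step_entropies, start=1):
        if entropy <= threshold:
            return step
    return len(step_entropies)


def token_savings(step_tokens: Sequence[int], stop_step: int) -> float:
    """Fraction of tokens saved by stopping after `stop_step` steps.

    Assumption: the baseline is the total token count of all logged steps.
    """
    total = sum(step_tokens)
    if total == 0:
        return 0.0
    used = sum(step_tokens[:stop_step])
    return 1.0 - used / total


# Usage with made-up numbers: four steps of 1000 tokens each.
entropies = [1.2, 0.9, 0.4, 0.3]
tokens = [1000, 1000, 1000, 1000]
stop = first_stop_step(entropies, threshold=0.5)
print(stop, token_savings(tokens, stop))  # 3 0.25
```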

### AIME'25 Template (`templates/aime25_experiment.py`)

Extends the AIME'24 template with cross-year validation.

**Usage:**
```bash
python templates/aime25_experiment.py --model MODEL_NAME [--steps 4] [--step-tokens 8192]
```

**Features:**
- Enhanced statistical analysis
- Cross-year consistency validation
- Detailed entropy statistics

### GPQA Diamond Template (`templates/gpqa_experiment.py`)

Validates the framework on graduate-level scientific reasoning.

**Usage:**
```bash
python templates/gpqa_experiment.py --model MODEL_NAME [--num-problems N]
```

**Features:**
- Cross-domain analysis (Physics, Chemistry, Biology)
- Multiple choice answer extraction
- Subject-wise performance breakdown
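
As a rough illustration of the last two features, the sketch below extracts a multiple-choice letter and groups accuracy by subject. It is not the template's implementation: the names `extract_choice` and `accuracy_by_subject` are hypothetical, and it assumes the model states its choice as "answer is (B)" or "Answer: B" with an uppercase letter A-D, taking the last such statement as final.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# "answer" is matched case-insensitively; the choice letter must be uppercase
# and must not be the first letter of a longer word.
_CHOICE = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(text: str) -> Optional[str]:
    """Return the last stated choice letter, or None if none is found."""
    matches = _CHOICE.findall(text)
    return matches[-1] if matches else None


def accuracy_by_subject(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per subject from (subject, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [num_correct, num_total]
    for subject, is_correct in records:
        totals[subject][0] += int(is_correct)
        totals[subject][1] += 1
    return {subject: c / n for subject, (c, n) in totals.items()}


print(extract_choice("The answer is (B)."))  # B
print(accuracy_by_subject([("Physics", True), ("Physics", False)]))  # {'Physics': 0.5}
```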

## Core Framework (`entropy_framework.py`)

### Classes

#### `EntropyCalculator`
- Calculates Shannon entropy from token logprobs
- Supports configurable top-k token selection
- Provides both single-token and sequence-level entropy
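
For orientation, the calculation can be sketched in a few lines. This is a simplified stand-in for `EntropyCalculator`, not its actual code. It assumes that the input holds one list of top-k logprobs (natural log) per generated token, that the truncated top-k probabilities are renormalized to sum to 1, that entropy is reported in bits, and that sequence-level entropy is the mean of the per-token values. The function names are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(top_logprobs: Sequence[float], k: int = 20) -> float:
    """Shannon entropy (bits) over the k most likely tokens at one position."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(top_logprobs, reverse=True)[:k]
    if not top:
        return 0.0
    probs = [math.exp(lp) for lp in top]
    total = sum(probs)
    # Assumption: renormalize so the truncated top-k mass sums to 1.
    probs = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def sequence_entropy(token_logprobs: Sequence[Sequence[float]], k: int = 20) -> float:
    """Mean per-token entropy of a sequence (mean aggregation is an assumption)."""
    if not token_logprobs:
        return 0.0
    return sum(token_entropy(pos, k) for pos in token_logprobs) / len(token_logprobs)


# A uniform distribution over 4 tokens carries 2 bits of entropy;
# a position with a single certain token carries 0 bits.
uniform = [math.log(0.25)] * 4
print(round(sequence_entropy([uniform, [0.0]]), 6))  # 1.0
```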

#### `ThresholdCalculator`
- Implements four threshold methods from the paper
- Provides statistical validation and effect size calculation
- Supports few-shot calibration

#### `EarlyStoppingFramework`
- Main framework class combining entropy calculation and threshold-based decisions
- Handles calibration and evaluation
- Provides confidence estimates

### Key Methods

```python
# Calculate entropy from logprobs
entropy = entropy_calc.calculate_sequence_entropy(token_logprobs)

# Calibrate threshold using validation data
framework.calibrate(calibration_data, method="entropy_mean")

# Make stopping decision
result = framework.should_stop_early(token_logprobs, method="entropy_mean")
```

## Threshold Methods

1. **Entropy Mean**: Conservative baseline using mean entropy of correct responses
2. **Information-Theoretic Optimal**: Uses logarithmic scaling with effect size
3. **Bayesian Optimal**: Minimizes classification error under Gaussian assumptions
4. **Scale-Invariant Universal**: Adapts to different model characteristics
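
To make the first method and the effect size concrete, here is a minimal sketch. It covers only Entropy Mean and Cohen's d with a pooled standard deviation; the other three methods depend on formulas given in the paper and are not reproduced here. The function names are hypothetical, and the sign convention (incorrect minus correct, so that positive values mean incorrect responses have higher entropy) is an assumption.

```python
import math
import statistics
from typing import Sequence


def entropy_mean_threshold(correct_entropies: Sequence[float]) -> float:
    """Entropy Mean: the mean sequence entropy of correct calibration responses."""
    return statistics.fmean(correct_entropies)


def cohens_d(correct: Sequence[float], incorrect: Sequence[float]) -> float:
    """Cohen's d between incorrect and correct entropies (pooled std).

    Each group needs at least two samples so its variance is defined.
    """
    n1, n2 = len(correct), len(incorrect)
    v1 = statistics.variance(correct)
    v2 = statistics.variance(incorrect)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    if pooled == 0.0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.fmean(incorrect) - statistics.fmean(correct)) / pooled


# Usage with made-up calibration entropies.
correct = [1.0, 2.0, 3.0]
incorrect = [3.0, 4.0, 5.0]
print(entropy_mean_threshold(correct))  # 2.0
print(cohens_d(correct, incorrect))     # 2.0
```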

## Visualization Toolkit (`analysis/visualization_toolkit.py`)

Creates publication-quality plots for:

- Entropy distributions (correct vs incorrect)
- Token savings by model/method
- Threshold method comparisons
- Cohen's d effect sizes
- Accuracy breakdowns
- Framework overview diagrams

## Configuration

### API Keys

Edit the experiment templates to add your API keys:

```python
OPENROUTER_API_KEY = "your_openrouter_key"  # Replace with actual key
```

### Model Selection

The framework supports any model accessible through OpenRouter or compatible APIs, provided the model returns token-level logprobs (see Model Compatibility below):

- GPT-4, GPT-3.5
- Claude models
- Open source models (Qwen, Llama, etc.)

### Parameters

Key parameters you can adjust:

- `k`: Number of top tokens for entropy calculation (default: 20)
- `temperature`: Sampling temperature (default: 0.7)
- `max_tokens`: Maximum tokens per reasoning step (default: 8192)
- `num_steps`: Number of reasoning steps (default: 4)
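
One way to keep these settings together when scripting your own runs is a small configuration object. The `ExperimentConfig` class below is a hypothetical helper, not part of the repository; only the four parameter names and their defaults come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the documented parameters and defaults."""

    k: int = 20               # top tokens used for entropy calculation
    temperature: float = 0.7  # sampling temperature
    max_tokens: int = 8192    # maximum tokens per reasoning step
    num_steps: int = 4        # number of reasoning steps

    def __post_init__(self) -> None:
        if self.k < 1 or self.max_tokens < 1 or self.num_steps < 1:
            raise ValueError("k, max_tokens and num_steps must be positive")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")

    @property
    def max_total_tokens(self) -> int:
        """Worst-case token budget if every step uses its full allowance."""
        return self.num_steps * self.max_tokens


config = ExperimentConfig(num_steps=2)
print(config.max_total_tokens)  # 16384
```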

## Expected Results

Based on our paper, you should expect:

- **Token Savings**: 25-50% computational cost reduction
- **Accuracy Preservation**: No significant accuracy loss (Δ-Acc ≈ 0%)
- **Effect Sizes**: Cohen's d > 0.5 for models with entropy discrimination
- **Threshold Accuracy**: 88-100% for well-calibrated models

## Troubleshooting

### Common Issues

1. **API Key Errors**: Ensure your OpenRouter API key is valid and has sufficient credits
2. **Import Errors**: Check that all dependencies are installed via `pip install -r requirements.txt`
3. **Empty Logprobs**: Some models and providers do not return logprobs; check the API documentation for the model you are using
4. **Memory Issues**: For large datasets, process in batches
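
Issues 3 and 4 can be handled with two small helpers, sketched below. Both `has_logprobs` and `batched` are hypothetical names, not functions from this repository. The sketch assumes each example is a dict with a `'logprobs'` entry shaped like the calibration data in Quick Start.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def has_logprobs(example: Dict[str, Any]) -> bool:
    """True if the example has a non-empty logprobs list for every token."""
    logprobs = example.get("logprobs")
    return bool(logprobs) and all(len(position) > 0 for position in logprobs)


def batched(items: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Drop examples without usable logprobs, then process in batches of two.
examples = [
    {"logprobs": [[-0.1, -2.5]], "correct": True},
    {"logprobs": [], "correct": False},
    {"correct": True},
    {"logprobs": [[-0.7]], "correct": False},
    {"logprobs": [[-0.2]], "correct": True},
]
usable = [ex for ex in examples if has_logprobs(ex)]
print([len(batch) for batch in batched(usable, 2)])  # [2, 1]
```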

### Model Compatibility

The framework requires models that support:
- Token-level log probabilities (logprobs)
- Top-k token extraction
- Deterministic reasoning (for evaluation)

### Expected Performance

Models that show strong entropy discrimination:
- GPT-4 and newer reasoning-optimized models
- Models with advanced post-training (RLHF, Constitutional AI)

Models with limited entropy discrimination:
- Base pretrained models
- Standard instruction-tuned models without reasoning optimization

## Citation

If this paper is accepted and you use this code in your research, please cite the accepted version.

## License

This code is provided for research purposes. Please see the paper for full details on the methodology and experimental setup.

## Reproducibility

This research code is provided as supplementary material for our ICLR 2026 submission. The implementation follows the methodology described in the paper and is designed to enable full reproduction of our experimental results.

For questions about the methodology or implementation during the review process, please use the conference's review and discussion mechanisms.

## Acknowledgments

We thank the creators of the AIME and GPQA datasets for providing challenging reasoning benchmarks that enable this research.