# Experiments Directory

This directory contains experiment outputs, results, and analysis from running the entropy-based early stopping framework.

## Directory Structure

```
experiments/
├── README.md                           # This file
├── aime24/                            # AIME 2024 experiment results
├── aime25/                            # AIME 2025 experiment results
├── gpqa/                              # GPQA Diamond experiment results
├── cross_model/                       # Cross-model comparison results
├── ablation/                          # Ablation study results
└── visualizations/                    # Generated plots and figures
```

## File Naming Convention

Results are automatically saved with descriptive names and timestamps:

- **Format**: `{dataset}_{model}_{timestamp}.json`
- **Example**: `aime24_gpt-4_20240315_143022.json`
- **Plots**: `{analysis_type}_{model}_{dataset}.pdf/png`
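
The result-file convention above can be built and parsed mechanically. The following is a minimal sketch, not the framework's own code: `build_result_name`, `parse_result_name`, and `TIMESTAMP_FORMAT` are hypothetical names, and the timestamp layout is inferred from the example filename. It also assumes dataset names contain no underscores, so anything between the dataset and the timestamp is treated as the model name.

```python
from datetime import datetime
from pathlib import Path

# Assumed from the example "aime24_gpt-4_20240315_143022.json"
TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"


def build_result_name(dataset: str, model: str, when: datetime) -> str:
    """Return a filename of the form '{dataset}_{model}_{timestamp}.json'."""
    return f"{dataset}_{model}_{when.strftime(TIMESTAMP_FORMAT)}.json"


def parse_result_name(filename: str) -> dict:
    """Split a result filename into dataset, model, and timestamp.

    The dataset is the first underscore-separated field and the timestamp
    is the last two; the remainder is the model name, so model names that
    contain underscores or dots still parse.
    """
    stem = Path(filename).stem
    dataset, rest = stem.split("_", 1)
    model, date_part, time_part = rest.rsplit("_", 2)
    when = datetime.strptime(f"{date_part}_{time_part}", TIMESTAMP_FORMAT)
    return {"dataset": dataset, "model": model, "timestamp": when}


info = parse_result_name("aime24_gpt-4_20240315_143022.json")
print(info["dataset"], info["model"], info["timestamp"].isoformat())
```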

## Result File Structure

Each experiment generates a comprehensive JSON file containing:

```json
{
  "experiment_info": {
    "model": "model_name",
    "dataset": "dataset_name", 
    "timestamp": "ISO_timestamp",
    "parameters": {...}
  },
  "accuracy_metrics": {
    "step1_accuracy": 0.75,
    "final_accuracy": 0.82,
    "threshold_accuracy": 0.95
  },
  "entropy_statistics": {
    "correct_entropies": {...},
    "incorrect_entropies": {...},
    "effect_size": {...}
  },
  "threshold_analysis": {
    "entropy_mean": {...},
    "information_theoretic": {...},
    "bayesian": {...},
    "scale_invariant": {...}
  },
  "problem_data": [...],
  "summary": {...}
}
```
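
Before analyzing a result file, it can help to confirm that every top-level section listed above is present. This is a small sketch under that assumption; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical helpers, not part of the framework.

```python
# Top-level keys taken from the result file structure shown above
REQUIRED_SECTIONS = (
    "experiment_info",
    "accuracy_metrics",
    "entropy_statistics",
    "threshold_analysis",
    "problem_data",
    "summary",
)


def missing_sections(results: dict) -> list:
    """Return the required top-level sections that are absent from results."""
    return [key for key in REQUIRED_SECTIONS if key not in results]


partial = {"experiment_info": {}, "summary": {}}
print(missing_sections(partial))
```

An empty list means the file has the expected shape at the top level; the nested contents are not checked here.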

## Key Metrics Tracked

### Accuracy Metrics
- **Step-1 Accuracy**: Accuracy when using only the first reasoning step
- **4-Step Sequential Accuracy**: Accuracy after the full four-step reasoning process
- **Threshold Accuracy**: Accuracy on the questions whose entropy falls below the threshold

### Efficiency Metrics
- **Token Savings**: Percentage reduction in token usage, the proxy for computational cost
- **Early Stop Rate**: Percentage of questions stopped early
- **Threshold Coverage**: Percentage of correct answers whose entropy falls below the threshold
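
One plausible way to compute the efficiency metrics above from per-problem records is sketched below. The record fields (`entropy`, `step1_correct`, `step1_tokens`, `full_tokens`) are hypothetical, since the layout of `problem_data` is not specified here. The sketch also assumes that a problem stops early exactly when its entropy is below the threshold, and that `full_tokens` is the total cost of running all reasoning steps.

```python
def efficiency_metrics(problems: list, threshold: float) -> dict:
    """Compute early stop rate, token savings, and threshold coverage (in %).

    Each record is assumed to carry: 'entropy', 'step1_correct',
    'step1_tokens', and 'full_tokens' (hypothetical field names).
    """
    if not problems:
        raise ValueError("need at least one problem record")

    stopped = [p for p in problems if p["entropy"] < threshold]
    correct = [p for p in problems if p["step1_correct"]]

    # Baseline: every problem runs the full reasoning process.
    full_cost = sum(p["full_tokens"] for p in problems)
    # With early stopping: stopped problems pay only for step 1.
    actual_cost = sum(
        p["step1_tokens"] if p["entropy"] < threshold else p["full_tokens"]
        for p in problems
    )

    covered = sum(1 for p in correct if p["entropy"] < threshold)
    return {
        "early_stop_rate": 100 * len(stopped) / len(problems),
        "token_savings": 100 * (1 - actual_cost / full_cost),
        "threshold_coverage": 100 * covered / len(correct) if correct else 0.0,
    }


demo = [
    {"entropy": 0.1, "step1_correct": True, "step1_tokens": 100, "full_tokens": 400},
    {"entropy": 0.2, "step1_correct": True, "step1_tokens": 100, "full_tokens": 400},
    {"entropy": 0.9, "step1_correct": True, "step1_tokens": 100, "full_tokens": 400},
    {"entropy": 1.5, "step1_correct": False, "step1_tokens": 100, "full_tokens": 400},
]
print(efficiency_metrics(demo, threshold=0.5))
```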

### Statistical Validation
- **Cohen's d**: Effect size between correct/incorrect entropy distributions
- **Statistical Significance**: p-values from t-tests
- **Confidence Intervals**: Bootstrap confidence intervals for metrics
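
As a rough illustration of two of these measures, the sketch below computes Cohen's d with a pooled standard deviation and a percentile bootstrap confidence interval using only the standard library. Both choices are assumptions: the framework may use a different effect-size variant or bootstrap method. The function names are hypothetical, and p-values (for example via `scipy.stats.ttest_ind`) are left out.

```python
import random
import statistics


def cohens_d(a: list, b: list) -> float:
    """Cohen's d between two samples, using the pooled standard deviation.

    Each sample needs at least two values so its variance is defined.
    """
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    estimates = sorted(
        stat(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = estimates[int(alpha / 2 * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy entropies: incorrect answers are shifted upward by 1.0
incorrect = [2.0, 4.0, 6.0]
correct = [1.0, 3.0, 5.0]
print(cohens_d(incorrect, correct))
print(bootstrap_ci([1.0, 2.0, 3.0, 4.0, 5.0] * 4))
```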

## Analysis Types

### 1. Individual Model Analysis
- Single model performance on one dataset
- Threshold calibration and validation
- Entropy distribution analysis

### 2. Cross-Model Comparison
- Performance across different model sizes/families
- Consistency of entropy-based confidence
- Scaling behavior analysis

### 3. Cross-Dataset Validation
- Generalization across reasoning domains
- Mathematical vs scientific reasoning
- Domain-specific entropy patterns

### 4. Ablation Studies
- Top-k logprobs parameter sensitivity
- Threshold method effectiveness
- Sequential reasoning step analysis

## Generated Visualizations

The framework automatically creates publication-quality plots:

### Entropy Analysis
- `entropy_distributions_{model}_{dataset}.pdf/png`
- `cohens_d_comparison_{analysis}.pdf/png`
- `statistical_significance_{analysis}.pdf/png`

### Performance Metrics
- `token_savings_comparison.pdf/png`
- `accuracy_breakdown_{model}.pdf/png`
- `threshold_method_comparison.pdf/png`

### Framework Overview
- `framework_overview_diagram.pdf/png`
- `comprehensive_dashboard_{experiment}.pdf/png`

## Usage Examples

### View Experiment Results
```python
import json

# Load experiment results
with open('experiments/aime24/aime24_gpt-4_20240315_143022.json', 'r') as f:
    results = json.load(f)

# Print summary
print(f"Model: {results['experiment_info']['model']}")
print(f"Token Savings: {results['summary']['best_token_savings']}")
print(f"Threshold Accuracy: {results['summary']['best_threshold_accuracy']}")
```

### Analyze Multiple Experiments
```python
import json
from glob import glob

import pandas as pd

# Load all AIME24 results
result_files = glob('experiments/aime24/*.json')
summary_data = []

for file in result_files:
    with open(file, 'r') as f:
        data = json.load(f)
    summary_data.append({
        'model': data['experiment_info']['model'],
        'token_savings': data['summary']['best_token_savings'],
        'accuracy': data['summary']['best_threshold_accuracy']
    })

df = pd.DataFrame(summary_data)
print(df.describe())
```

### Create Custom Visualizations
```python
import json

from analysis.visualization_toolkit import EntropyVisualization

# Load results (substitute the path of an actual result file)
with open('experiments/aime24/results.json', 'r') as f:
    results = json.load(f)

# Create visualizations; the key matches "entropy_statistics" in the result file
viz = EntropyVisualization()
viz.plot_entropy_distributions(
    results['entropy_statistics']['correct_entropies']['values'],
    results['entropy_statistics']['incorrect_entropies']['values'],
    save_path='experiments/visualizations/custom_entropy_dist'
)
```

## Best Practices

### Running Experiments
1. **Start Small**: Test with a few problems before running full experiments
2. **Check API Limits**: Monitor API usage and rate limits
3. **Save Incrementally**: Large experiments should save progress periodically
4. **Version Control**: Track changes to experiment parameters
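
The "Save Incrementally" practice could look roughly like the sketch below. It is an illustration only: `save_checkpoint`, `load_checkpoint`, `run_with_checkpoints`, and the checkpoint layout are hypothetical, and `solve` stands in for whatever function evaluates a single problem. Writing to a temporary file and renaming it avoids leaving a half-written checkpoint if the run is interrupted.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to path via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Load a previous checkpoint, or start with an empty one."""
    if not os.path.exists(path):
        return {"completed": {}}
    with open(path) as f:
        return json.load(f)


def run_with_checkpoints(problem_ids, solve, path, every=10):
    """Solve each problem once, saving progress every `every` problems."""
    state = load_checkpoint(path)
    for i, pid in enumerate(problem_ids, 1):
        key = str(pid)  # JSON object keys are always strings
        if key in state["completed"]:
            continue  # already done in an earlier run
        state["completed"][key] = solve(pid)
        if i % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save covers any remainder
    return state
```

Rerunning `run_with_checkpoints` with the same path skips problems that are already recorded, so an interrupted experiment resumes where it stopped.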

### Analyzing Results
1. **Compare Baselines**: Always compare against step-1 and full reasoning baselines
2. **Statistical Validation**: Check significance of entropy discrimination
3. **Effect Sizes**: Report Cohen's d for interpretable effect sizes
4. **Cross-Validation**: Validate thresholds on held-out data

### Reporting Results
1. **Include All Metrics**: Report accuracy, efficiency, and statistical measures
2. **Show Distributions**: Include entropy distribution plots
3. **Error Analysis**: Analyze failure cases and edge conditions
4. **Reproducibility**: Save all parameters and random seeds

## Troubleshooting

### Common Issues
- **Missing Results**: Check experiment logs for API errors or timeouts
- **Low Effect Sizes**: Some models may show little or no entropy separation between correct and incorrect answers
- **Accuracy Drops**: Verify that the threshold was calibrated on data representative of the evaluation set
- **Visualization Errors**: Ensure matplotlib/seaborn are properly installed

### Performance Optimization
- **Batch Processing**: Process multiple problems in parallel where possible
- **Caching**: Cache expensive computations like entropy calculations
- **Memory Management**: Clear large data structures after processing
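
As an illustration of the caching suggestion, the sketch below memoizes an entropy computation with `functools.lru_cache`. The entropy definition is an assumption, not the framework's documented formula: Shannon entropy (in nats) over top-k log-probabilities, renormalized so the k probabilities sum to one. `topk_entropy` is a hypothetical name, and the input must be a tuple because cached arguments need to be hashable.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def topk_entropy(logprobs: tuple) -> float:
    """Shannon entropy (nats) of top-k log-probabilities, renormalized.

    Assumes at least one finite log-probability so the total is positive.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


two_way = (math.log(0.5), math.log(0.5))
print(topk_entropy(two_way))       # computed
print(topk_entropy(two_way))       # served from the cache
print(topk_entropy.cache_info())
```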

This directory serves as a comprehensive record of all experimental work, enabling reproducible research and thorough analysis of the entropy-based early stopping framework.