# Can LLMs Reliably Evaluate Themselves? A Probabilistic VC Framework

This repository contains the implementation and analysis tools for measuring Probabilistic VC (PVC) and Calibration-aware PVC (C-PVC) dimensions of language models on mathematical reasoning tasks.

## Repository Structure

```
iclr2026/
├── pvc-cpvc-estimator/     # Main experiment framework
│   ├── main.py             # CLI entry point for experiments
│   ├── models.py           # Model implementations (HF + Bedrock)
│   ├── judges.py           # Judge ensemble for evaluation
│   ├── evaluation.py       # Solution generation and evaluation
│   ├── experiment.py       # Experiment orchestration
│   ├── utils.py            # Utility functions
│   ├── requirements.txt    # Dependencies
│   ├── README.md           # Detailed usage guide
│   └── *.jsonl            # Sample problem datasets
└── results/                # Analysis and visualization tools
    ├── pvc_analysis.py     # Individual dataset analysis
    ├── cross_dataset_analysis.py  # Cross-dataset comparison
    ├── requirements.txt    # Analysis dependencies
    ├── README.md           # Analysis guide
    └── *.csv              # Generated analysis results
```

## Quick Start

### 1. Setup Environment

```bash
# Clone repository
git clone <repository-url>
cd iclr2026

# Setup experiment framework
cd pvc-cpvc-estimator
pip install -r requirements.txt

# Setup analysis tools
cd ../results
pip install -r requirements.txt
```

### 2. Run Experiments

```bash
cd ../pvc-cpvc-estimator  # step 1 leaves you in results/

# Basic experiment with Qwen model and multiple judges
python main.py \
  --model Qwen/Qwen2.5-7B-Instruct \
  --judges c37_sonnet nova_premier deepseek_r1 \
  --problems math_benchmark.jsonl \
  --output ../results

# Use the Hugging Face MATH-500 dataset (downloaded automatically)
python main.py \
  --model OpenThinker2-7B \
  --judges c37_sonnet \
  --problems math-500 \
  --output ../results
```

### 3. Analyze Results

```bash
cd ../results

# Analyze individual dataset (edit script to select dataset)
python pvc_analysis.py

# Cross-dataset comparison (requires multiple sweep tables)
python cross_dataset_analysis.py
```

## Key Concepts

### Probabilistic VC Dimension (PVC)
- Measures the number of problem categories a model can reliably solve
- Counts a category only if the model's accuracy on it exceeds the confidence threshold γ
- Provides theoretical sample complexity bounds
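
The counting rule above can be sketched in a few lines. This is a minimal illustration, not the code in `pvc_analysis.py`: the function name `pvc_dimension` and the `(category, is_correct)` input shape are assumptions made for the example.

```python
from collections import defaultdict

def pvc_dimension(records, gamma=0.6):
    """Count categories whose empirical accuracy strictly exceeds gamma.

    `records` is an iterable of (category, is_correct) pairs; this input
    shape is an assumption made for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return sum(1 for c in totals if correct[c] / totals[c] > gamma)

# Toy data: three categories with accuracies 2/3, 1/2 and 1.
records = [
    ("algebra", True), ("algebra", True), ("algebra", False),
    ("geometry", True), ("geometry", False),
    ("number_theory", True), ("number_theory", True),
]
print(pvc_dimension(records, gamma=0.6))  # 2
```

Raising γ can only shrink the count, which is why the analysis reports PVC as a curve over γ rather than a single number.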

### Calibration-aware PVC (C-PVC)
- Extends PVC with calibration constraints
- Requires both accuracy > γ AND calibration error ≤ τ
- Accounts for model overconfidence in capability assessment
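
A minimal sketch of the two-condition rule follows. The per-category calibration error used here, the absolute gap between mean confidence and accuracy, is an assumption chosen for illustration; the repository may define it differently (for example, as a binned ECE). The name `cpvc_dimension` and the record layout are likewise hypothetical.

```python
from collections import defaultdict

def cpvc_dimension(records, gamma=0.6, tau=0.25):
    """Count categories with accuracy > gamma AND calibration error <= tau.

    `records` holds (category, confidence, is_correct) triples.
    Calibration error is ASSUMED to be |mean confidence - accuracy|.
    """
    by_category = defaultdict(list)
    for category, confidence, is_correct in records:
        by_category[category].append((confidence, int(is_correct)))

    count = 0
    for rows in by_category.values():
        accuracy = sum(y for _, y in rows) / len(rows)
        mean_conf = sum(p for p, _ in rows) / len(rows)
        calib_error = abs(mean_conf - accuracy)
        if accuracy > gamma and calib_error <= tau:
            count += 1
    return count

records = [
    ("algebra", 0.8, True), ("algebra", 0.7, True), ("algebra", 0.9, False),
    ("geometry", 0.95, True), ("geometry", 0.95, False),
    ("number_theory", 0.6, True), ("number_theory", 0.5, True),
]
print(cpvc_dimension(records, gamma=0.6, tau=0.25))  # 1
```

In this toy data `number_theory` passes the accuracy test but is dropped because its confidence sits far from its accuracy, so the C-PVC count is lower than the PVC count would be.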

### Self-Evaluation
- Each model generates two solutions per problem using different prompting strategies
- The model then judges which of its two solutions is better and reports a confidence score
- Its choice is compared against an external judge ensemble, which serves as ground truth
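
The comparison step can be sketched as follows. The field names `self_choice` and `judge_votes`, the `"A"`/`"B"` labels, and the decision to skip tied votes are all assumptions for illustration; `judges.py` may resolve ties differently.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common vote, or None if there is no clear winner."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie for first place
    return counts[0][0]

def self_eval_agreement(items):
    """Fraction of items where the model's pick matches the judges' majority.

    Items whose judge vote is tied or empty are skipped (an assumption).
    """
    pairs = [(it["self_choice"], majority_vote(it["judge_votes"])) for it in items]
    decided = [(s, j) for s, j in pairs if j is not None]
    if not decided:
        return 0.0
    return sum(s == j for s, j in decided) / len(decided)

items = [
    {"self_choice": "A", "judge_votes": ["A", "A", "B"]},  # agrees
    {"self_choice": "B", "judge_votes": ["A", "B", "A"]},  # disagrees
    {"self_choice": "A", "judge_votes": ["B", "B", "B"]},  # disagrees
    {"self_choice": "B", "judge_votes": ["B", "A", "B"]},  # agrees
]
print(self_eval_agreement(items))  # 0.5
```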

## Supported Models

### Hugging Face Transformers
- Qwen2.5 series (7B, 7B-Instruct, Math-7B-Instruct)
- Llama-3.1-8B-Instruct
- OpenThinker2-7B, DeepSeek-R1-Distill-Qwen-7B
- Bespoke-Stratos-7B, JiuZhang3.0-7B
- Ministral-8B-Instruct-2410
- Open-Reasoner-Zero-7B, s1.1-7B

### AWS Bedrock Models
- Claude 3.7 Sonnet (`c37_sonnet`)
- Amazon Nova Premier (`nova_premier`)
- DeepSeek R1 (`deepseek_r1`)

## Datasets

### Included Problem Sets
- **math_benchmark.jsonl**: Custom mathematical problems across categories
- **CSQA_sampled.jsonl**: Sampled subset of CommonsenseQA (commonsense reasoning questions)
- **truthfulQA_sampled.jsonl**: Sampled subset of TruthfulQA (truthfulness questions)

### External Datasets
- **math-500**: HuggingFace MATH-500 dataset (auto-downloaded)

## Analysis Capabilities

### Individual Dataset Analysis
- PVC/C-PVC dimension calculation across parameter sweeps
- Calibration error analysis by category and model
- Expected Calibration Error (ECE) and Brier score computation
- 3D visualization of C-PVC surfaces across γ-τ parameter space
- Sample complexity estimation based on PVC theory
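
For reference, the two calibration metrics named above have standard textbook definitions, sketched here in plain Python. The equal-width binning with ten bins is a common default and an assumption; the bin count used by `pvc_analysis.py` is not stated in this README.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(confidences)

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    n = len(confidences)
    ece = 0.0
    for rows in bins:
        if not rows:
            continue
        avg_conf = sum(p for p, _ in rows) / len(rows)
        accuracy = sum(y for _, y in rows) / len(rows)
        ece += (len(rows) / n) * abs(avg_conf - accuracy)
    return ece

confidences = [0.9, 0.9, 0.6, 0.6]
outcomes = [1, 0, 1, 1]
print(round(brier_score(confidences, outcomes), 3))                 # 0.285
print(round(expected_calibration_error(confidences, outcomes), 3))  # 0.4
```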

### Cross-Dataset Analysis
- Performance averaging across Math360, TruthfulQA, and CSQA
- Cross-domain generalization assessment
- Unified model ranking and comparison
- Volume Under Surface (VUS) metrics for PVC and C-PVC
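
One plausible way to compute a Volume Under Surface is to integrate the swept surface over the γ-τ grid with the trapezoid rule, as sketched below. This README does not define VUS precisely, so the integration rule, the lack of normalization, and the function names are all assumptions.

```python
def trapezoid(ys, xs):
    """Trapezoid-rule integral of samples ys taken at positions xs."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
        for i in range(len(xs) - 1)
    )

def volume_under_surface(surface, gammas, taus):
    """Integrate surface[i][j], sampled at (gammas[i], taus[j]), over the grid."""
    # Integrate along tau for each fixed gamma, then along gamma.
    inner = [trapezoid(row, taus) for row in surface]
    return trapezoid(inner, gammas)

gammas = [0.0, 0.5, 1.0]
taus = [0.0, 0.5, 1.0]
flat = [[2 for _ in taus] for _ in gammas]  # constant surface of height 2
print(volume_under_surface(flat, gammas, taus))  # 2.0
```

Dividing the result by the grid area and by the number of categories would give a value in [0, 1] that is easier to compare across datasets; whether the analysis scripts normalize this way is not stated here.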

### Visualization Outputs
- Category-wise accuracy and calibration plots
- PVC dimension trends across confidence thresholds
- 3D parameter sweep surfaces for C-PVC analysis
- Cross-dataset performance comparison charts
- AUC scatter plots for PVC vs C-PVC comparison

## Key Features

### Experiment Framework
- **Multi-Judge Evaluation**: Ensemble of independent judges (three in the first Quick Start example) combined by majority vote to approximate ground truth
- **Dual Solution Generation**: Two different prompting strategies per problem
- **Comprehensive Logging**: Detailed execution logs with timestamps
- **Flexible Problem Loading**: Support for JSONL files and HuggingFace datasets
- **Robust Error Handling**: Retry logic for API calls and parsing failures

### Analysis Pipeline
- **Parameter Sweep Analysis**: 10K+ γ-τ combinations per model
- **Parallel Processing**: Multiprocessing for efficient computation
- **Incremental Analysis**: Resume from existing parameter sweep files
- **Consistent Visualization**: Unified color schemes and model ordering
- **Export Capabilities**: CSV outputs for further analysis
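
A parameter sweep of this size can be sketched as a 100 × 100 grid over γ and τ. The sketch assumes per-category accuracy and calibration error have already been computed; the names `cpvc_count` and `sweep`, the `[0, 1]` grid range, and the toy statistics are hypothetical.

```python
def cpvc_count(category_stats, gamma, tau):
    """Number of categories with accuracy > gamma and calibration error <= tau."""
    return sum(
        1 for accuracy, calib_error in category_stats.values()
        if accuracy > gamma and calib_error <= tau
    )

def sweep(category_stats, n_gamma=100, n_tau=100):
    """Evaluate the C-PVC count on an evenly spaced grid over [0, 1] x [0, 1]."""
    gammas = [i / (n_gamma - 1) for i in range(n_gamma)]
    taus = [j / (n_tau - 1) for j in range(n_tau)]
    surface = [[cpvc_count(category_stats, g, t) for t in taus] for g in gammas]
    return gammas, taus, surface

# Toy per-category (accuracy, calibration error) pairs.
category_stats = {
    "algebra": (0.9, 0.05),
    "geometry": (0.7, 0.30),
    "probability": (0.4, 0.10),
}
gammas, taus, surface = sweep(category_stats)
print(len(gammas) * len(taus))  # 10000
print(surface[0][-1])           # 3 (gamma = 0, tau = 1: every category passes)
```

Each row of `surface` depends only on its own γ, so the rows could be handed to `multiprocessing.Pool.map` to parallelize the sweep.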

## Research Applications

### Model Evaluation
- Assess mathematical reasoning capabilities across problem categories
- Measure confidence calibration and self-evaluation accuracy
- Compare performance between different model architectures
- Evaluate impact of instruction tuning and mathematical specialization

### Theoretical Analysis
- Validate PVC dimension theory on language models
- Study relationship between calibration and capability assessment
- Analyze sample complexity requirements for reliable evaluation
- Investigate cross-domain generalization patterns

### Practical Insights
- Identify model strengths and weaknesses by mathematical domain
- Quantify overconfidence and underconfidence patterns
- Guide model selection for mathematical reasoning applications
- Inform training strategies for improved calibration

## Configuration

### Default Parameters
- **γ (gamma)**: 0.6 (confidence threshold)
- **τ (tau)**: 0.25 (calibration tolerance)
- **Voting method**: majority (for judge ensemble)
- **Max problems**: 500 per category
- **Random seed**: 13579 (for reproducibility)

### AWS Configuration
For Bedrock models, ensure AWS credentials are configured:
```bash
aws configure
# or set: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
```

## Performance Considerations

### Computational Requirements
- **GPU recommended** for Hugging Face models (automatic device detection)
- **Memory usage** scales with model size and problem count
- **API rate limits** handled with exponential backoff for Bedrock models
- **Parallel processing** available for parameter sweep analysis
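
Exponential backoff for rate-limited API calls typically looks like the sketch below. It is a generic pattern, not the code in `models.py`: the helper name, the retry count, and the delay values are assumptions, and a real implementation would catch only throttling errors rather than every exception.

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on failure with exponentially growing delays.

    The delay doubles on each attempt, is capped at max_delay, and gets up
    to 10% random jitter so parallel workers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:  # illustration only; narrow this in real code
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))

# Demo: a call that fails twice before succeeding. Sleeping is disabled
# here so the example runs instantly.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

print(call_with_backoff(flaky, sleep=lambda seconds: None))  # ok
```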

### Optimization Tips
- Use `--max-problems` to limit dataset size for faster iteration
- Set `--seed` for reproducible experiments
- Monitor log files for detailed execution information
- Cache parameter sweep results to avoid recomputation

## Contributing

### Adding New Models
1. For Hugging Face: Pass the model name to `--model` on the command line; no code changes are needed
2. For Bedrock: Add model ID to `BEDROCK_MODEL_IDS` in `models.py`
3. For custom APIs: Inherit from `LLMModel` class

### Extending Analysis
1. Add new metrics to `pvc_analysis.py` calculation functions
2. Include in comprehensive table generation
3. Update visualization routines with consistent styling
4. Document new features in respective README files

### Dataset Integration
1. Ensure JSONL format with required fields: `problem`, `answer`, `category`, `difficulty`
2. Update `load_math_problems()` in `utils.py` for new data sources
3. Add dataset name mapping in analysis scripts
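
Before wiring in a new dataset, it can help to check that every record carries the four required fields. The validator below is a standalone sketch and not part of `utils.py`; the sample values, including the type used for `difficulty`, are illustrative assumptions.

```python
import json

REQUIRED_FIELDS = ("problem", "answer", "category", "difficulty")

def validate_jsonl_lines(lines):
    """Return a list of (line_number, message) for records that fail checks."""
    errors = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((lineno, "record is not a JSON object"))
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            errors.append((lineno, "missing fields: " + ", ".join(missing)))
    return errors

sample = [
    '{"problem": "2 + 2 = ?", "answer": "4", "category": "arithmetic", "difficulty": 1}',
    '{"problem": "Solve x^2 = 9", "answer": "3 or -3"}',
    '{not valid json}',
]
for lineno, message in validate_jsonl_lines(sample):
    print(f"line {lineno}: {message}")
```

To check a file on disk, pass the open file object directly, for example `validate_jsonl_lines(open("my_dataset.jsonl", encoding="utf-8"))`, where `my_dataset.jsonl` is a placeholder name.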

## Citation

If you use this code in your research, please cite:

```bibtex
@misc{pvc-cpvc-2026,
  title={Can LLMs Reliably Evaluate Themselves? A Probabilistic VC Framework},
  author={Anonymous},
  year={2026},
  note={Under review}
}
```

## License

[License information to be added]

## Contact

Anonymous submission for review