# PVC-CPVC Analysis Results

This directory contains analysis scripts and tools for processing the raw JSONL output from the PVC-CPVC estimator experiments.

## Overview

The analysis tools provide comprehensive evaluation of model performance across multiple dimensions:
- **PVC (Probabilistic VC) Dimension**: Measures model capability across problem categories
- **C-PVC (Calibration-aware PVC) Dimension**: PVC with calibration constraints
- **Calibration Analysis**: Model confidence vs actual accuracy
- **Cross-Dataset Analysis**: Performance comparison across different datasets

## Files

### Analysis Scripts
- **`pvc_analysis.py`**: Main analysis script for individual datasets
- **`cross_dataset_analysis.py`**: Cross-dataset comparison and averaging
- **`problem_evaluations_*.csv`**: Processed evaluation results per dataset (input data read by the scripts, not a script itself)

### Generated Outputs
- **Parameter sweep tables**: `*_parameter_sweep_table.csv`
- **Visualization plots**: `*.png` files
- **Comprehensive metrics**: `*_comprehensive_table_*.csv`

## Installation

```bash
cd results/
pip install -r requirements.txt
```

## Usage

### Individual Dataset Analysis

```bash
# Analyze a specific dataset
python pvc_analysis.py
```

**Note**: Edit the `output_csv` variable at the top of `pvc_analysis.py` to specify which dataset to analyze:
```python
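# Choose one of the following and leave exactly one line uncommented;
# if several assignments are active, only the last one takes effect.
output_csv = "problem_evaluations_mathbenchmark.csv"
# output_csv = "problem_evaluations_truthfulQA.csv"
# output_csv = "problem_evaluations_CSQA.csv"
# output_csv = "problem_evaluations_math500.csv"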
```

### Cross-Dataset Analysis

```bash
# Compare performance across all datasets
python cross_dataset_analysis.py
```

**Prerequisites**: Run the individual dataset analysis first, then ensure the following parameter sweep files exist:
- `mathbenchmark_parameter_sweep_table.csv`
- `truthfulQA_parameter_sweep_table.csv`
- `CSQA_parameter_sweep_table.csv`
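
A quick way to confirm the prerequisites before launching the cross-dataset script is sketched below. The helper name `missing_sweep_tables` is hypothetical and not part of the repository; only the three file names come from the list above.

```python
from pathlib import Path

# File names taken from the prerequisites list in this README.
REQUIRED_SWEEP_TABLES = [
    "mathbenchmark_parameter_sweep_table.csv",
    "truthfulQA_parameter_sweep_table.csv",
    "CSQA_parameter_sweep_table.csv",
]


def missing_sweep_tables(results_dir="."):
    """Return the required sweep tables that are absent from results_dir."""
    base = Path(results_dir)
    return [name for name in REQUIRED_SWEEP_TABLES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_sweep_tables()
    if missing:
        print("Missing sweep tables:", ", ".join(missing))
    else:
        print("All sweep tables found.")
```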

## Analysis Features

### PVC Analysis (`pvc_analysis.py`)

#### Key Metrics Calculated
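- **PVC Dimension**: Number of categories whose accuracy exceeds the threshold γ
- **C-PVC Dimension**: Number of categories that meet the PVC criterion and also have calibration error ≤ τ
- **ECE (Expected Calibration Error)**: Average gap between confidence and accuracy across confidence bins, weighted by the share of samples in each bin
- **Brier Score**: Mean squared difference between the stated confidence and the binary correctness outcome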
- **Sample Complexity**: Theoretical sample requirements based on PVC theory
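
The metric definitions above can be illustrated with a minimal pure-Python sketch. It is not the code in `pvc_analysis.py`: the function names are hypothetical, the per-category calibration error is assumed to be the absolute gap between mean confidence and accuracy, and `self_eval_correct` is assumed to be stored as 0/1.

```python
from collections import defaultdict


def category_stats(rows):
    """Per-category accuracy, mean confidence and calibration gap.

    Each row is a dict with `category`, `self_eval_correct` (0/1) and
    `self_eval_confidence` (0..1).
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["category"]].append(
            (float(row["self_eval_correct"]), float(row["self_eval_confidence"]))
        )
    stats = {}
    for category, pairs in grouped.items():
        accuracy = sum(c for c, _ in pairs) / len(pairs)
        confidence = sum(p for _, p in pairs) / len(pairs)
        # Assumed definition: |mean confidence - accuracy| per category.
        stats[category] = {"accuracy": accuracy, "cal_error": abs(confidence - accuracy)}
    return stats


def pvc_dimension(stats, gamma):
    """Count categories with accuracy strictly above gamma."""
    return sum(1 for s in stats.values() if s["accuracy"] > gamma)


def cpvc_dimension(stats, gamma, tau):
    """Count categories above gamma whose calibration error is at most tau."""
    return sum(
        1 for s in stats.values() if s["accuracy"] > gamma and s["cal_error"] <= tau
    )


def brier_score(rows):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum(
        (float(r["self_eval_confidence"]) - float(r["self_eval_correct"])) ** 2
        for r in rows
    ) / len(rows)


def expected_calibration_error(rows, n_bins=10):
    """Sample-weighted mean of |accuracy - confidence| over equal-width bins."""
    bins = defaultdict(list)
    for r in rows:
        conf = float(r["self_eval_confidence"])
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((float(r["self_eval_correct"]), conf))
    ece = 0.0
    for pairs in bins.values():
        accuracy = sum(c for c, _ in pairs) / len(pairs)
        confidence = sum(p for _, p in pairs) / len(pairs)
        ece += len(pairs) / len(rows) * abs(accuracy - confidence)
    return ece
```

The number of bins (`n_bins=10`) is an illustrative default; the binning used by the actual script may differ.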

#### Visualizations Generated
1. **Combined Category Performance**: Bar chart + PVC line plot
2. **Calibration Error by Category**: Model overconfidence/underconfidence analysis
3. **PVC vs C-PVC Scatter**: Comparison of dimensions with/without calibration
4. **3D C-PVC Surfaces**: Parameter sweep visualization across γ-τ grid
5. **AUC Comparison**: Area under curve metrics for PVC and C-PVC

#### Parameter Sweep Analysis
- **Gamma (γ)**: Confidence threshold range [0, 1] in 0.01 steps
- **Tau (τ)**: Calibration tolerance range [0, 1] in 0.01 steps
- **Total combinations**: 101 × 101 = 10,201 parameter pairs per model
- **Parallel processing**: Uses multiprocessing for efficient computation
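
The grid described above can be sketched as follows. This is a sequential illustration with hypothetical names (`threshold_grid`, `sweep_cpvc`), assuming each category is summarized by an (accuracy, calibration error) pair; the multiprocessing used by the script is left out for clarity.

```python
from itertools import product


def threshold_grid(points=101):
    """Thresholds 0.00, 0.01, ..., 1.00, built from integers to avoid float drift."""
    return [i / (points - 1) for i in range(points)]


def sweep_cpvc(category_pairs, gammas=None, taus=None):
    """Evaluate the C-PVC count for every (gamma, tau) pair for one model.

    category_pairs: list of (accuracy, calibration_error), one per category.
    """
    gammas = threshold_grid() if gammas is None else gammas
    taus = threshold_grid() if taus is None else taus
    table = []
    for gamma, tau in product(gammas, taus):
        cpvc = sum(1 for acc, err in category_pairs if acc > gamma and err <= tau)
        table.append({"gamma": gamma, "tau": tau, "cpvc": cpvc})
    return table


if __name__ == "__main__":
    rows = sweep_cpvc([(0.75, 0.0), (0.25, 0.65), (0.75, 0.25)])
    print(len(rows))  # 101 * 101 parameter pairs
```

Because each (γ, τ) pair is independent of the others, the loop body is a natural unit of work to distribute across processes.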

### Cross-Dataset Analysis (`cross_dataset_analysis.py`)

#### Features
- **Dataset Averaging**: Combines results from Math360, TruthfulQA, and CSQA
- **Model Ranking**: Consistent ordering across visualizations
- **Cross-Domain Metrics**: Performance generalization analysis
- **Combined Visualizations**: Unified plots across datasets

#### Generated Outputs
1. **Cross-Dataset Average PVC Plot**: Model performance trends
2. **Cross-Dataset C-PVC 3D Grid**: Calibration-aware performance surfaces
3. **Cross-Domain Comparison**: PVC-VUS vs C-PVC-VUS scatter plot
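
The dataset-averaging step can be sketched as below. The column names `model_id`, `gamma`, `tau`, and `cpvc` are assumptions about the sweep table layout, and `average_sweeps` is a hypothetical helper rather than a function from `cross_dataset_analysis.py`. The sketch also assumes one row per (model, γ, τ) in each dataset and keeps only combinations present in every dataset.

```python
from collections import defaultdict


def average_sweeps(tables):
    """Average C-PVC per (model, gamma, tau) across datasets.

    tables: dict mapping dataset name -> list of row dicts with the
    assumed keys `model_id`, `gamma`, `tau`, `cpvc`.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rows in tables.values():
        for row in rows:
            key = (row["model_id"], float(row["gamma"]), float(row["tau"]))
            sums[key] += float(row["cpvc"])
            counts[key] += 1
    n_datasets = len(tables)
    return [
        {"model_id": key[0], "gamma": key[1], "tau": key[2],
         "mean_cpvc": sums[key] / n_datasets}
        for key in sorted(sums)
        if counts[key] == n_datasets  # drop models missing from any dataset
    ]
```

Rows could be loaded from the sweep CSVs with `csv.DictReader` (or pandas) before being passed in.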

## Output Files

### CSV Files
- `[dataset]_calibration_error_gamma[X]_tau[Y].csv`: Calibration analysis results
- `[dataset]_comprehensive_metrics_table_gamma[X]_tau[Y].csv`: All metrics summary
- `[dataset]_parameter_sweep_table.csv`: Full parameter sweep results (10K+ rows)
- `[dataset]_final_comprehensive_table_gamma[X]_tau[Y].csv`: Final metrics with AUC
- `cross_dataset_averaged_sweep_table.csv`: Cross-dataset averaged results

### Visualization Files
- `[dataset]_combined_model_performance_*.png`: Category accuracy + PVC plots
- `[dataset]_category_calibration_error_*.png`: Calibration error analysis
- `[dataset]_combined_calibration_pvc_plot_*.png`: PVC vs C-PVC comparison
- `[dataset]_all_models_cpvc_3d_grid.png`: 3D surface grid plots
- `[dataset]_auc_pvc_scatter_*.png`: AUC comparison plots
- `cross_dataset_*.png`: Cross-dataset comparison plots

## Key Parameters

### Default Analysis Parameters
- **γ (gamma)**: 0.6 (confidence threshold)
- **τ (tau)**: 0.25 (calibration tolerance)
- **ε (epsilon)**: 0.1 (generalization error bound)
- **δ (delta)**: 0.05 (failure probability for PAC bounds, i.e. a confidence level of 1 − δ = 0.95)
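
This README does not state which bound the scripts use for sample complexity. Purely as an illustration of how ε and δ enter such a calculation, the sketch below evaluates a classical VC-dimension bound for consistent learners (Blumer et al., 1989), with the PVC dimension substituted for the VC dimension `d` as an assumption. The function name `vc_sample_bound` is hypothetical.

```python
import math


def vc_sample_bound(d, epsilon=0.1, delta=0.05):
    """Classical sufficient sample size: max(4/eps*log2(2/delta), 8d/eps*log2(13/eps)).

    d is treated here as the (P)VC dimension; this substitution is an
    assumption and may not match the formula used in pvc_analysis.py.
    """
    if d < 1 or not 0 < epsilon < 1 or not 0 < delta < 1:
        raise ValueError("need d >= 1 and epsilon, delta in (0, 1)")
    confidence_term = (4 / epsilon) * math.log2(2 / delta)
    dimension_term = (8 * d / epsilon) * math.log2(13 / epsilon)
    return math.ceil(max(confidence_term, dimension_term))


if __name__ == "__main__":
    for dim in (1, 3, 5):
        print(dim, vc_sample_bound(dim))
```

The bound grows linearly in `d` and roughly as (1/ε)·log(1/ε), so tightening ε is far more expensive than tightening δ.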

### Supported Models
The analysis supports the following models (with consistent color coding):
- Qwen2.5-7B
- Qwen2.5-7B-Instruct
- Qwen2.5-Math-7B-Instruct
- Llama-3.1-8B-Instruct
- OpenThinker2-7B
- DeepSeek-R1-Distill-Qwen-7B
- Bespoke-Stratos-7B
- JiuZhang3.0-7B
- Ministral-8B-Instruct-2410
- Open-Reasoner-Zero-7B
- s1.1-7B

## Performance Considerations

### Memory Usage
- Parameter sweep analysis requires significant memory for large datasets
- Use multiprocessing with a `max_workers` setting that fits the available memory; lower it if parallel workers exhaust RAM
- Consider processing subsets for very large datasets

### Computation Time
- Full parameter sweep (10K+ combinations) can take 30+ minutes per dataset
- 3D visualization generation adds additional processing time
- Cross-dataset analysis requires all individual sweeps to be completed first

### Optimization Tips
1. **Parallel Processing**: Utilizes `ProcessPoolExecutor` for parameter sweeps
2. **Caching**: Saves intermediate results to avoid recomputation
3. **Incremental Analysis**: Can resume from existing parameter sweep files
4. **Selective Analysis**: Focus on specific γ-τ ranges for faster iteration

## Troubleshooting

### Common Issues
1. **Missing CSV Files**: Ensure JSONL files from experiments are processed first
2. **Memory Errors**: Reduce parameter sweep resolution or use smaller datasets
3. **Plot Generation Failures**: Check matplotlib backend and display settings
4. **Cross-Dataset Errors**: Verify all required sweep tables exist

### Data Requirements
- Input CSV files must contain columns: `model_id`, `category`, `self_eval_correct`, `self_eval_confidence`
- Parameter sweep requires consistent model names across datasets
- Cross-dataset analysis needs at least 2 datasets with overlapping models
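
A small pre-flight check for the required columns could look like the sketch below. It reads only the header row; `missing_columns` is a hypothetical helper, not part of the analysis scripts.

```python
import csv

# Column names taken from the data requirements listed above.
REQUIRED_COLUMNS = {"model_id", "category", "self_eval_correct", "self_eval_confidence"}


def missing_columns(csv_path):
    """Return the required column names absent from the CSV header, sorted."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return sorted(REQUIRED_COLUMNS - set(header))


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        missing = missing_columns(path)
        print(path, "OK" if not missing else f"missing: {missing}")
```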

## Extending the Analysis

### Adding New Metrics
1. Implement calculation functions following existing patterns
2. Add to comprehensive table generation
3. Include in visualization routines

### Custom Visualizations
1. Use existing color schemes and model ordering for consistency
2. Follow matplotlib/seaborn styling conventions
3. Save plots with descriptive filenames and high DPI

### New Datasets
1. Ensure CSV format matches existing structure
2. Add dataset name mapping in analysis scripts
3. Update cross-dataset analysis to include new data sources