# Probabilistic VC Dimension

This tool generates and collects data for measuring the Probabilistic VC (PVC) dimension of language models on mathematical problem-solving tasks. It writes raw data in JSONL format for subsequent post-processing analysis.

## Overview

For each problem, the tool:

1. Generates multiple solution attempts
2. Has the model self-evaluate which solution is better
3. Collects external judge evaluations for comparison
4. Saves detailed results in JSONL format for later analysis

## Key Features

- **Multi-model Support**: Compatible with Hugging Face and AWS Bedrock models
- **Judge Ensemble**: Uses multiple independent judges for reliable evaluations
- **JSONL Output**: Saves detailed problem-by-problem data for flexible post-processing
- **Math Problem Categories**: Organizes problems by category for domain-specific analysis

## Project Structure

```
probabilistic_vc_estimator/
├── __init__.py         # Package initialization
├── models.py           # Model implementations (HF and Bedrock)
├── judges.py           # Judge ensemble implementation
├── utils.py            # Utility functions for logging, data loading
├── evaluation.py       # Solution generation and evaluation functions
├── experiment.py       # Experiment runner and data collection
├── main.py             # CLI entry point
└── requirements.txt    # Dependencies
```

## Usage

### Basic Usage

```bash
python main.py \
  --model Qwen/Qwen2.5-7B-Instruct \
  --judges c37_sonnet nova_premier deepseek_r1 \
  --problems math_benchmark.jsonl \
  --output results
```

### Advanced Usage

```bash
# Test specific categories with custom limits
python main.py \
  --model OpenThinker2-7B \
  --judges c37_sonnet nova_premier \
  --problems CSQA_sampled.jsonl \
  --categories "algebra" "geometry" \
  --min-problems 5 \
  --max-problems 50 \
  --voting-method weighted \
  --seed 42 \
  --output results/custom_experiment

# Use Hugging Face Math-500 dataset
python main.py \
  --model Qwen/Qwen2.5-Math-7B-Instruct \
  --judges c37_sonnet \
  --problems math-500 \
  --output results/math500_experiment
```

### Command-line Arguments

- `--model`: Model ID for solution generation (HF model name or Bedrock model ID)
- `--judges`: List of judge model IDs for evaluation (space-separated)
- `--voting-method`: Judge decision combination method (`majority` or `weighted`)
- `--problems`: Path to a JSONL file, or `math-500` to load the Hugging Face MATH-500 dataset
- `--output`: Base directory to save results
- `--categories`: Specific categories to test (optional, space-separated)
- `--min-problems`: Minimum problems per category (default: 1)
- `--max-problems`: Maximum problems per category (default: 500)
- `--seed`: Random seed for reproducibility (default: 13579)

## Problem Format

The input JSONL file should contain math problems in the following format:

```json
{
  "id": "problem-1",
  "category": "algebra",
  "subcategory": "equations",
  "problem": "Solve for x: 2x + 5 = 15",
  "answer": "x = 5",
  "difficulty": 1
}
```

### Required Fields
- `problem`: The mathematical problem statement
- `answer`: The correct answer or solution
- `category`: Problem category (e.g., "algebra", "geometry", "calculus")
- `difficulty`: Difficulty level (integer)

### Optional Fields
- `id`: Unique problem identifier
- `subcategory`: More specific categorization

### Supported Datasets
- **math_benchmark.jsonl**: Custom mathematical problems
- **CSQA_sampled.jsonl**: CommonsenseQA mathematical subset
- **truthfulQA_sampled.jsonl**: TruthfulQA mathematical problems
- **math-500**: Hugging Face MATH-500 dataset (automatically downloaded)

## Output

The tool writes the following files to the specified results directory:

### JSONL Files (Raw Data)
- `detailed_[category]_[model]_with_[judges].jsonl`: Category-specific detailed results
- `all_detailed_results_[model]_with_[judges].jsonl`: Combined results across all categories

### Log Files
- `pvc_experiment_[timestamp].log`: Detailed execution logs with timestamps

### Directory Structure
```
results/
└── [model_name]_[dataset_name]/
    ├── detailed_algebra_Qwen2.5-7B-Instruct_with_c37_sonnet_nova_premier.jsonl
    ├── detailed_geometry_Qwen2.5-7B-Instruct_with_c37_sonnet_nova_premier.jsonl
    ├── all_detailed_results_Qwen2.5-7B-Instruct_with_c37_sonnet_nova_premier.jsonl
    └── pvc_experiment_20241201-143022.log
```

## Sample Output Record

```json
{
  "problem_id": "problem-1",
  "problem_text": "Solve for x: 2x + 5 = 15",
  "category": "algebra",
  "subcategory": "equations",
  "solution_a": "Step 1: Subtract 5 from both sides...",
  "solution_b": "Let me solve this systematically...",
  "self_evaluation": {
    "selected_solution": "A",
    "confidence": 0.85,
    "full_response": "<ANALYSIS>Solution A is more direct...</ANALYSIS><WINNER>A</WINNER><SCORE>85</SCORE>"
  },
  "judge_evaluation": {
    "selected_solution": "A",
    "confidence": 0.92,
    "vote_ratio": 0.67,
    "judgments": [
      {"better_solution": "A", "confidence": 0.9, "full_response": "..."},
      {"better_solution": "A", "confidence": 0.8, "full_response": "..."},
      {"better_solution": "B", "confidence": 0.7, "full_response": "..."}
    ]
  },
  "correct_answer": "A",
  "self_eval_correct": true,
  "had_reference_answer": true,
  "reference_answer": "x = 5"
}
```
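
The `full_response` field in the record above uses `<WINNER>` and `<SCORE>` tags. As an illustration of how such a response could be turned into the `selected_solution` and `confidence` fields, here is a minimal sketch. The function name `parse_evaluation` is hypothetical, and dividing the score by 100 is an assumption inferred from the sample (a score of 85 next to a confidence of 0.85); the tool's own parser may behave differently.

```python
import re


def parse_evaluation(full_response):
    """Extract the winner and a 0-1 confidence from a tagged response.

    Assumes the <WINNER>/<SCORE> tag layout shown in the sample record.
    """
    winner = re.search(r"<WINNER>\s*([AB])\s*</WINNER>", full_response)
    score = re.search(r"<SCORE>\s*(\d+(?:\.\d+)?)\s*</SCORE>", full_response)
    if winner is None or score is None:
        # Mirror the idea of a parsing_failed flag for malformed responses.
        return {"selected_solution": None, "confidence": None, "parsing_failed": True}
    # Assumption: the score is on a 0-100 scale; clamp to [0, 1] after scaling.
    confidence = min(max(float(score.group(1)) / 100.0, 0.0), 1.0)
    return {
        "selected_solution": winner.group(1),
        "confidence": confidence,
        "parsing_failed": False,
    }
```
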

## Installation

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd iclr2026/pvc-cpvc-estimator
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. For AWS Bedrock models, configure AWS credentials:
   ```bash
   aws configure
   # or set environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
   ```

## How It Works

### Solution Generation Process
1. **Dual Solution Generation**: For each problem, the tool generates two solutions using different prompting strategies:
   - `generate_quality_solution_1st()`: Expert-level reasoning approach
   - `generate_quality_solution_2nd()`: Creative alternative approach

2. **Self-Evaluation**: The model compares both solutions, selects the one it considers better, and reports a confidence score

3. **Judge Evaluation**: An external judge ensemble evaluates the same two solutions; its decision serves as the ground truth

4. **Data Collection**: All evaluations, solutions, and metadata are recorded in a structured format

### Evaluation Metrics
- **PVC Dimension**: Number of categories in which the model reaches the accuracy threshold γ
- **C-PVC Dimension**: PVC dimension with an additional calibration constraint (threshold τ)
- **Self-Evaluation Accuracy**: How often the model correctly identifies the better solution
- **Calibration Error**: Difference between the model's stated confidence and its actual accuracy
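
To make these definitions concrete, the sketch below computes both dimensions from output records shaped like the sample record. It is one plausible reading of the metrics, not the analysis script itself: the function name `pvc_dimensions`, the default values of `gamma` and `tau`, and the use of |mean confidence − accuracy| per category as the calibration error are all assumptions.

```python
from collections import defaultdict


def pvc_dimensions(records, gamma=0.7, tau=0.1):
    """Return (pvc, c_pvc, per_category_stats) for a list of output records.

    gamma and tau defaults are illustrative, not values taken from the tool.
    """
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    pvc = c_pvc = 0
    stats = {}
    for category, rows in by_category.items():
        # Self-evaluation accuracy: fraction of problems judged correctly.
        accuracy = sum(bool(r["self_eval_correct"]) for r in rows) / len(rows)
        mean_confidence = sum(r["self_evaluation"]["confidence"] for r in rows) / len(rows)
        # Assumed calibration error: gap between mean confidence and accuracy.
        calibration_error = abs(mean_confidence - accuracy)
        stats[category] = {"accuracy": accuracy, "calibration_error": calibration_error}
        if accuracy >= gamma:
            pvc += 1
            if calibration_error <= tau:
                c_pvc += 1
    return pvc, c_pvc, stats
```

Under this reading, the C-PVC dimension can never exceed the PVC dimension, since a category must pass the accuracy threshold before the calibration constraint is checked.
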

### Judge Ensemble Methods
- **Majority Voting**: Simple majority rule with tie-breaking by confidence
- **Weighted Voting**: Confidence-weighted decision aggregation
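
The two aggregation rules can be sketched as follows, using judgments shaped like the `judgments` entries in the sample output record. This is an illustration rather than the code in `judges.py`: the function names are hypothetical, and breaking ties by total confidence (with `A` winning an exact tie) is an assumption about what "tie-breaking by confidence" means.

```python
def _tally(judgments):
    """Count votes and sum confidences per solution label."""
    votes = {"A": 0, "B": 0}
    confidence = {"A": 0.0, "B": 0.0}
    for judgment in judgments:
        label = judgment["better_solution"]
        votes[label] += 1
        confidence[label] += judgment["confidence"]
    return votes, confidence


def majority_vote(judgments):
    """Pick the label with more votes; break ties by total confidence."""
    votes, confidence = _tally(judgments)
    if votes["A"] != votes["B"]:
        return "A" if votes["A"] > votes["B"] else "B"
    return "A" if confidence["A"] >= confidence["B"] else "B"


def weighted_vote(judgments):
    """Pick the label with the larger sum of judge confidences."""
    _, confidence = _tally(judgments)
    return "A" if confidence["A"] >= confidence["B"] else "B"
```

The two rules can disagree: two low-confidence votes for `A` win under majority voting but can lose to a single high-confidence vote for `B` under weighted voting.
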

## Post-Processing Analysis

The generated JSONL files can be processed using the analysis tools in the `../results/` directory:

- **pvc_analysis.py**: Comprehensive PVC/C-PVC analysis with visualization
- **cross_dataset_analysis.py**: Cross-dataset comparison and averaging

## Extending the Tool

### Adding New Models
1. **Hugging Face Models**: Pass the model name via `--model`; these models are supported automatically
2. **Bedrock Models**: Add the model ID to `BEDROCK_MODEL_IDS` in `models.py`
3. **Custom Models**: Inherit from the `LLMModel` class and implement the `generate()` method
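
A custom model might look like the sketch below. The `LLMModel` class here is a stand-in defined only so the example runs on its own; the real base class lives in `models.py`, and its constructor and the exact signature of `generate()` may differ. Both the `generate(prompt) -> str` signature and the `EchoModel` class are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class LLMModel(ABC):
    """Stand-in for the base class in models.py (assumed interface)."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's text completion for the given prompt."""


class EchoModel(LLMModel):
    """Toy model that echoes the prompt; useful for dry-running the pipeline."""

    def __init__(self, prefix: str = "Answer: "):
        self.prefix = prefix

    def generate(self, prompt: str) -> str:
        return self.prefix + prompt.strip()


model = EchoModel()
reply = model.generate("  Solve for x: 2x + 5 = 15  ")
```

In the actual project, the subclass would import `LLMModel` from `models.py` instead of redefining it.
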

### Adding New Evaluation Strategies
1. Modify `evaluation.py` to add new solution generation methods
2. Update `experiment.py` to incorporate new evaluation logic
3. Extend judge ensemble methods in `judges.py`

### Custom Problem Formats
1. Update `load_math_problems()` in `utils.py` for new data sources
2. Ensure required fields (`problem`, `answer`, `category`, `difficulty`) are present

## Troubleshooting

### Common Issues
1. **AWS Credentials**: Ensure proper AWS configuration for Bedrock models
2. **Memory Issues**: Use smaller models or reduce `--max-problems` for large datasets
3. **Rate Limits**: Calls to Bedrock models are retried automatically with exponential backoff
4. **Parsing Failures**: Models may occasionally fail to follow the expected output format; such records are marked with the `parsing_failed` flag
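
For readers unfamiliar with the retry pattern mentioned under rate limits, here is a generic sketch of exponential backoff with a small random jitter. It is not the tool's Bedrock client code: `call_with_backoff` and all of its parameters are hypothetical, and a real client would usually catch only throttling errors rather than every exception.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays.

    The sleep argument is injectable so the delay schedule can be inspected.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```
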

### Performance Tips
1. Use GPU acceleration for Hugging Face models
2. Adjust `max_new_tokens` in model generation for longer/shorter responses
3. Use `--seed` parameter for reproducible experiments
4. Monitor log files for detailed execution information