# Unified Evaluation System

This directory contains a unified evaluation system for multiple-choice benchmark datasets. It currently supports MMLU-Redux and MMMLU.

## Features

- **Unified Interface**: Single script to evaluate on multiple benchmarks
- **Multi-GPU Support**: Parallel evaluation across multiple GPUs
- **Multiple Answer Methods**: Support for both `logits` and `generate` methods
- **Chain-of-Thought**: Optional CoT reasoning support
- **Comprehensive Logging**: Detailed statistics and CoT outputs
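
As an illustration of the multi-GPU feature, the sketch below splits subjects across GPUs in round-robin order. This is only one plausible way to shard the work, not necessarily how `unified_evaluator.py` does it. The helper name `assign_subjects_to_gpus` is hypothetical.

```python
from typing import Dict, List


def assign_subjects_to_gpus(subjects: List[str], gpu_ids: List[int]) -> Dict[int, List[str]]:
    """Round-robin split of subjects over GPUs (hypothetical helper, not from the repo)."""
    if not gpu_ids:
        raise ValueError("gpu_ids must contain at least one GPU index")
    shards: Dict[int, List[str]] = {gpu: [] for gpu in gpu_ids}
    for index, subject in enumerate(subjects):
        # Subject i goes to GPU i mod len(gpu_ids), so shard sizes differ by at most one.
        shards[gpu_ids[index % len(gpu_ids)]].append(subject)
    return shards


if __name__ == "__main__":
    print(assign_subjects_to_gpus(["anatomy", "astronomy", "virology"], [0, 1]))
```

Each shard could then be handed to a separate worker process pinned to its GPU.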

## Usage

### Basic Usage

```bash
# Evaluate on MMMLU
python script/evaluation/unified_evaluator.py --config eval_recipe/unified_mmmlu_eval.yaml

# Evaluate on MMLU-Redux
python script/evaluation/unified_evaluator.py --config eval_recipe/unified_mmlu_redux_eval.yaml
```

### Configuration

The evaluation is configured through YAML files with the following structure:

```yaml
# Model configuration
model:
  model_name: model/path  # HuggingFace model or local path
  rosetta_config:  # Only needed for Rosetta models
    base_model: path
    teacher_model: path
    include_response: false

# Output configuration
output:
  output_dir: local/results  # Where to save results

# Evaluation configuration
eval:
  dataset: mmmlu  # or 'mmlu-redux'
  gpu_ids: [0, 1, 2, 3]  # GPUs to use
  answer_method: logits  # or 'generate'
  use_cot: true  # Enable chain-of-thought
  sample_interval: 1  # Keep every Nth example (1 = keep all)
  limit: null  # Maximum examples per subject (null = no limit)
  checkpoints_dir: path  # For Rosetta models
  subjects: []  # Optional: restrict evaluation to specific subjects
```
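
The sketch below shows how the `eval` section could be sanity-checked and how `sample_interval` and `limit` might interact. It works on a plain dict, such as the result of `yaml.safe_load` on a recipe file. The helper names `check_eval_config` and `subsample` are hypothetical. Applying the interval before the limit is an assumption; the actual order in `unified_evaluator.py` may differ.

```python
from typing import Any, Dict, List, Optional, Sequence

ALLOWED_DATASETS = {"mmmlu", "mmlu-redux"}
ALLOWED_METHODS = {"logits", "generate"}


def check_eval_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the `eval` section and return it (hypothetical helper)."""
    eval_cfg = config.get("eval", {})
    if eval_cfg.get("dataset") not in ALLOWED_DATASETS:
        raise ValueError(f"eval.dataset must be one of {sorted(ALLOWED_DATASETS)}")
    if eval_cfg.get("answer_method") not in ALLOWED_METHODS:
        raise ValueError(f"eval.answer_method must be one of {sorted(ALLOWED_METHODS)}")
    if int(eval_cfg.get("sample_interval", 1)) < 1:
        raise ValueError("eval.sample_interval must be >= 1")
    return eval_cfg


def subsample(examples: Sequence[Any], sample_interval: int = 1,
              limit: Optional[int] = None) -> List[Any]:
    """Keep every Nth example, then cap the count (assumed order of operations)."""
    kept = list(examples[::sample_interval])
    return kept if limit is None else kept[:limit]


if __name__ == "__main__":
    # In practice this dict would come from yaml.safe_load(open(recipe_path)).
    config = {"eval": {"dataset": "mmmlu", "answer_method": "logits",
                       "sample_interval": 2, "limit": 3}}
    eval_cfg = check_eval_config(config)
    print(subsample(list(range(10)), eval_cfg["sample_interval"], eval_cfg["limit"]))
```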

### Supported Datasets

1. **MMMLU** (`openai/MMMLU`)
   - Multilingual version of MMLU
   - Languages: AR, BN, DE, ES, FR, HI, ID, IT, JA, KO, PT, SW, YO, ZH

2. **MMLU-Redux** (`edinburgh-dawg/mmlu-redux-2.0`)
   - Cleaned version of MMLU
   - 57 subjects across STEM, humanities, social sciences, and other categories
   - Filters out problematic questions

### Answer Methods

- **`logits`**: Directly compare logits for answer options A, B, C, D
  - Faster and more deterministic
  - Good for quick evaluation

- **`generate`**: Generate full text response and extract answer
  - Closer to real usage, since the model produces free-form text
  - Provides length statistics and full reasoning
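
To make the `logits` method concrete, the sketch below picks the option whose token has the highest next-token logit. It uses plain Python lists so it runs without a model. The helper name `pick_option_from_logits` and the token IDs are made up for illustration, and the repository's `generate_answer_with_logits()` may work differently.

```python
from typing import Dict, Sequence


def pick_option_from_logits(next_token_logits: Sequence[float],
                            option_token_ids: Dict[str, int]) -> str:
    """Return the option letter whose token has the largest logit (hypothetical helper)."""
    return max(option_token_ids, key=lambda letter: next_token_logits[option_token_ids[letter]])


if __name__ == "__main__":
    # Toy vocabulary of six tokens; the IDs below are illustrative, not real tokenizer IDs.
    logits = [0.1, 2.5, -1.0, 0.7, 3.2, 0.0]
    option_ids = {"A": 1, "B": 2, "C": 3, "D": 4}
    print(pick_option_from_logits(logits, option_ids))
```

Only the four option tokens are compared, so any logit mass on other tokens is ignored. This is why the method needs a single forward pass and no decoding loop.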

### Output Files

The evaluation produces several output files:

1. **Summary JSON** (`{model}_{dataset}_{method}_{timestamp}_summary.json`)
   - Overall accuracy
   - Per-subject/category accuracies
   - Length statistics (`generate` method only)

2. **CoT CSV** (`{model}_{dataset}_{method}_{timestamp}_cot.csv`)
   - Detailed question-by-question results
   - Generated reasoning (for CoT mode)
   - Input/output lengths

3. **Length Statistics** (`{model}_{dataset}_{method}_{timestamp}_length.json`)
   - Detailed length information per question
   - `generate` method only
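
The naming pattern above can be sketched as follows. The timestamp format and the way slashes in the model name are replaced are assumptions, and `build_output_path` is a hypothetical helper, not a function from the repository.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Maps each output kind to the suffix listed in this README.
SUFFIXES = {"summary": "summary.json", "cot": "cot.csv", "length": "length.json"}


def build_output_path(output_dir: str, model_name: str, dataset: str, method: str,
                      kind: str, timestamp: Optional[str] = None) -> Path:
    """Compose `{model}_{dataset}_{method}_{timestamp}_{suffix}` (hypothetical helper)."""
    if kind not in SUFFIXES:
        raise ValueError(f"kind must be one of {sorted(SUFFIXES)}")
    # Assumed timestamp format; the evaluator may use a different one.
    stamp = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    # Assumption: path separators in the model name are flattened to underscores.
    safe_model = model_name.strip("/").replace("/", "_")
    return Path(output_dir) / f"{safe_model}_{dataset}_{method}_{stamp}_{SUFFIXES[kind]}"


if __name__ == "__main__":
    print(build_output_path("local/results", "org/model", "mmmlu", "logits", "summary"))
```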

## Code Structure

```
script/evaluation/
├── unified_evaluator.py       # Main evaluation script
├── README.md                  # This file
└── (legacy scripts)           # Previous single-dataset scripts

rosetta/utils/
└── evaluate.py                # Common evaluation utilities

eval_recipe/
├── unified_mmmlu_eval.yaml      # MMMLU configuration
├── unified_mmlu_redux_eval.yaml # MMLU-Redux configuration
├── example_advanced_eval.yaml   # Advanced configuration example
└── minimal_eval.yaml            # Minimal configuration for testing
```

## Common Functions (`rosetta/utils/evaluate.py`)

The following functions are available for reuse:

- `load_hf_model()`: Load HuggingFace models
- `load_rosetta_model()`: Load Rosetta models with projectors
- `generate_answer_with_logits()`: Generate answer using logits method
- `generate_answer_with_generate()`: Generate answer using text generation
- `extract_answer_from_content()`: Extract answer letter from text
- `set_default_chat_template()`: Set chat template for models
- `get_option_token_ids()`: Get token IDs for options A, B, C, D
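
For orientation, answer extraction from generated text can be approximated with a couple of regular expressions, as in the sketch below. This is not the implementation of `extract_answer_from_content()`. The function name `extract_choice_letter` and the patterns are assumptions chosen for illustration.

```python
import re
from typing import Optional

# Tried in order: an explicit "answer is (B)" / "Answer: C" form, then any standalone letter.
_PATTERNS = [
    re.compile(r"(?i:answer)\s*(?:(?i:is)\s*|:\s*)?\(?([ABCD])\)?(?![A-Za-z])"),
    re.compile(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])"),
]


def extract_choice_letter(text: str) -> Optional[str]:
    """Return the last matching option letter in `text`, or None (hypothetical helper)."""
    for pattern in _PATTERNS:
        matches = pattern.findall(text)
        if matches:
            # With chain-of-thought output, the final answer usually comes last.
            return matches[-1]
    return None


if __name__ == "__main__":
    print(extract_choice_letter("Let me think. The answer is (B)."))
```

The fallback pattern is deliberately loose and can pick up a capital "A" used as an article, so a production extractor would likely need stricter rules.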

## Adding New Datasets

To add a new multiple-choice dataset:

1. Add dataset configuration to `DATASET_CONFIGS` in `unified_evaluator.py`
2. Implement dataset-specific formatting in `format_example()` if needed
3. Implement answer parsing in `parse_answer()` if needed
4. Update this README
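
The first three steps could look roughly like the sketch below. Every key in the registry entry, the dataset ID `my-org/my-mcq`, and the two function names are hypothetical stand-ins. The real `DATASET_CONFIGS`, `format_example()`, and `parse_answer()` in `unified_evaluator.py` may use different fields and signatures.

```python
from typing import Any, Dict

# Hypothetical registry entry; the real keys in DATASET_CONFIGS may differ.
EXAMPLE_DATASET_CONFIGS: Dict[str, Dict[str, Any]] = {
    "my-mcq": {
        "hf_path": "my-org/my-mcq",  # placeholder dataset ID, not a real dataset
        "split": "test",
        "question_key": "question",
        "choices_key": "choices",
        "answer_key": "answer",
    }
}

OPTION_LETTERS = ["A", "B", "C", "D"]


def format_mcq_example(example: Dict[str, Any], config: Dict[str, Any]) -> str:
    """Render one question with lettered options (stand-in for a format_example hook)."""
    lines = [example[config["question_key"]]]
    for letter, choice in zip(OPTION_LETTERS, example[config["choices_key"]]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def parse_mcq_answer(example: Dict[str, Any], config: Dict[str, Any]) -> str:
    """Normalize the gold answer to a letter (stand-in for a parse_answer hook)."""
    raw = example[config["answer_key"]]
    # Some datasets store an integer index, others a letter string.
    return OPTION_LETTERS[raw] if isinstance(raw, int) else str(raw).strip().upper()


if __name__ == "__main__":
    cfg = EXAMPLE_DATASET_CONFIGS["my-mcq"]
    sample = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}
    print(format_mcq_example(sample, cfg))
    print(parse_mcq_answer(sample, cfg))
```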

## Requirements

- PyTorch
- Transformers
- Datasets (Hugging Face `datasets`)
- NumPy
- PyYAML
- tqdm
- debugpy (optional, only for remote debugging)

## Debugging

To enable remote debugging, uncomment the debugpy lines at the bottom of the script. The script then listens on port 5678 and blocks until a debugger attaches:

```python
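import debugpy

# Listen on all interfaces, port 5678; restrict the host on shared machines.
debugpy.listen(("0.0.0.0", 5678))
print("Waiting for debugger attach...")
# Block until a debugger client connects.
debugpy.wait_for_client()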
```
