# Independent Evaluation Tool

This folder is a standalone evaluation tool. It contains the minimal set of files required to run evaluations.

## Core Files

### Main Scripts
- `evaluate.py` - Main evaluation entry script
- `convert_raw_predictions.py` - Convert raw prediction results to standard format

### Core Modules
- `evaluator.py` - Main evaluator
- `scorer.py` - Scoring module
- `predictions.py` - Prediction results loader

### Dependencies
- `benchmarks/` - Scoring benchmarks for various datasets
- `data/datasets/` - Dataset files
- `scripts/utils/` - Utility scripts (notably `sanitize.py`, used for code cleaning)
- `predictions/` - Prediction results storage directory

## Usage

### 1. Convert Raw Prediction Results
```bash
python convert_raw_predictions.py --dataset mbpp --raw-dir raw_predictions --py
```

### 2. Run Evaluation
```bash
python evaluate.py --dataset mbpp --verbose
```

### 3. Evaluate All Datasets
```bash
python evaluate.py --dataset all
```
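
Steps 1 and 2 above can also be chained from Python. The following is a minimal sketch, not part of the tool itself: the helper names `build_commands` and `run_pipeline` are hypothetical, and the only flags used are the ones shown in the commands above. It assumes the script is run from this folder and that `--py` is appropriate for the dataset being converted, which is only shown here for `mbpp`.

```python
import subprocess
import sys


def build_commands(dataset: str, raw_dir: str = "raw_predictions") -> list[list[str]]:
    """Build the convert and evaluate commands for one dataset (hypothetical helper)."""
    convert = [
        sys.executable, "convert_raw_predictions.py",
        "--dataset", dataset, "--raw-dir", raw_dir, "--py",
    ]
    evaluate = [sys.executable, "evaluate.py", "--dataset", dataset, "--verbose"]
    return [convert, evaluate]


def run_pipeline(dataset: str, raw_dir: str = "raw_predictions") -> None:
    """Run conversion, then evaluation, stopping at the first failing step."""
    for cmd in build_commands(dataset, raw_dir):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so evaluation never runs against a failed conversion.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("mbpp")` would then run the same two commands as steps 1 and 2, in order.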

## Supported Datasets

- **GSM8K** - Mathematical reasoning
- **MATH** - Mathematical problems
- **HumanEval** - Python code generation
- **MBPP** - Python programming problems
- **HotpotQA** - Question answering
- **DROP** - Reading comprehension

## Minimum Dependencies

Only the following files and directories are needed to run evaluations:
- `evaluate.py`
- `evaluator.py`
- `scorer.py`
- `predictions.py`
- `benchmarks/` directory
- `data/datasets/` directory
- `scripts/utils/sanitize.py`

Other files can be deleted as needed.
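
Before deleting anything, it may help to confirm that the minimal set is still intact. The sketch below is an illustration under stated assumptions: `REQUIRED_PATHS` and `find_missing` are hypothetical names, the paths are copied from the list above, and the check only verifies that each path exists, not that its contents are valid.

```python
from pathlib import Path

# Paths copied from the "Minimum Dependencies" list; a trailing "/" marks a directory.
REQUIRED_PATHS = [
    "evaluate.py",
    "evaluator.py",
    "scorer.py",
    "predictions.py",
    "benchmarks/",
    "data/datasets/",
    "scripts/utils/sanitize.py",
]


def find_missing(root: str = ".") -> list[str]:
    """Return the required paths that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    missing = []
    for rel in REQUIRED_PATHS:
        target = base / rel
        # Directories must exist as directories, files as regular files.
        present = target.is_dir() if rel.endswith("/") else target.is_file()
        if not present:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing required paths:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required paths are present.")
```

Run from the tool folder, the script lists any missing entries, or reports that everything is present.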