# Evaluation Module

This module provides tools for evaluating language models on various benchmarks.

## Available Evaluation Types

The following evaluation types are supported:

- `tinyMMLU`: A smaller version of the MMLU benchmark
- `MMLU`: The full Massive Multitask Language Understanding benchmark
- `MMLUPro`: The MMLU Pro benchmark (requires additional setup)
- `tinyGSM8k`: A smaller version of the GSM8K benchmark for math reasoning
- `tinyArc`: A smaller version of the ARC benchmark
- `IFEval`: The IFEval benchmark for instruction following
- `harmbench`: The HarmBench benchmark for evaluating harmful outputs

## Command Line Arguments

- `--model_name`: Name of the model to evaluate (required)
- `--adapter_path`: Path to the adapter weights (optional)
- `--eval_type`: One or more evaluation types to run, separated by spaces (default: ["harmbench", "tinyMMLU"])
- `--batch_size`: Batch size for evaluation (0 = auto-determine)
- `--do_base_eval`: Whether to evaluate the base model without the adapter (0 = no, 1 = yes)
- `--do_unlocked_eval`: Whether to evaluate the model with the adapter applied (0 = no, 1 = yes)
- `--use_vllm`: Whether to use vLLM for evaluation (0 = no, 1 = yes)
- `--torch_dtype`: Torch data type to use (default: "bfloat16")
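
The flags above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the module's actual parser: the function name `build_parser` and the defaults for `--batch_size`, `--do_base_eval`, `--do_unlocked_eval`, and `--use_vllm` are assumptions, since this README states defaults only for `--eval_type` and `--torch_dtype`.

```python
import argparse

# Names taken from the list of supported evaluation types in this README.
EVAL_TYPES = [
    "tinyMMLU", "MMLU", "MMLUPro", "tinyGSM8k", "tinyArc", "IFEval", "harmbench",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command line arguments."""
    parser = argparse.ArgumentParser(
        description="Evaluate a language model on benchmarks"
    )
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--adapter_path", default=None)
    # nargs="+" accepts several space-separated evaluation types.
    parser.add_argument(
        "--eval_type", nargs="+", choices=EVAL_TYPES,
        default=["harmbench", "tinyMMLU"],
    )
    # Assumed default: 0 means the batch size is determined automatically.
    parser.add_argument("--batch_size", type=int, default=0)
    # The 0/1 switches below use assumed defaults.
    parser.add_argument("--do_base_eval", type=int, choices=[0, 1], default=0)
    parser.add_argument("--do_unlocked_eval", type=int, choices=[0, 1], default=1)
    parser.add_argument("--use_vllm", type=int, choices=[0, 1], default=0)
    parser.add_argument("--torch_dtype", default="bfloat16")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
example_args = build_parser().parse_args(
    ["--model_name", "llama2_7b", "--eval_type", "tinyMMLU", "MMLUPro"]
)
```

With this sketch, an unsupported name passed to `--eval_type` is rejected by `argparse` before any model is loaded.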

## Examples

Evaluate a model with an adapter on MMLU Pro, using vLLM:
```bash
python -m evaluation.eval --model_name llama2_7b --adapter_path path/to/adapter --eval_type MMLUPro --use_vllm 1
```

Evaluate both the base model and the adapter model on multiple benchmarks:
```bash
python -m evaluation.eval --model_name llama2_7b --adapter_path path/to/adapter --eval_type tinyMMLU MMLUPro tinyGSM8k --do_base_eval 1 --do_unlocked_eval 1
```

Use the wrapper script:
```bash
python run_eval.py --model_name llama2_7b --eval_type MMLUPro
```