# Evaluation Toolkit

---

## 1. Contents

```
iclr_submission/
  README.md                (this file)
  requirements.txt         (minimal Python dependencies)
  test_prob_vllm_clean.py  (unified async evaluator for multiple datasets)
  dataset_loader.py        (dataset access + lightweight dataclasses)
```

---

## 2. Supported Datasets

| Dataset  | Mode                          | Notes |
|----------|------------------------------|-------|
| AIME 2024 / 2025 | Short integer answers          | Uses public HuggingFace splits |
| LiveCodeBench (LCB) | Code generation + execution | Requires local JSONL with prompts & test cases |
| GPQA     | Multiple choice (A–D)        | Simple answer extraction |
| MATH-500 | LaTeX math answers           | Uses provided JSONL |
| Hi-ToM (optional) | Short answer         | Same abstraction as others |
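
For the two simplest modes in the table (short integer answers and A–D multiple choice), answer extraction can be sketched roughly as below. This is an illustration only, not the toolkit's actual logic: the function names `extract_aime_answer` and `extract_choice` are hypothetical, and the sketch assumes the model's final answer is the last matching token in its output (AIME answers are integers from 0 to 999).

```python
import re
from typing import Optional


def extract_aime_answer(text: str) -> Optional[int]:
    """Return the last 1-3 digit integer in the text, preferring \\boxed{...}."""
    # Assumption: a boxed answer, when present, is the intended final answer.
    boxed = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", text)
    # Fallback: any standalone run of 1-3 digits (longer numbers are skipped).
    candidates = boxed or re.findall(r"(?<!\d)(\d{1,3})(?!\d)", text)
    return int(candidates[-1]) if candidates else None


def extract_choice(text: str) -> Optional[str]:
    """Return the last standalone capital letter A-D in the text, or None."""
    matches = re.findall(r"\b([A-D])\b", text)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice: models often mention intermediate values or rejected options before stating the final answer.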


## 3. Minimal Usage Examples

### 3.1 AIME (single temperature)
```
python test_prob_vllm_clean.py \
  --dataset aime \
  --aime_version 2025 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --temperatures 0.6 \
  --num_samples 32 \
  --output aime2025_temp06_results.jsonl
```

### 3.2 AIME (multi-temperature sweep)
```
python test_prob_vllm_clean.py \
  --dataset aime \
  --aime_version 2025 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --temperatures 0.2 0.4 0.6 0.8 1.0 \
  --num_samples 64 \
  --output_dir results_aime/
```
This writes one JSONL file per temperature into `results_aime/`.
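
After a sweep, a quick sanity check is to confirm that every per-temperature file exists and parses. The sketch below makes two assumptions that are not stated in this README: that the files in the output directory end in `.jsonl`, and that each non-empty line is one JSON record. The helper name `count_records` is hypothetical, and no particular record schema is assumed.

```python
import json
from pathlib import Path
from typing import Dict


def count_records(output_dir: str) -> Dict[str, int]:
    """Map each *.jsonl file name in output_dir to its number of records."""
    counts: Dict[str, int] = {}
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        n = 0
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed record
                    n += 1
        counts[path.name] = n
    return counts


if __name__ == "__main__":
    # Directory name taken from the sweep command above.
    for name, n in count_records("results_aime/").items():
        print(f"{name}: {n} records")
```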

### 3.3 LiveCodeBench
```
python test_prob_vllm_clean.py \
  --dataset lcb \
  --lcb_jsonl lcb_v6_with_prompts.jsonl \
  --model Qwen/Qwen2.5-7B-Instruct \
  --temperatures 0.8 \
  --num_samples 10 \
  --output lcb_temp08_results.jsonl \
  --evaluate_lcb
```
Add `--save_logprobs` to also dump the token arrays; this makes the output heavier.

### 3.4 GPQA (multiple choice)
```
python test_prob_vllm_clean.py \
  --dataset gpqa \
  --gpqa_jsonl gpqa_dataset.jsonl \
  --model Qwen/Qwen2.5-7B-Instruct \
  --temperatures 0.7 \
  --num_samples 20 \
  --output gpqa_temp07_results.jsonl
```

### 3.5 MATH-500
```
python test_prob_vllm_clean.py \
  --dataset math500 \
  --math500_jsonl math500_level5.jsonl \
  --model Qwen/Qwen2.5-7B-Instruct \
  --temperatures 0.4 0.8 \
  --num_samples 32 \
  --output_dir results_math500/
```
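
Because MATH-500 answers are LaTeX expressions, a regex is usually not enough to pull out an answer that contains nested braces such as `\frac{1}{2}`. The sketch below shows one common approach, brace matching on the last `\boxed{...}` in the output. It assumes the model is prompted to box its final answer, which this README does not state, and the helper name `last_boxed` is hypothetical rather than part of the toolkit.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never balanced, e.g. a truncated generation
```

Returning `None` for unbalanced braces lets the caller treat truncated generations as unanswered instead of scoring a partial expression.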
