# Benchmark Evaluation

Evaluation library for the MADQA benchmark.

## Installation

```bash
pip install -r requirements.txt
export GOOGLE_API_KEY="your_api_key"  # For semantic evaluation (optional)
```

## Usage

### Command Line

```bash
# Basic evaluation (uses dev split by default)
python evaluate.py results.jsonl

# With category/domain breakdown
python evaluate.py results.jsonl --by-category --by-domain

# Compare multiple models
python evaluate.py model1.jsonl model2.jsonl model3.jsonl --compare

# Output as JSON
python evaluate.py results.jsonl --json

# Use semantic accuracy with LLM judge (requires GOOGLE_API_KEY)
python evaluate.py results.jsonl --semantic

# Specify dataset path
python evaluate.py results.jsonl --dataset ../dataset/data --split dev
```

### Expected Input Format

JSONL file with one prediction per line:

```json
{"id": "test/0", "question": "What is the total revenue?", "answer": "$1.2M", "citations": [{"document": "report.pdf", "page": 5}], "search_history": ["query1", "query2"]}
```

Required fields:
- `question`: The question text (used to match with gold standard)
- `answer`: Predicted answer string

Optional fields:
- `id`: Question ID (fallback if question text doesn't match)
- `citations`: List of `{document, page}` for citation evaluation
- `search_history`: List of search queries (for Kuiper effort analysis)
- `iterations`: Integer step count, used in place of the length of `search_history` when that list is not provided
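
As a minimal sketch of producing a file in this format, the helper below serializes predictions one per line and checks the two required fields before writing. The names `to_jsonl_line` and `write_predictions` are hypothetical and not part of this library.

```python
import json

# Required fields, as listed above.
REQUIRED_FIELDS = ("question", "answer")


def to_jsonl_line(prediction):
    """Serialize one prediction dict to a single JSON line (no trailing newline)."""
    missing = [key for key in REQUIRED_FIELDS if key not in prediction]
    if missing:
        raise ValueError(f"prediction is missing required fields: {missing}")
    return json.dumps(prediction, ensure_ascii=False)


def write_predictions(path, predictions):
    """Write predictions to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prediction in predictions:
            f.write(to_jsonl_line(prediction) + "\n")


example = {
    "id": "test/0",
    "question": "What is the total revenue?",
    "answer": "$1.2M",
    "citations": [{"document": "report.pdf", "page": 5}],
    "search_history": ["query1", "query2"],
}
print(to_jsonl_line(example))
```

Calling `write_predictions("results.jsonl", [example])` would then produce a file that can be passed to `evaluate.py`.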

### Dataset Splits

By default, evaluation runs against the `dev` split, which has full ground truth.
The `test` split has undisclosed answers to keep benchmarking fair.

## Metrics

| Metric | Description |
|--------|-------------|
| **Accuracy (Judge)** | LLM-judged correctness with bias correction |
| **Document F1** | Citation accuracy at document level |
| **Page F1** | Citation accuracy at page level |
| **Kuiper Statistic** | Effort-accuracy calibration (lower = better) |
| **Wasted Effort Ratio** | Mean steps on incorrect answers divided by mean steps on correct answers: μ_steps(incorrect) / μ_steps(correct) |
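
The Wasted Effort Ratio formula in the table can be sketched in a few lines. This is an illustration only: `wasted_effort_ratio` is a hypothetical name, the record shape (`steps`, `correct`) is borrowed from the Python API example in this README, and returning `None` for an undefined ratio is an assumption rather than documented behavior.

```python
from statistics import mean


def wasted_effort_ratio(results):
    """Return mean steps on incorrect answers / mean steps on correct answers.

    Returns None when the ratio is undefined (no correct answers, no
    incorrect answers, or zero mean steps on correct answers).
    """
    correct = [r["steps"] for r in results if r["correct"]]
    incorrect = [r["steps"] for r in results if not r["correct"]]
    if not correct or not incorrect:
        return None
    mean_correct = mean(correct)
    if mean_correct == 0:
        return None
    return mean(incorrect) / mean_correct


results = [
    {"steps": 2, "correct": True},
    {"steps": 4, "correct": True},
    {"steps": 6, "correct": False},
]
print(wasted_effort_ratio(results))  # 6 / 3 = 2.0
```

A ratio above 1 means the system spends more steps on questions it ends up getting wrong than on questions it gets right.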

### Hop Type Analysis

Results are automatically broken down by evidence complexity:
- **single**: Answer from a single page
- **cross_page**: Answer requires multiple pages from the same document
- **cross_doc**: Answer requires pages from different documents
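
The three categories can be derived from a list of gold evidence locations, as in the sketch below. It assumes locations use the same `{document, page}` shape as citations. The function name `hop_type` is hypothetical, and the dataset may instead store this label directly.

```python
def hop_type(gold_locations):
    """Classify evidence complexity from a list of {document, page} dicts."""
    if not gold_locations:
        raise ValueError("at least one gold location is required")
    documents = {loc["document"] for loc in gold_locations}
    pages = {(loc["document"], loc["page"]) for loc in gold_locations}
    if len(documents) > 1:
        return "cross_doc"   # evidence spans different documents
    if len(pages) > 1:
        return "cross_page"  # several pages, all in one document
    return "single"          # exactly one page


print(hop_type([{"document": "a.pdf", "page": 1}]))
print(hop_type([{"document": "a.pdf", "page": 1}, {"document": "a.pdf", "page": 2}]))
print(hop_type([{"document": "a.pdf", "page": 1}, {"document": "b.pdf", "page": 1}]))
```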

## Accuracy (Judge)

When `--semantic` is enabled, the evaluation uses an LLM judge to assess semantic equivalence between predictions and gold answers. This helps with:
- Format variations (e.g., "$1.2M" vs "1.2 million dollars")
- Acceptable verbosity (e.g., "three security questions" vs "3")

The LLM judge scores are bias-corrected using calibration values from human evaluation.
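
This README does not specify the correction formula. One common way to correct an imperfect binary judge is the Rogan-Gladen estimator, sketched below as an illustration only. The function name and the calibration numbers are hypothetical, and the library's actual method may differ.

```python
def rogan_gladen_correction(observed_accuracy, sensitivity, specificity):
    """Estimate true accuracy from the judge's observed accuracy.

    sensitivity: P(judge says correct | human says correct)
    specificity: P(judge says incorrect | human says incorrect)
    Both would come from comparing judge verdicts with human labels.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("judge must be better than chance (sensitivity + specificity > 1)")
    corrected = (observed_accuracy + specificity - 1.0) / denominator
    # Clamp to a valid proportion; sampling noise can push the estimate outside [0, 1].
    return min(1.0, max(0.0, corrected))


# Hypothetical calibration values, for illustration only.
print(rogan_gladen_correction(0.80, sensitivity=0.95, specificity=0.90))
```

With a perfect judge (sensitivity and specificity both 1.0) the estimator returns the observed accuracy unchanged.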

## Python API

```python
from metrics import compute_accuracy, citation_f1, kuiper_statistic

# Accuracy with LLM judge (requires GOOGLE_API_KEY)
result = compute_accuracy("$1.2 million", [["$1.2M"]], question="What is the revenue?")
print(result['score'], result['used_llm'])

# Citation F1
f1 = citation_f1(
    predicted=[{"document": "a.pdf", "page": 1}],
    gold_locations=[{"document": "a.pdf", "page": 1}, {"document": "a.pdf", "page": 2}],
    level='page'
)

# Kuiper statistic (one record per evaluated question)
results = [{"steps": 3, "correct": True}, {"steps": 7, "correct": False}]
kuiper = kuiper_statistic(results)
```
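
For intuition about what the `level` argument changes, here is a set-based F1 sketch. It is not the implementation behind `citation_f1`: the name `set_f1` is hypothetical, and returning 0.0 when either side is empty is an assumption.

```python
def set_f1(predicted, gold, level="page"):
    """Set-based F1 over citations at 'document' or 'page' granularity."""
    if level not in ("document", "page"):
        raise ValueError("level must be 'document' or 'page'")

    def key(citation):
        # Document level ignores page numbers; page level keeps them.
        if level == "document":
            return citation["document"]
        return (citation["document"], citation["page"])

    predicted_keys = {key(c) for c in predicted}
    gold_keys = {key(c) for c in gold}
    true_positives = len(predicted_keys & gold_keys)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_keys)
    recall = true_positives / len(gold_keys)
    return 2 * precision * recall / (precision + recall)


predicted = [{"document": "a.pdf", "page": 1}]
gold = [{"document": "a.pdf", "page": 1}, {"document": "a.pdf", "page": 2}]
print(set_f1(predicted, gold, level="page"))      # precision 1.0, recall 0.5
print(set_f1(predicted, gold, level="document"))  # both sides reduce to {"a.pdf"}
```

In this sketch the same prediction scores lower at page level than at document level, because the second gold page is missed while the document is still identified correctly.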
