# Supplementary Material: MADQA Benchmark

This supplementary material contains code and data for reproducing the experiments described in the paper.

## Contents

```
.
├── README.md                    # This file
├── dataset/                     # Benchmark dataset
│   ├── data/                    # Dataset splits (load with load_from_disk)
│   │   ├── train/               # Training split (1,550 examples)
│   │   ├── dev/                 # Development split (200 examples)
│   │   └── test/                # Test split (500 examples, answers non-disclosed)
│   ├── download_dataset.py      # Script to load and explore the dataset
│   └── README.md
├── baseline/                    # Gemini File Search baseline implementation
│   ├── gemini_file_search_agent.py
│   ├── requirements.txt
│   └── README.md
├── eval/                        # Evaluation code
│   ├── evaluate.py              # Main evaluation script
│   ├── metrics.py               # Accuracy, Citation F1, Kuiper metrics
│   ├── requirements.txt
│   └── README.md
└── sample_pdfs/                 # Sample PDF documents (see note below)
    └── README.md
```

## Dataset

The benchmark dataset is included locally in `dataset/data/` and can be loaded with `load_from_disk` from the Hugging Face `datasets` library:

```python
from datasets import load_from_disk

# Load all splits
dataset = load_from_disk("dataset/data")

# Access specific splits
train = dataset["train"]  # 1,550 examples
dev = dataset["dev"]      # 200 examples (with ground truth)
test = dataset["test"]    # 500 examples (answers non-disclosed)
```
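As a quick sanity check after loading, the split sizes can be compared against the counts listed in this README. The snippet below is a minimal sketch: `check_split_sizes` and `EXPECTED_SIZES` are hypothetical names that are not part of this repository, and the helper only assumes it receives a mapping from split name to a sized object, which the `DatasetDict` returned by `load_from_disk` satisfies.

```python
# Split sizes as stated in this README.
EXPECTED_SIZES = {"train": 1550, "dev": 200, "test": 500}


def check_split_sizes(splits, expected=EXPECTED_SIZES):
    """Return {split: (actual, expected)} for every split whose size differs.

    `splits` is any mapping from split name to a sized object, such as the
    DatasetDict returned by load_from_disk. A missing split is reported with
    an actual size of None. Hypothetical helper, not part of the repository.
    """
    mismatches = {}
    for name, want in expected.items():
        got = len(splits[name]) if name in splits else None
        if got != want:
            mismatches[name] = (got, want)
    return mismatches
```

Calling `check_split_sizes(dataset)` on the loaded dataset should return an empty dict if the local copy matches the sizes above.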

### Dataset Statistics
- **Questions**: 2,250 total across train/dev/test splits
- **Categories**: Form, Invoice, Letter, Poster, Report, Guide, etc.
- **Domains**: Business, Legal, Technical, etc.

### Important: Test Set

**Answer variants and evidence locations for the test set are intentionally withheld.**

This is to:
1. Prevent data contamination in LLM training
2. Enable fair benchmarking
3. Maintain benchmark integrity over time

For development, use the `dev` split, which includes full ground truth annotations.

## Baseline: Gemini File Search

The `baseline/` directory contains the Gemini File Search baseline implementation.

### Setup

```bash
cd baseline
pip install -r requirements.txt

# Set API key
export GOOGLE_API_KEY="your_google_api_key"
```

### Running the Baseline

```bash
# 1. Index PDFs (from sample_pdfs or your own directory)
python gemini_file_search_agent.py index --pdf-dir ../sample_pdfs

# 2. Run evaluation on dev split
python gemini_file_search_agent.py evaluate results.jsonl --split dev
```

See `baseline/README.md` for detailed usage instructions.

## Evaluation

The `eval/` directory contains the evaluation code.

### Running Evaluation

```bash
cd eval
pip install -r requirements.txt

# Evaluate results against dev split
python evaluate.py results.jsonl --dataset ../dataset/data --split dev

# With detailed breakdown
python evaluate.py results.jsonl --by-category --by-domain

# Compare multiple models
python evaluate.py model1.jsonl model2.jsonl --compare
```

### Metrics

| Metric | Description |
|--------|-------------|
| **Accuracy (Judge)** | LLM-judged correctness with bias correction |
| **Document F1** | Citation F1 computed at the document level |
| **Page F1** | Citation F1 computed at the page level |
| **Kuiper Statistic** | Effort-accuracy calibration measure |
| **Wasted Effort Ratio** | Ratio of effort spent on incorrect versus correct answers |
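
To illustrate what a citation F1 at the document and page level can look like, here is a minimal set-based sketch. It is not the implementation in `eval/metrics.py`, which may differ in matching rules and edge cases. The function names, the `(document, page)` tuple representation, and the choice to score empty inputs as 0.0 are all assumptions made for this example.

```python
def citation_f1(predicted, gold):
    """Set-based precision, recall, and F1 between two collections of citations.

    Assumption: an empty predicted or gold set yields 0.0 for the affected
    scores; the repository's metrics may handle this case differently.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def to_documents(citations):
    """Collapse (document, page) citations to the set of cited documents."""
    return {document for document, _page in citations}


# Hypothetical citations as (document, page) pairs.
predicted = [("a.pdf", 1), ("a.pdf", 3), ("b.pdf", 2)]
gold = [("a.pdf", 1), ("a.pdf", 2)]

page_scores = citation_f1(predicted, gold)
document_scores = citation_f1(to_documents(predicted), to_documents(gold))
```

In this example the page-level F1 is 0.4 (one of three predicted pages is correct, one of two gold pages is found), while the document-level F1 is about 0.667, since the only gold document is cited alongside one spurious document.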

## Sample PDFs

**Note:** Due to the ICML supplementary material size limit of 100 MB, only a small sample of PDF documents is included in `sample_pdfs/`. The full PDF corpus will be made available upon publication.

The sample PDFs demonstrate the document types in the benchmark but are not sufficient for full evaluation.

## Requirements

- Python 3.9+
- Google Cloud account and an API key exported as `GOOGLE_API_KEY` (for the Gemini baseline)

## License

This code is provided for research purposes. See the main paper for licensing information.
