# MADQA Dataset

This directory contains the MADQA benchmark dataset.

## Dataset Structure

```
data/
├── train/          # Training split (1,550 examples)
├── dev/            # Development split (200 examples)
└── test/           # Test split (500 examples, answers non-disclosed)
```

## Loading the Dataset

```python
from datasets import load_from_disk

# Load all splits
dataset = load_from_disk("./data")

# Access specific splits
train = dataset["train"]
dev = dataset["dev"]
test = dataset["test"]

# Inspect dev examples; in the test split, answer_variants and evidence are empty
for example in dev:
    print(example["question"])
    print(example["answer_variants"])
    print(example["evidence"])
```

## Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique question identifier |
| `question` | string | The question text |
| `answer_variants` | list[list[str]] | Acceptable answer variants (empty for test) |
| `evidence` | list[dict] | Source documents and pages (empty for test) |
| `document_category` | string | Document type (Form, Invoice, Report, etc.) |
| `domain` | string | Subject domain |
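
As a quick sanity check on the fields above, the following sketch validates a record against the listed types. The `validate_example` helper and the sample record are hypothetical and only for illustration. In particular, the `document` and `page` keys inside `evidence` are assumptions, since the keys of each evidence dict are not specified here.

```python
from typing import Any

# Field names and top-level types as listed in the Data Fields table.
EXPECTED_FIELDS = {
    "id": str,
    "question": str,
    "answer_variants": list,
    "evidence": list,
    "document_category": str,
    "domain": str,
}


def validate_example(example: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in example:
            problems.append(f"missing field: {name}")
        elif not isinstance(example[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")

    # answer_variants is list[list[str]]: every variant must be a list of strings.
    variants = example.get("answer_variants")
    if isinstance(variants, list):
        for variant in variants:
            if not (isinstance(variant, list) and all(isinstance(s, str) for s in variant)):
                problems.append("answer_variants: each variant must be a list of strings")
                break

    # evidence is list[dict]; the dict keys are not checked because they are not documented here.
    evidence = example.get("evidence")
    if isinstance(evidence, list) and not all(isinstance(item, dict) for item in evidence):
        problems.append("evidence: each item must be a dict")
    return problems


# Hypothetical record; all values (and the evidence keys) are made up for illustration.
sample = {
    "id": "example-0001",
    "question": "What is the invoice total?",
    "answer_variants": [["$1,200.00"], ["1200"]],
    "evidence": [{"document": "example.pdf", "page": 1}],
    "document_category": "Invoice",
    "domain": "Finance",
}

print(validate_example(sample))
```

An empty `answer_variants` or `evidence` list passes this check, which matches how the test split is described.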

## Test Set Note

**The test set's answer variants and evidence locations are intentionally withheld.**

Withholding them serves three purposes:
1. Prevent test data from contaminating LLM training corpora
2. Enable fair benchmarking
3. Maintain benchmark integrity

For development and debugging, use the `dev` split, which includes full ground truth annotations.
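
Because test records carry empty `answer_variants` and `evidence`, evaluation code may want to skip them explicitly rather than score against empty ground truth. A minimal sketch, assuming records are plain dicts with the fields described in this README; the `has_ground_truth` helper and the two sample records are hypothetical:

```python
def has_ground_truth(example: dict) -> bool:
    """True when the record carries both answer variants and evidence."""
    return bool(example.get("answer_variants")) and bool(example.get("evidence"))


# Hypothetical records: one dev-style (annotated), one test-style (withheld).
records = [
    {"id": "dev-0", "answer_variants": [["42"]], "evidence": [{"page": 1}]},
    {"id": "test-0", "answer_variants": [], "evidence": []},
]

# Keep only the records that can actually be scored.
scorable = [r["id"] for r in records if has_ground_truth(r)]
print(scorable)
```

With the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `dev.filter(has_ground_truth)`.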

## Sample PDFs

Due to file size constraints (100 MB limit), only a small sample of the PDF documents is included in `../sample_pdfs/`. The full PDF corpus will be made available separately.

## Statistics

| Split | Questions | With Answers |
|-------|-----------|--------------|
| train | 1,550 | Yes |
| dev | 200 | Yes |
| test | 500 | No (non-disclosed) |
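
The counts above can double as a quick integrity check after loading. The sketch below is an illustration, not part of the dataset tooling: `check_split_sizes` is a hypothetical helper that accepts any mapping from split name to a sized collection (a loaded `DatasetDict` should fit, since it behaves like a dict of datasets).

```python
# Expected number of questions per split, taken from the Statistics table.
EXPECTED_SIZES = {"train": 1550, "dev": 200, "test": 500}


def check_split_sizes(splits) -> dict[str, tuple[int, int]]:
    """Return {split: (expected, actual)} for every split whose size differs."""
    mismatches = {}
    for name, expected in EXPECTED_SIZES.items():
        # A missing split is reported with an actual size of 0.
        actual = len(splits[name]) if name in splits else 0
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


# Stand-in data with the right sizes; replace with the loaded dataset in practice.
fake_splits = {"train": range(1550), "dev": range(200), "test": range(500)}
print(check_split_sizes(fake_splits))
print(sum(EXPECTED_SIZES.values()))  # total number of questions across splits
```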
