# Dataset Schema

## Statement-SFT Training Sample

`dataset/sampled_train/statement_sft_train_sample.jsonl`

```json
{
  "id": 34152,
  "split_key": "RelSeries",
  "prompt": "Translate the following Lean formal statement ...",
  "completion": "theory ... end"
}
```

The `completion` field holds an Isabelle theory skeleton whose proofs may be left as `sorry`. The sampled file is drawn from the Herald-ISA training split.
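
As a rough illustration of how rows with this shape could be read and checked, the sketch below parses JSONL lines and verifies that the four fields from the example are present. The helper name `parse_sft_rows` and the error format are hypothetical and not part of the repository.

```python
import json

# Field names taken from the example row above.
SFT_KEYS = {"id", "split_key", "prompt", "completion"}


def parse_sft_rows(lines):
    """Parse JSONL lines and check that each row has the SFT fields.

    `lines` is any iterable of strings, for example an open file handle.
    Blank lines are skipped.
    """
    rows = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        missing = SFT_KEYS - row.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        rows.append(row)
    return rows
```

With a file on disk, this could be called as `parse_sft_rows(open(path, encoding="utf-8"))`, where `path` points at the sampled JSONL file.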

## Theory-SFT Training Sample

`dataset/sampled_train/theory_sft_train_sample.jsonl`

```json
{
  "id": 34152,
  "split_key": "RelSeries",
  "prompt": "### Task ...",
  "completion": "theory ... end"
}
```

The `completion` field holds a full Isabelle theory and must not contain `sorry` or `oops` outside comments. The sampled file is drawn from the Herald-ISA training split.
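
The "no `sorry` or `oops` outside comments" rule can be approximated with a small check. The sketch below is an assumption about how such a check might look, not the release tooling's implementation: it only strips `(* ... *)` block comments (with nesting) and then searches for the two keywords as whole words. It does not handle string literals, other comment forms, or text where `(*` appears for another reason. The function names are hypothetical.

```python
import re


def strip_isabelle_comments(text: str) -> str:
    """Remove (* ... *) block comments, honoring nesting (approximation)."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "(*":
            depth += 1
            i += 2
        elif pair == "*)" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)


# Whole-word match so identifiers like "sorry_lemma" are not flagged
# ("_" counts as a word character for \b).
_UNFINISHED = re.compile(r"\b(sorry|oops)\b")


def has_unfinished_proof(completion: str) -> bool:
    """True if `sorry` or `oops` appears outside (* ... *) comments."""
    return _UNFINISHED.search(strip_isabelle_comments(completion)) is not None
```

A row would pass this check when `has_unfinished_proof(row["completion"])` is `False`.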

## GRPO Sample

`dataset/sampled_train/grpo_train_sample.jsonl`

```json
{
  "id": "lean_workbook_plus_44891",
  "split_key": "lean_workbook_plus_44891",
  "lean": {
    "name": "...",
    "header": "",
    "formal_theorem": "...",
    "formal_proof": "...",
    "informal_theorem": "...",
    "informal_proof": "",
    "isabelle_statement": "theory ... end"
  }
}
```

These examples support Reference-Statement proof translation and GRPO prompt
construction. `src/train_grpo_mvp.py prepare` converts these rows into a
TRL-compatible prompt dataset with the fields `id`, `split_key`, `lean_name`,
and `prompt`. Training-time verifier output is not stored in the sampled
dataset.
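
To make the target shape concrete, here is a minimal sketch of mapping one GRPO row to a prompt row with the four fields named above. It is not the code in `src/train_grpo_mvp.py`: the prompt template, the function name `to_prompt_row`, and the choice to fill `lean_name` from `lean.name` are all assumptions made for illustration.

```python
# Hypothetical prompt template; the real wording used by the project is not shown here.
PROMPT_TEMPLATE = (
    "Translate the Lean proof into Isabelle.\n\n"
    "Lean theorem:\n{formal_theorem}\n\n"
    "Lean proof:\n{formal_proof}\n\n"
    "Isabelle statement:\n{isabelle_statement}\n"
)


def to_prompt_row(row: dict) -> dict:
    """Flatten a GRPO sample row into id / split_key / lean_name / prompt."""
    lean = row["lean"]
    prompt = PROMPT_TEMPLATE.format(
        formal_theorem=lean["formal_theorem"],
        formal_proof=lean["formal_proof"],
        isabelle_statement=lean["isabelle_statement"],
    )
    return {
        "id": row["id"],
        "split_key": row["split_key"],
        "lean_name": lean["name"],  # assumed source of lean_name
        "prompt": prompt,
    }
```

Note that no verifier output appears in the resulting row, which matches the statement that such output is not stored in the sampled dataset.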

## MiniF2F-DSP

`dataset/minif2f_dsp_isa.jsonl`

```json
{
  "id": "aime_1983_p1",
  "split_key": "aime_1983_p1",
  "lean": {
    "name": "...",
    "header": "...",
    "formal_theorem": "...",
    "formal_proof": "...",
    "informal_theorem": "...",
    "informal_proof": ""
  }
}
```

This external set is used for Predicted-Statement evaluation. Its rows intentionally omit the `isabelle_statement` field, so no reference Isabelle statement is available.
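
Because the two row shapes differ only in whether `lean.isabelle_statement` is present, a consumer could pick the evaluation setting from the row itself. The sketch below assumes that convention; the function name and the returned labels are hypothetical.

```python
def evaluation_mode(row: dict) -> str:
    """Pick a setting based on whether a reference Isabelle statement exists."""
    statement = row.get("lean", {}).get("isabelle_statement")
    # A missing or empty statement is treated as "no reference available".
    return "reference_statement" if statement else "predicted_statement"
```

Under this assumption, every row in `dataset/minif2f_dsp_isa.jsonl` would map to `"predicted_statement"`.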

## Sanitization

All sampled files were produced and sanitized by external release tooling. The
policy is whitelist-only: no teacher candidates, verifier traces, API logs,
RAG snapshots, local paths, or secrets are kept.
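
A whitelist-only policy means a field survives only if it is explicitly listed. The release tooling itself is not shown here, so the following is just a sketch of the idea, with the allowed keys taken from the example rows in this document and a hypothetical function name.

```python
# Keys that appear in the example rows of this document (assumed whitelist).
ALLOWED_TOP_LEVEL = {"id", "split_key", "prompt", "completion", "lean"}
ALLOWED_LEAN = {
    "name",
    "header",
    "formal_theorem",
    "formal_proof",
    "informal_theorem",
    "informal_proof",
    "isabelle_statement",
}


def sanitize_row(row: dict) -> dict:
    """Keep only whitelisted keys; everything else is dropped."""
    clean = {k: v for k, v in row.items() if k in ALLOWED_TOP_LEVEL}
    if isinstance(clean.get("lean"), dict):
        clean["lean"] = {k: v for k, v in clean["lean"].items() if k in ALLOWED_LEAN}
    return clean
```

The design choice is that unknown fields are removed by default, so a new kind of log or trace cannot leak into a release just because nobody added it to a blocklist. A key filter like this does not inspect field values, so it would not by itself catch a local path or secret embedded in a kept string.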
