# Data

This directory contains the benchmark data as JSONL files. The code for
rebuilding them is in the software bundle.

## Layout

- `global_perturbations/<dataset>/dataset*.jsonl` — five styles per problem:
  the original natural-language text (`dataset.jsonl`) plus four meaning-preserving rewrites,
  two from Gemini-2.5-Flash (`*_gemini_rephrase.jsonl`, `*_gemini_step.jsonl`)
  and two from Qwen3-Next-80B-A3B-Thinking (`*_qwen3_rephrase.jsonl`,
  `*_qwen3_step.jsonl`).
- `local_perturbations/<dataset>/{number_edit,symbol_edit,step_delete}/`
  — single-edit perturbations. Number and symbol edits have separate files
  for the statement and the proof; step deletion has one file (proofs only).
- `concurrent_submissions/` — anonymized PDF of a concurrent submission by
  overlapping authors (ARR originality requirement; see bottom of this file).

`<dataset>` is `miniF2F/` (244 problems) or `MATH-500/` (500 problems).
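
As a minimal sketch, the style of a global-perturbation file can be read off its name using the suffixes listed above. The function name `style_of` and the style labels are illustrative choices, not part of the released code.

```python
from pathlib import PurePosixPath

# Suffixes come from the layout described above; the labels on the right
# are hypothetical names chosen for this example.
STYLE_SUFFIXES = {
    "_gemini_rephrase.jsonl": "gemini_rephrase",
    "_gemini_step.jsonl": "gemini_step",
    "_qwen3_rephrase.jsonl": "qwen3_rephrase",
    "_qwen3_step.jsonl": "qwen3_step",
}


def style_of(path: str) -> str:
    """Map a global-perturbation file path to one of the five styles."""
    name = PurePosixPath(path).name
    if name == "dataset.jsonl":
        return "original"
    for suffix, style in STYLE_SUFFIXES.items():
        if name.endswith(suffix):
            return style
    raise ValueError(f"unrecognized global-perturbation file: {name}")
```

For example, `style_of("global_perturbations/miniF2F/dataset.jsonl")` returns `"original"`. Any file name that matches none of the five patterns raises a `ValueError`, so unexpected files are not silently mislabeled.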

## Counts

| Local edit | miniF2F | MATH-500 |
|---|---:|---:|
| number, statement | 227 | 490 |
| number, proof | 241 | 489 |
| symbol, statement | 130 | 195 |
| symbol, proof | 195 | 349 |
| step deletion | 129 | 192 |

Total local edits: 2,637 (922 for miniF2F, 1,715 for MATH-500). Global
perturbations: 744 problems (244 + 500) × 5 styles = 3,720 instances.
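
The totals can be re-derived from the table with a few lines of arithmetic. The snippet below only restates the numbers given in this file; the variable names are illustrative.

```python
# (miniF2F, MATH-500) counts per local edit type, copied from the table above.
LOCAL_COUNTS = {
    "number, statement": (227, 490),
    "number, proof": (241, 489),
    "symbol, statement": (130, 195),
    "symbol, proof": (195, 349),
    "step deletion": (129, 192),
}

minif2f_total = sum(m for m, _ in LOCAL_COUNTS.values())
math500_total = sum(n for _, n in LOCAL_COUNTS.values())
local_total = minif2f_total + math500_total

# 244 miniF2F problems plus 500 MATH-500 problems, five styles each.
global_total = (244 + 500) * 5

print(minif2f_total, math500_total, local_total, global_total)
# 922 1715 2637 3720
```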

## Record format

Each line is one JSON object with `name` (problem id),
`informal_statement`, `informal_proof`, and `formal_statement` (the gold
Lean 4 signature; miniF2F only).
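
A minimal loading sketch, assuming one JSON object per non-empty line as described above. The helper `iter_records` and its `require_formal` flag are hypothetical. The flag exists because this file does not say whether MATH-500 records omit the `formal_statement` key or leave it empty.

```python
import json
from typing import Iterable, Iterator

# Keys every record is described as carrying.
REQUIRED_KEYS = ("name", "informal_statement", "informal_proof")


def iter_records(lines: Iterable[str], require_formal: bool = False) -> Iterator[dict]:
    """Parse JSONL lines and check that the documented keys are present."""
    required = REQUIRED_KEYS + (("formal_statement",) if require_formal else ())
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        missing = [key for key in required if key not in record]
        if missing:
            raise KeyError(f"line {lineno}: missing {missing}")
        yield record
```

An open file handle can be passed directly as `lines`, for example `with open(path, encoding="utf-8") as fh: records = list(iter_records(fh))`.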

Local-edit records additionally carry the unperturbed pair
(`original_informal_statement`, `original_informal_proof`) and the
metadata for the single edit:

- Number edits: `number_edit_source` (`statement` or `proof`),
  `number_edit_char_offset_start` / `_end`,
  `number_edit_old_value`, `number_edit_new_value`.
- Symbol edits: same shape with `symbol_edit_*` and
  `_old_symbol` / `_new_symbol`.
- Step deletion: `step_edit_*` (step index, removed reasoning span). The
  edit is verified structurally by comparing `original_informal_proof`
  with `informal_proof` rather than from these metadata fields.
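
One way to sketch a structural check for step deletion is to test whether `informal_proof` equals `original_informal_proof` with exactly one contiguous span removed. This is an illustration under that assumption, not the verification code from the software bundle. It compares raw characters, so any whitespace normalization applied when the data was built would need handling first. The function name `deleted_span` is hypothetical.

```python
from typing import Optional, Tuple


def deleted_span(original: str, perturbed: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) such that original[:start] + original[end:] == perturbed,
    or None if the perturbed text is not the original minus one contiguous span."""
    if len(perturbed) >= len(original):
        return None  # nothing was deleted
    # Length of the longest common prefix.
    i = 0
    while i < len(perturbed) and original[i] == perturbed[i]:
        i += 1
    # Whatever remains of the perturbed text must be a suffix of the original.
    tail = perturbed[i:]
    if not original.endswith(tail):
        return None
    return i, len(original) - len(tail)


span = deleted_span("A. B. C.", "A. C.")
print(span)  # (3, 6)
```

When several spans would give the same result (for instance with repeated text), the function returns the one that starts latest, since it extends the common prefix as far as possible.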

## Source and license

miniF2F problems and Lean signatures come from the public miniF2F
repository (the Lean part is Apache-2.0). MATH-500 problems come from the
MATH dataset (MIT); for each problem we append the gold answer as a
"Show that it is X" clause. Global rewrites are generated by
Gemini-2.5-Flash and Qwen3-Next-80B-A3B-Thinking. We release this
benchmark under the MIT License.

## Concurrent submission (ARR originality requirement)

Per ARR's originality/overlap rules, the anonymized PDF of a concurrent
submission by overlapping authors is included here:

- `concurrent_submissions/Anonymous_BASE.pdf` —
  *Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection
  for Math Reasoning* (Anonymous, under review, 2026).

That paper proposes BASE, a base-and-edit pipeline for efficient answer
selection in math reasoning. It is cited and discussed in the Related
Work section of the present paper. Its contribution (selection
efficiency) is distinct from ours (robustness evaluation of proof
autoformalization); the two papers do not overlap in stated contributions.
