Faults in our Formal Benchmarks

Published: 17 Oct 2025, Last Modified: 21 Nov 2025, MATH-AI 2025 Poster, CC BY 4.0
Keywords: Formalization, Benchmarks
TL;DR: Several overlooked defects in popular formal benchmarks are degrading the quality of formal evaluation.
Abstract: Recent interest in LLM-assisted theorem proving has led to the rapid introduction of several benchmarks. Benchmarks in Lean are considered highly reliable because of the nature of proof assistants and formal languages. However, we question the quality of popular datasets and find that multiple popular benchmarks contain material defects. Through examples, we discuss common issues that persist across datasets, such as omitted side conditions, misformalization, and incorrect or incomplete translations. We show that such defects occur in every dataset we examined, despite their widespread use. Defective datasets threaten the quality of evaluations in the field, and we propose improved standards for future dataset releases.
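
To make the "omitted side condition" failure mode concrete, here is a minimal Lean 4 sketch. This is our own illustration, not an example from the paper; it assumes Mathlib is available, and the lemma name `div_mul_cancel₀` is an assumption that may vary across Mathlib versions.

```lean
import Mathlib

-- Hypothetical illustration of an omitted side condition.
-- Informally, "a / b * b = a" silently assumes b ≠ 0; a formalization
-- that drops that hypothesis states a false theorem, because in
-- Mathlib division by zero is defined and a / 0 = 0:
theorem div_mul_dropped (a b : ℝ) : a / b * b = a := by
  sorry -- unprovable: take a = 1, b = 0, then 1 / 0 * 0 = 0 ≠ 1

-- The faithful formalization carries the side condition explicitly:
theorem div_mul_faithful (a b : ℝ) (hb : b ≠ 0) : a / b * b = a :=
  div_mul_cancel₀ a hb
```

A benchmark item shaped like the first statement grades provers against an unprovable goal, which is exactly the kind of defect the abstract describes.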
Submission Number: 223