Keywords: benchmark, lean, LLM, theorem proving
Abstract: Reinforcement learning-based large language model (LLM) theorem provers have recently demonstrated remarkable progress in formal mathematical proof. However, their capabilities in broader formal reasoning tasks remain unclear. To address this gap, we introduce ArgBench, a benchmark dataset grounded in formal argumentation theory, designed to systematically evaluate large models on key abilities such as understanding novel concepts and constructing counterexamples.
Our main contributions are as follows.
First, we ground the benchmark in formal argumentation theory, a relatively underexplored area of logic with many open problems. This choice substantially reduces the risk of pretraining data leakage or contamination and enables a more faithful assessment of models' capacity to adapt to new definitions and rules.
Second, we propose a type-theoretic automatic generation method that constructs large-scale datasets at minimal human cost.
Third, the generation algorithm is decoupled from any specific domain, allowing straightforward transfer to other formal reasoning settings.
Evaluation on ArgBench reveals that mainstream large-model provers perform poorly overall, with Goedel Prover achieving only a 5.7\% success rate. Further analysis highlights counterexample construction as a particular weakness. Based on these findings, we suggest a promising direction: using ArgBench as a training environment to strengthen counterexample construction through reinforcement learning, thereby advancing toward more general-purpose formal reasoning.
Primary Area: datasets and benchmarks
Submission Number: 19455