Temperature-scaled large language models for Lean proofstep prediction
Keywords: large language models, theorem proving, interactive theorem proving, Lean, finetuning, overfitting
TL;DR: We identify a metric that allows efficient model selection for finetunings on small datasets, and use it to develop state-of-the-art tactic generation models for Lean.
Abstract: Leveraging the reasoning capabilities of large language models (LLMs) for theorem proving is a promising but challenging task because it requires in-domain finetunings on which LLMs are known to be prone to overfit. This issue is exacerbated by two properties that set theorem proving apart from more mainstream applications of LLMs: training data in formal environments like Lean or Isabelle is very scarce and evaluation benchmarks are prohibitively costly to be used extensively for hyperparameter search and model selection. In this work, we propose temperature scaling as a regularization method for multi-epoch training on small datasets. We explain its theoretical purpose heuristically and demonstrate its effectiveness empirically, obtaining state-of-the-art supervised tactic generation models for Lean 3 of sizes 1.5B, 7B and 13B parameters. Model selection based on temperature-scaled perplexity increases scores on theorem proving benchmarks by up to four percentage points. We provide detailed ablations and analyses of the proof search behaviors of the resulting models, allowing practitioners to pick optimal model sizes for their respective use cases.
Submission Number: 25
Loading