Keywords: LEAN, LLM, Benchmark, NLP
TL;DR: We introduce N$\mathsf{L}^2$PS, a Natural Language to Lean Proofs System, build an accompanying benchmark, and evaluate LLMs on it.
Abstract: The inference capabilities of large language models (LLMs) are advancing rapidly, approaching the limits of current benchmarks. Notably, models such as Llama 3 have shown substantial improvements on the MATH and GSM8k benchmarks. However, these benchmarks may not accurately assess reasoning skills because of their sensitivity to prompt formulation. Lean, a proof assistant and functional programming language, offers a promising framework for evaluating LLM inference capabilities in formal mathematics. Recent integrations of LLMs with Lean have shown potential for automated theorem proving and for discovering new theorems, though current performance remains below that on informal benchmarks. This paper proposes a method for transforming informal datasets into formal ones using Lean, enhancing the evaluation of LLM reasoning. We introduce N$\mathsf{L}^2$PS, an approach that combines few-shot prompting, rejection sampling, and backward translation to ensure accurate formalization. We address key research challenges, such as solution finding and progressive proving, to improve the assessment of LLMs in formal domains.
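To make the pipeline concrete, here is a minimal Python sketch of a rejection-sampling loop with backward-translation filtering of the kind the abstract describes. All helper functions (`formalize`, `lean_typechecks`, `informalize`, `semantically_equivalent`) are hypothetical stand-ins for the named components, not the paper's actual interface.

```python
"""Sketch of autoformalization via rejection sampling and backward
translation. Every helper is a hypothetical placeholder, assumed only
for illustration."""
from typing import Optional


def formalize(informal: str) -> str:
    # Hypothetical: few-shot prompt an LLM for a candidate Lean statement.
    return f"theorem t : True := trivial  -- candidate for: {informal}"


def lean_typechecks(lean_code: str) -> bool:
    # Hypothetical: run the Lean elaborator/type checker on the candidate.
    return True


def informalize(lean_code: str) -> str:
    # Hypothetical backward translation: Lean statement -> natural language.
    return "back-translated statement"


def semantically_equivalent(a: str, b: str) -> bool:
    # Hypothetical judge (e.g., an LLM) that the round trip preserved
    # the meaning of the original statement.
    return True


def autoformalize(informal: str, num_samples: int = 8) -> Optional[str]:
    """Rejection sampling with backward-translation filtering."""
    for _ in range(num_samples):
        candidate = formalize(informal)
        if not lean_typechecks(candidate):
            continue  # reject candidates Lean cannot elaborate
        if semantically_equivalent(informal, informalize(candidate)):
            return candidate  # accept the first faithful formalization
    return None  # no faithful formalization found within the budget
```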
Submission Number: 46