Published: 28 Oct 2023, Last Modified: 17 Nov 2023, MATH-AI 23 Poster
Keywords: language models, interactive theorem provers, evaluation
TL;DR: We evaluate the (implicit) mathematical knowledge base of an LLM against the (explicit) knowledge base of an ITP
Abstract: Wiedijk's list of 100 theorems provides a benchmark for comparing interactive theorem provers (ITPs) and their mathematics libraries. As shown by the GHOSTS dataset, large language models (LLMs) can also serve as searchable libraries of mathematics, given their capacity to ingest vast amounts of mathematical literature during their pre-training or fine-tuning phases. ITP libraries are the only other repositories of comparable size and range of mathematical intricacy. This paper presents the first comparison between these two mathematical resources, centered on Wiedijk's list. Beyond the intrinsic interest of such a comparison, we discuss the importance of analyzing whether the knowledge contained in LLMs (represented by GPT-4 and Claude 2) matches that encoded in ITPs. This analysis thus helps advance work at the intersection of LLM and ITP technology (for example, autoformalization, LLM-guided proof generation, and proof completion) by ensuring that LLMs possess, beyond ITP code generation capabilities, sufficient mathematical knowledge to carry out the desired formalization. The dataset with our findings, called "LLMKnow", is made available to the public.
Submission Number: 19