Combining Textual and Structural Information for Premise Selection in Lean

Published: 17 Oct 2025, Last Modified: 21 Nov 2025 · MATH-AI 2025 Poster · CC BY 4.0
Keywords: premise selection, Lean, graph dataset, language model, graph neural network, proof state
TL;DR: We introduce a graph-augmented language approach for premise selection in Lean that uses GNNs to capture structural information, producing embeddings that outperform text-based baselines.
Abstract: Premise selection is a key bottleneck for scaling theorem proving in large formal libraries. Yet existing language-based methods often treat premises in isolation, ignoring the web of dependencies that connects them. We present a graph-augmented approach that combines dense text embeddings of Lean formalizations with graph neural networks over a heterogeneous dependency graph capturing both state–premise and premise–premise relations. On the LeanDojo Benchmark, our method outperforms the ReProver language-based baseline by over 25% across standard retrieval metrics. These results suggest that relational information is beneficial for premise selection.
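The abstract describes fusing dense text embeddings with graph-based embeddings computed over a state–premise / premise–premise dependency graph, then ranking premises by similarity to the proof state. The sketch below is a minimal illustration of that idea, not the authors' implementation: it uses a single mean-aggregation message-passing step as a stand-in for a GNN layer, toy two-dimensional embeddings with hypothetical values, and concatenation as the fusion operator.

```python
import math

def mean_aggregate(node, neighbors, emb):
    # One message-passing step: average the node's embedding with its
    # neighbors' embeddings (a toy stand-in for a GNN layer).
    vecs = [emb[node]] + [emb[n] for n in neighbors]
    dim = len(emb[node])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def fuse(text_vec, graph_vec):
    # Combine textual and structural information by concatenation.
    return text_vec + graph_vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical text embeddings for a proof state and two premises.
text_emb = {
    "state":    [1.0, 0.0],
    "premiseA": [0.9, 0.1],
    "premiseB": [0.0, 1.0],
}
# Toy heterogeneous dependency graph: the state references premiseA,
# and premiseA depends on premiseB.
graph = {"state": ["premiseA"], "premiseA": ["premiseB"], "premiseB": []}

# Build fused embeddings: text vector concatenated with the
# graph-aggregated vector.
fused = {
    n: fuse(text_emb[n], mean_aggregate(n, graph[n], text_emb))
    for n in text_emb
}

# Retrieval step: rank candidate premises by cosine similarity
# to the fused proof-state embedding.
scores = {p: cosine(fused["state"], fused[p]) for p in ["premiseA", "premiseB"]}
best = max(scores, key=scores.get)
```

In this toy setup the textually similar and graph-adjacent `premiseA` ranks above `premiseB`; the paper's actual method uses learned GNN layers and dense language-model embeddings rather than these hand-set vectors.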
Submission Number: 121