Semantic Search over 9 Million Mathematical Theorems

Luke Alexander; Eric Leonen; Sophie Szeto; Artemii Remizov; Ignacio Tejeda; Giovanni Inchiostro; Vasily Ilin; Jarod Alper

Published: 05 Mar 2026, Last Modified: 21 Apr 2026

Venue: ICLR 2026 Workshop LLM Reasoning

License: CC BY 4.0

Track: long paper (up to 10 pages)

Keywords: Semantic search, mathematical theorem retrieval, representation learning, large-scale datasets, embedding-based retrieval, scientific information retrieval

TL;DR: We study semantic retrieval over a corpus of more than 9 million human-authored, research-level mathematical theorems, and show how representation and embedding choices critically affect theorem-level search quality at web scale.

Abstract: Searching for mathematical results remains difficult: most existing tools retrieve entire papers, whereas mathematicians and theorem-proving agents often seek a specific theorem, lemma, or proposition that answers a query. Although semantic search has progressed rapidly, its behavior on large, highly technical corpora such as research-level mathematical theorems remains poorly understood. In this work, we introduce and study semantic theorem retrieval at scale over a unified corpus of 9.2 million theorem statements extracted from arXiv and seven other sources, representing the largest publicly available corpus of human-authored, research-level theorems. We represent each theorem by a short natural-language description that serves as its retrieval representation, and we systematically analyze how representation context, language model choice, embedding model, and prompting strategy affect retrieval quality. On a curated evaluation set of theorem-search queries written by professional mathematicians, our approach substantially improves both theorem-level and paper-level retrieval compared to existing baselines, demonstrating that semantic theorem search is feasible and effective at web scale.
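The retrieval setup described in the abstract (one short natural-language description per theorem, embedded and ranked against a query) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-entry corpus, the theorem ids, and the function names `embed`, `cosine`, and `search` are hypothetical, and the bag-of-words vector is only a toy stand-in for the learned embedding models the paper compares. It is not the authors' implementation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical corpus: (theorem id, short natural-language description).
CORPUS = [
    ("thm-1", "every bounded monotone sequence of real numbers converges"),
    ("thm-2", "a continuous function on a compact set attains its maximum"),
    ("thm-3", "every finite abelian group is a direct sum of cyclic groups"),
]

# Index the descriptions once; queries are embedded at search time.
INDEX = [(tid, embed(desc)) for tid, desc in CORPUS]


def search(query, k=2):
    """Return the ids of the k theorems whose descriptions best match the query."""
    q = embed(query)
    scored = [(cosine(q, vec), tid) for tid, vec in INDEX]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [tid for _, tid in scored[:k]]
```

Calling `search("structure of finite abelian groups", k=1)` ranks the description of `thm-3` first because it shares the most terms with the query. At the scale of millions of theorems, the exhaustive scan in `search` would presumably be replaced by an approximate nearest-neighbor index, and the toy `embed` by a dense embedding model.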

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.

Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.

Submission Number: 31
