LiveMathematicianBench: A Benchmark for Research-Level Mathematics with Proof Sketches

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 · MSLD 2026 Poster · Readers: Everyone · License: CC BY 4.0
Keywords: LLM4Math, Large Language Models, Mathematical Reasoning
Abstract: **Background:** The integration of Large Language Models (LLMs) into scientific workflows is a transformative frontier for AI. Mathematics is an ideal testbed for evaluating these models because of its rigorous logical structure. However, existing benchmarks (e.g., GSM8K, MATH, OlympiadBench) are increasingly inadequate for assessing genuine scientific competence.

**The Challenge:** Current evaluation practice suffers from two critical limitations. First, there is a fundamental misalignment between "Olympiad-style" competition math and authentic research: existing benchmarks focus on calculation-heavy tricks and closed-form answers, failing to capture the deep structural understanding required in professional research. Second, state-of-the-art models suffer from data contamination, often memorizing solution templates rather than reasoning from first principles. While recent efforts such as REALMATH address contamination by sourcing from arXiv, they systematically filter out complex reasoning tasks, thereby reducing research math to mere pattern matching.

**The Proposed Solution:** To bridge the gap between LLM evaluation and the cognitive reality of mathematical research, we introduce **LiveMathematicianBench**, a dynamic and comprehensive evaluation framework. The suite shifts the paradigm from measuring an LLM's ability to "solve exams" to its capacity to "comprehend theorems" in their full structural complexity. We expand the scope of evaluation along two critical dimensions: the diversity of mathematical reasoning (moving beyond a single unique answer) and the inclusion of high-level proof strategies (Proof Sketches) that measure mathematical intuition and hierarchical thought.

**Key Contributions:**
1. **Dynamic Data Sourcing:** We implement a continuous pipeline that extracts theorems from arXiv papers published strictly after a model's training cutoff, guaranteeing a contamination-free evaluation environment (a minimal filtering sketch follows below).
2. **High-Fidelity Taxonomy:** We introduce a novel, logic-first taxonomy with seven distinct categories (Equivalence, Implication, Universal, Classification/Bijection, Algorithmic/Constructive, Asymptotic, and Inequality/Bound), allowing granular diagnosis of logical reasoning capabilities (see the schema sketch below).
3. **Sketch-Adversarial Distractor Generation:** To prevent models from relying on naive generalization, we introduce a generative protocol that leverages Proof Sketches to engineer "Intuitive Traps": hard-negative distractors that appear plausible but are falsified by the specific, deep structural logic of the proof. Successful performance therefore requires genuine comprehension of the underlying mathematical strategy (a prompt-level sketch is given below).
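To make the contamination control of Contribution 1 concrete, here is a minimal sketch of post-cutoff filtering. It assumes the third-party `arxiv` Python client and a hypothetical per-model cutoff table; the paper's actual pipeline (theorem extraction, deduplication, category coverage) is not shown.

```python
from datetime import datetime, timezone

import arxiv  # third-party client for the public arXiv API (pip install arxiv)

# Hypothetical cutoff table; real values must come from each model's card.
MODEL_CUTOFFS = {
    "example-model-v1": datetime(2025, 10, 1, tzinfo=timezone.utc),
}

def fresh_math_papers(model_name: str, max_results: int = 50):
    """Yield arXiv math papers submitted strictly after the model's cutoff."""
    cutoff = MODEL_CUTOFFS[model_name]
    search = arxiv.Search(
        query="cat:math.NT",  # one illustrative subject class; widen as needed
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending,
    )
    for result in arxiv.Client().results(search):
        # result.published is a timezone-aware datetime, so the comparison
        # below enforces the "strictly after the cutoff" requirement.
        if result.published > cutoff:
            yield result.entry_id, result.title

if __name__ == "__main__":
    for paper_id, title in fresh_math_papers("example-model-v1"):
        print(paper_id, title)
```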
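The seven-way taxonomy of Contribution 2 can be pinned down as a simple schema. The sketch below is our own illustrative encoding, not code from the benchmark; the category names come directly from the abstract, while the item fields are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class TheoremCategory(Enum):
    """Logic-first theorem taxonomy named in the abstract."""
    EQUIVALENCE = "equivalence"                  # P <=> Q statements
    IMPLICATION = "implication"                  # P => Q statements
    UNIVERSAL = "universal"                      # "for all x ..." claims
    CLASSIFICATION_BIJECTION = "classification"  # complete enumerations / bijections
    ALGORITHMIC_CONSTRUCTIVE = "constructive"    # existence via explicit construction
    ASYMPTOTIC = "asymptotic"                    # growth-rate / limit behaviour
    INEQUALITY_BOUND = "inequality"              # bounds and extremal estimates

@dataclass
class BenchmarkItem:
    """One evaluation item; field names are hypothetical."""
    arxiv_id: str
    statement: str
    proof_sketch: str
    category: TheoremCategory
    distractors: list[str]
```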
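Contribution 3 pairs each theorem with "Intuitive Trap" distractors derived from its proof sketch. Below is a minimal sketch of how such a generation prompt could be assembled; the prompt wording and the `generate` stub are hypothetical, since the abstract does not specify a backend or template.

```python
DISTRACTOR_PROMPT = """\
You are given a theorem and a high-level sketch of its proof.

Theorem: {statement}
Proof sketch: {proof_sketch}

Write {n} alternative statements that a reader relying on surface
intuition would accept, but that are falsified by a specific step
of the sketch above. For each, name the step that rules it out.
"""

def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with any chat-completion backend."""
    raise NotImplementedError

def make_intuitive_traps(statement: str, proof_sketch: str, n: int = 3) -> str:
    # The proof sketch is what makes the negatives "hard": each distractor is
    # tied to the structural logic of the proof, not just the statement.
    prompt = DISTRACTOR_PROMPT.format(
        statement=statement, proof_sketch=proof_sketch, n=n
    )
    return generate(prompt)
```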
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 185