Keywords: Inequality Proving, Large Language Models, LLM-as-judge, Theorem Proving, Informal Mathematics, Mathematical Reasoning
TL;DR: We introduce IneqMath, an informal inequality proving benchmark, and an LLM-as-judge suite, revealing that top LLMs achieve <10% overall accuracy due to flawed step-wise reasoning.
Abstract: Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an *informal yet verifiable* task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release *IneqMath*, an expert-curated dataset of Olympiad-level inequalities, including a test set and a training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation suite, combining a *final-answer* judge with four specialized *step-wise* judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on *IneqMath* reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny, a drop of up to 65.5% from their accuracy when only final-answer equivalence is considered. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement.
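To illustrate why the two subtasks are "automatically checkable", here is a minimal sketch of a deterministic final-answer check. The actual suite described above uses an LLM-as-judge; this sketch, with hypothetical task labels and answer formats (`"relation"`, `"bound"`), only shows that both answer types admit mechanical verification.

```python
# A minimal sketch of a final-answer check for the two subtasks.
# Task labels and answer formats are assumptions for illustration,
# not the benchmark's actual schema or judging pipeline.
from sympy import simplify, sympify


def check_final_answer(task: str, predicted: str, gold: str) -> bool:
    """Return True if the predicted final answer matches the gold answer."""
    if task == "relation":
        # Relation prediction: the answer is one of a small symbol set
        # (e.g., "<=", ">="), so normalized string comparison suffices.
        return predicted.strip() == gold.strip()
    if task == "bound":
        # Bound estimation: the answer is a constant; compare symbolically
        # so equivalent forms (e.g., "sqrt(2)/2" vs "1/sqrt(2)") both pass.
        try:
            return simplify(sympify(predicted) - sympify(gold)) == 0
        except (TypeError, ValueError):
            return False
    raise ValueError(f"unknown task type: {task}")


# Example: a bound written in a different but equal form is accepted.
assert check_final_answer("bound", "sqrt(2)/2", "1/sqrt(2)")
```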
Croissant File: json
Dataset URL: https://huggingface.co/datasets/AI4Math/IneqMath
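A minimal sketch of loading the dataset above with the Hugging Face `datasets` library; split and field names are not assumed, the snippet simply inspects whatever the Hub provides (see the dataset card for the actual schema).

```python
# Sketch: load IneqMath from the Hugging Face Hub and inspect one record.
from datasets import load_dataset

ds = load_dataset("AI4Math/IneqMath")  # downloads and caches from the Hub
print(ds)                              # available splits and their sizes
split = next(iter(ds.values()))        # pick an arbitrary split
print(split[0])                        # fields of the first problem record
```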
Code URL: https://github.com/lupantech/ineqmath
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 560