BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs

Published: 17 Oct 2025, Last Modified: 21 Nov 2025, MATH-AI 2025 Poster, CC BY 4.0
Keywords: ai, llm, reasoning, math, sycophancy, hallucinations, trustworthy
TL;DR: We introduce BrokenMath, the first benchmark for evaluating LLMs' sycophancy in natural language theorem proving.
Abstract: Large language models (LLMs) have shown strong performance on mathematical benchmarks. However, they are also prone to sycophancy, generating convincing but flawed proofs for incorrect theorems supplied by users. Unfortunately, existing benchmarks for mathematical sycophancy are limited, as they rely on simple and often-contaminated final-answer problems rather than more difficult proof-based tasks. To address this, we introduce BrokenMath, the first benchmark for evaluating LLMs' sycophancy in natural language theorem proving. BrokenMath is built from advanced 2025 competition problems, which are perturbed with an LLM to produce false statements and subsequently refined through expert review. Using an LLM-as-a-judge, we evaluate state-of-the-art LLMs and find that sycophancy is widespread, with the best model, GPT-5, producing sycophantic answers 29% of the time. We further investigate several mitigation strategies and find that these approaches reduce, but do not eliminate, sycophancy.
Supplementary Material: zip
Submission Number: 208