RHIM: Benchmarking Redundant Hypothesis Identification Reveals Systematic Gaps in LLM Logical Reasoning
Track: long paper (up to 10 pages)
Keywords: Redundant hypothesis detection, mathematical reasoning, large language models, proof verification, automated theorem proving
TL;DR: Current LLMs struggle to detect redundant hypotheses in mathematical proofs: identification accuracy is barely above chance, and proof verification is unreliable.
Abstract: Identifying and removing redundant hypotheses is fundamental to mathematical discovery, yet the ability of large language models to perform this reasoning remains unexplored. We introduce RHIM (Redundant Hypothesis Identification in Mathematics), to the best of our knowledge the first benchmark for evaluating redundant hypothesis detection in mathematical proof problems, comprising 200 problems with verified ground truth. Through comprehensive experiments with state-of-the-art models, including DeepSeek-Reasoner, Gemini-2.5-Flash, and GPT-5.2, we reveal critical failures across three hierarchical tasks: detection (false alarm rates of 38–99.5%), identification (accuracy of 32–64%, at the low end barely above the 25.8% random baseline that arises because the number of hypotheses varies across problems), and verification (23–34.5% acceptance of logically invalid proofs). These results demonstrate that proof generation ability does not imply an understanding of the logical dependencies between assumptions and conclusions, a capability essential for rigorous mathematical reasoning and theorem refinement.
Presenter: ~Hai_Dinh2
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 138