GEAR: A $\textbf{G}$eneral $\textbf{E}$valuation Framework for $\textbf{A}$bductive $\textbf{R}$easoning
Keywords: Abductive reasoning, Hypothesis generation, LLM evaluation
Abstract: Since the advent of Large Language Models (LLMs), research has primarily focused on improving their instruction-following and deductive reasoning abilities. Yet a central question remains: can these models truly discover new knowledge, and how can we evaluate this ability? In this work, we address this question by studying abductive reasoning: the process of generating plausible hypotheses to explain observations.
We introduce the **G**eneral **E**valuation framework for **A**bductive **R**easoning (GEAR), a new general-purpose, fully automated, transparent, and label-free evaluation paradigm that overcomes limitations of prior approaches. GEAR evaluates a set of hypotheses using three metrics: **consistency** (each hypothesis correctly explains the given observations), **generalizability** (consistent hypotheses make meaningful predictions on unseen inputs), and **diversity** (the set of hypotheses covers many distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers needed), reliable (transparent, deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new, plausible hypotheses, unlike existing static benchmarks that saturate once accuracy is high). Using GEAR, we conduct a fine-grained study of nine LLMs on four popular abduction benchmarks ($1{,}500$ problems), generating $50{,}340$ candidate hypotheses. GEAR reveals model differences and insights that are obscured by prior gold-answer-based or purely human evaluations.
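To make the three metrics concrete, the following is a minimal, hypothetical sketch of GEAR-style scoring. Representing hypotheses as callables, along with every function name and the choice of diversity as a count of distinct prediction patterns, are assumptions for illustration only, not the paper's implementation.

```python
# Illustrative sketch of GEAR-style scoring (NOT the paper's implementation).
# Hypotheses are modeled as callables mapping an input to a predicted output;
# observations are (input, output) pairs; unseen_inputs are held-out inputs.
from typing import Callable, Hashable, List, Sequence, Tuple

Hypothesis = Callable[[Hashable], Hashable]


def is_consistent(h: Hypothesis, observations: Sequence[Tuple[Hashable, Hashable]]) -> bool:
    """A hypothesis is consistent if it reproduces every given observation."""
    return all(h(x) == y for x, y in observations)


def gear_scores(
    hypotheses: List[Hypothesis],
    observations: Sequence[Tuple[Hashable, Hashable]],
    unseen_inputs: Sequence[Hashable],
) -> dict:
    consistent = [h for h in hypotheses if is_consistent(h, observations)]

    # Generalizability: consistent hypotheses must make a defined (non-None)
    # prediction on every unseen input.
    generalizable = [
        h for h in consistent if all(h(x) is not None for x in unseen_inputs)
    ]

    # Diversity: number of distinct prediction patterns the hypothesis set
    # covers on the unseen inputs.
    patterns = {tuple(h(x) for x in unseen_inputs) for h in generalizable}

    n = max(len(hypotheses), 1)
    return {
        "consistency": len(consistent) / n,
        "generalizability": len(generalizable) / n,
        "diversity": len(patterns),
    }


if __name__ == "__main__":
    obs = [(1, 2), (2, 4)]          # observed input/output pairs
    unseen = [3, 4]                 # held-out inputs
    hyps = [lambda x: 2 * x, lambda x: x + x, lambda x: x ** 2]
    print(gear_scores(hyps, obs, unseen))
```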
We further propose a momentum-based curriculum training strategy that dynamically adjusts GEAR-derived training data according to learning velocity: it begins with the objectives the model learns fastest and shifts toward harder ones, such as generating diverse hypotheses, once the model is confident on foundational objectives (e.g., instruction following and consistency). Without gold-label supervision, this strategy improves all three GEAR objectives (consistency, generalizability, and diversity), and these gains transfer to established abductive-reasoning benchmarks. Taken together, GEAR provides a principled framework that not only evaluates abduction but also supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses. We will release code and data upon acceptance.
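As a rough illustration of how such a curriculum could reweight the three objectives by learning velocity, consider the sketch below. The momentum coefficient, confidence threshold, and class interface are assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch of momentum-based curriculum weighting over the three
# GEAR objectives; all constants and update rules below are assumptions.
OBJECTIVES = ["consistency", "generalizability", "diversity"]


class MomentumCurriculum:
    def __init__(self, beta: float = 0.9, confidence: float = 0.9):
        self.beta = beta                # momentum coefficient for velocity smoothing
        self.confidence = confidence    # score above which an objective counts as learned
        self.prev = {o: 0.0 for o in OBJECTIVES}
        self.velocity = {o: 0.0 for o in OBJECTIVES}

    def update(self, scores: dict) -> dict:
        """Given current GEAR scores in [0, 1], return data-sampling weights."""
        for o in OBJECTIVES:
            gain = scores[o] - self.prev[o]
            # Momentum-smoothed learning velocity per objective.
            self.velocity[o] = self.beta * self.velocity[o] + (1 - self.beta) * gain
            self.prev[o] = scores[o]

        # Favor objectives the model is currently improving on quickly; once an
        # objective is confidently learned, its weight shifts to the others.
        raw = {
            o: (0.0 if scores[o] >= self.confidence else max(self.velocity[o], 1e-3))
            for o in OBJECTIVES
        }
        total = sum(raw.values()) or 1.0
        return {o: w / total for o, w in raw.items()}


if __name__ == "__main__":
    cur = MomentumCurriculum()
    # Early on, weight follows the fastest-improving objective (consistency).
    print(cur.update({"consistency": 0.6, "generalizability": 0.3, "diversity": 0.1}))
    # Once consistency is confidently learned, weight shifts to harder objectives.
    print(cur.update({"consistency": 0.92, "generalizability": 0.4, "diversity": 0.15}))
```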
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15353