Track: tiny / short paper (up to 4 pages)
Keywords: logical reasoning, consistency, large language models, cross-query contradictions, inference-time methods, benchmark, natural language inference, evaluation
TL;DR: We benchmark 18 frontier LLMs on cross-query logical consistency, reveal universal 36-57pp gaps between individual accuracy and set-level consistency, and propose a training-free method (CGD) that improves consistency for 16/17 models.
Abstract: Large language models answer individual logic questions with reasonable accuracy, yet frequently contradict themselves across logically related queries -- affirming a conditional while denying its contrapositive, or endorsing a transitive chain while rejecting the implied conclusion. We introduce ConsistencyBench, a benchmark of 493 logically entailed question sets (1,904 questions) spanning six categories of formal and commonsense reasoning, designed to measure cross-query logical consistency. We evaluate eighteen frontier LLMs -- including GPT-5.2, GPT-4.1, Claude Opus 4.6, Gemini 2.5 Pro, DeepSeek-R1, o3, and Qwen 2.5 72B -- and find that even the strongest model (GPT-4.1) achieves only 46.7% set-level consistency despite 83.0% individual accuracy, revealing consistency gaps of 36-57 percentage points across all models tested. We propose Consistency-Guided Decoding (CGD), a training-free, model-agnostic inference-time method that detects and repairs cross-query contradictions via NLI-based checking. Across 17 models, CGD improves set-level consistency by +6.6pp on average (up to +19.7pp for GPT-4o), while simultaneously improving individual accuracy by +2.8pp on average, demonstrating that cross-query consistency is a tractable target for inference-time intervention.
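The notion of set-level consistency used above can be illustrated with a toy satisfiability check: a model's yes/no answers to a logically entailed question set are consistent iff some truth assignment to the underlying atoms realizes all of them at once. This is a hypothetical sketch under assumed propositional encodings, not the benchmark's actual scorer or the CGD repair step; `jointly_satisfiable` and the lambda encodings are illustrative names.

```python
from itertools import product

def jointly_satisfiable(formulas, answers, atoms):
    """Brute-force search for a truth assignment under which every
    formula evaluates to the model's asserted yes/no answer."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(f(env) == a for f, a in zip(formulas, answers)):
            return True
    return False

# Toy example (assumed encoding): a conditional and its contrapositive
# are logically equivalent, so answering them differently is a
# cross-query contradiction.
cond   = lambda e: (not e["p"]) or e["q"]   # p -> q
contra = lambda e: (not e["p"]) or e["q"]   # not q -> not p  (equivalent)

print(jointly_satisfiable([cond, contra], [True, True], ["p", "q"]))   # True
print(jointly_satisfiable([cond, contra], [True, False], ["p", "q"]))  # False
```

A set is scored consistent only if the whole answer vector passes such a joint check, which is why set-level consistency can fall far below per-question accuracy.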
Presenter: ~Aayam_Bansal1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 57