Entailment Closure Failures in Large Language Models: A Benchmark for Cross-Query Logical Consistency

Published: 01 Apr 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: large language models, logical reasoning, entailment closure, cross-query consistency, benchmark, belief consistency, Z3, modus ponens
TL;DR: We introduce ECF-Bench to measure when LLMs affirm premises in separate queries but deny entailed conclusions, and show that even strong models frequently break closure unless given an explicit premise recap.
Abstract: Large language models (LLMs) are increasingly deployed as implicit knowledge bases, yet their logical consistency across independent queries remains poorly understood. Existing benchmarks evaluate reasoning within a single prompt, neglecting whether an LLM's aggregate commitments satisfy basic properties from classical logic. We introduce ECF-Bench, a benchmark that systematically audits LLMs for entailment-closure failures: cases in which a model affirms a set of premises across separate queries but denies their logically necessary conclusions. ECF-Bench comprises 3,200 test suites spanning propositional logic, first-order taxonomic reasoning, and multi-hop inference chains, with ground-truth labels certified by the Z3 SMT solver. We evaluate seven LLMs and find that all models exhibit substantial closure violations, with failure rates ranging from 17% to 58% depending on reasoning depth and logical structure. Strikingly, models that achieve high single-query accuracy still violate entailment closure at alarming rates, revealing a fundamental gap between local reasoning competence and global logical coherence. We further show that chain-of-thought prompting reduces but does not eliminate these failures. We also test two lightweight mitigations (query self-consistency and recap-conditioned conclusion prompts that surface prior premise text), which cut overall violation rates roughly in half for top models (e.g., GPT-4's closure-violation rate falls from 18.7% to 9.1%) while making explicit where the strict query-independence assumption is relaxed. Our results highlight the need for cross-query consistency as a first-class evaluation criterion for LLM reasoning.
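The abstract's Z3-certified labeling rests on a standard reduction: a set of premises entails a conclusion exactly when the premises conjoined with the negated conclusion are unsatisfiable. Below is a minimal sketch of that check in Python with the z3-solver package, using a modus ponens instance in the spirit of the benchmark; the helper name entails and the example suite are illustrative assumptions, not the authors' actual pipeline.

# Sketch: certify an entailment label with Z3.
# Premises entail a conclusion iff (premises AND NOT conclusion) is unsatisfiable.
from z3 import Bool, Implies, And, Not, Solver, unsat

def entails(premises, conclusion):
    # Hypothetical helper, not from the paper: check unsatisfiability
    # of the premises together with the negated conclusion.
    s = Solver()
    s.add(And(*premises), Not(conclusion))
    return s.check() == unsat

p, q = Bool("p"), Bool("q")
premises = [p, Implies(p, q)]   # each premise would be affirmed in a separate query
conclusion = q                  # the logically necessary conclusion

print(entails(premises, conclusion))  # True: Z3 certifies that q must hold

A closure violation in ECF-Bench's sense would then be a model that answers "yes" to both premises when queried independently but answers "no" when asked whether q holds.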
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 169