Axiomatic Negation Coherence in Language Models: Evidence from FOLIO

Published: 28 Dec 2025, Last Modified: 10 Apr 2026 · AAAI 2026 Bridge (LMReasoning) · CC BY 4.0
Keywords: logical consistency, negation coherence, large language models, FOLIO benchmark, axiomatic evaluation
TL;DR: LLMs can satisfy a simple negation axiom on FOLIO yet still reason poorly; our TinyLlama study shows coherence and competence come apart.
Abstract: Large language models (LLMs) have quickly become the default tool for a wide range of NLP tasks, yet their logical behaviour is still poorly understood. Most existing evaluations focus on task accuracy on benchmarks, without asking whether a model’s internal “beliefs” about statements are even coherent under basic logical principles. In this work, we take a small but concrete step in that direction. We view an LLM as a black-box function that maps a logical formula φ to a number p(φ) ∈ [0, 1] that we interpret as the model’s degree of belief that φ is true. Based on this view, we introduce a simple axiomatic framework that specifies how these degrees of belief should behave if they are to resemble a classical probability measure. We focus on one particularly transparent constraint: a negation-coherence axiom requiring p(φ)+p(¬φ) ≤ 1 for every formula φ. From this axiom we derive a per-instance violation score and an aggregate consistency metric. To make this concrete, we instantiate the framework on FOLIO, a first-order logic reasoning benchmark expressed in natural language. Using a small open-source chat model, TinyLlama-1.1B-Chat, we estimate p(φ) and p(¬φ) from yes/no entailment judgments on a random subset of 200 FOLIO validation examples. The model turns out to be perfectly coherent with respect to our negation axiom: we observe zero violations in our sample. At the same time, its task performance is poor, with multi-valued accuracy of only 33%. In practice, the model almost always answers “no” to both the conclusion and its negation, thereby avoiding contradictions at the price of being largely uninformative. Our results highlight a simple but important point: logical coherence and reasoning competence are distinct properties. An LLM can be perfectly consistent with a basic logical axiom while still failing to make useful logical commitments. 
We argue that axiomatic consistency metrics such as ours offer a complementary lens on LLM behaviour, and we outline how the same framework can be extended to richer logical constraints and stronger models.
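The negation axiom described above lends itself to a very small implementation. The sketch below is illustrative only (not the authors' code): it assumes a per-instance violation score of the form max(0, p(φ) + p(¬φ) − 1), which is the natural slack in the axiom, and an aggregate metric that averages violations over instances.

```python
# Illustrative sketch, not the paper's implementation.
# Assumption: the per-instance violation score is the amount by which
# a belief pair exceeds the negation-coherence bound p(phi) + p(~phi) <= 1.

def violation(p_phi: float, p_neg_phi: float) -> float:
    """Slack by which a belief pair breaks p(phi) + p(~phi) <= 1."""
    return max(0.0, p_phi + p_neg_phi - 1.0)

def aggregate_consistency(pairs):
    """1 minus the mean violation over (p(phi), p(~phi)) pairs."""
    if not pairs:
        return 1.0
    return 1.0 - sum(violation(p, q) for p, q in pairs) / len(pairs)

# A model that answers "no" to both a conclusion and its negation
# (low p for each) never violates the axiom, mirroring the TinyLlama
# behaviour reported in the abstract: coherent, but uninformative.
pairs = [(0.1, 0.1), (0.2, 0.05), (0.3, 0.4)]
print(aggregate_consistency(pairs))  # 1.0: no pair sums above 1
```

Note that perfect consistency under this metric says nothing about accuracy: the score rewards avoiding contradictions, which a trivially deflationary model achieves by design.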
Submission Number: 81