Keywords: logical consistency, negation coherence, large language models, FOLIO benchmark, axiomatic evaluation
TL;DR: LLMs can satisfy a simple negation axiom on FOLIO yet still reason poorly; our TinyLlama study shows coherence and competence come apart.
Abstract: Large language models (LLMs) have quickly become the default
tool for a wide range of NLP tasks, yet their logical
behaviour is still poorly understood. Most existing evaluations
focus on benchmark task accuracy, without asking
whether a model’s internal “beliefs” about statements are
even coherent under basic logical principles. In this work, we
take a small but concrete step in that direction.
We view an LLM as a black-box function that maps a logical
formula φ to a number p(φ) ∈ [0, 1] that we interpret as the
model’s degree of belief that φ is true. Based on this view, we
introduce a simple axiomatic framework that specifies how
these degrees of belief should behave if they are to resemble
a classical probability measure. We focus on one particularly
transparent constraint: a negation-coherence axiom requiring
p(φ)+p(¬φ) ≤ 1 for every formula φ. From this axiom we
derive a per-instance violation score and an aggregate consistency
metric.
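The violation score and aggregate metric described above can be sketched as follows; the exact functional forms here (a hinge-style per-instance penalty and its mean) are assumptions for illustration, not necessarily the paper's precise definitions.

```python
def violation(p_phi: float, p_not_phi: float) -> float:
    """Per-instance violation of the negation axiom p(phi) + p(not phi) <= 1.

    Assumed hinge form: zero when the axiom holds, otherwise the excess mass.
    """
    return max(0.0, p_phi + p_not_phi - 1.0)


def consistency(pairs) -> float:
    """Aggregate consistency over (p(phi), p(not phi)) pairs.

    Assumed to be 1 minus the mean per-instance violation, so a perfectly
    coherent model scores 1.0.
    """
    scores = [violation(p, q) for p, q in pairs]
    return 1.0 - sum(scores) / len(scores)


# Three hypothetical belief pairs: the second one violates the axiom.
pairs = [(0.7, 0.2), (0.6, 0.6), (0.1, 0.1)]
print(violation(0.6, 0.6))   # excess mass of about 0.2
print(consistency(pairs))    # slightly below 1.0
```

A model that observes zero violations, as in the experiment below, would score exactly 1.0 under this aggregate.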
To make this concrete, we instantiate the framework on FOLIO,
a first-order logic reasoning benchmark expressed in
natural language. Using a small open-source chat model,
TinyLlama-1.1B-Chat, we estimate p(φ) and p(¬φ) from
yes/no entailment judgments on a random subset of 200 FOLIO
validation examples. The model turns out to be perfectly
coherent with respect to our negation axiom: we observe zero
violations in our sample. At the same time, its task performance
is poor, with multi-valued accuracy of only 33%. In
practice, the model almost always answers “no” to both the
conclusion and its negation, thereby avoiding contradictions
at the price of being largely uninformative.
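One way to estimate such beliefs from yes/no judgments, and to see why "no to everything" is trivially coherent, is to renormalise the model's logits over just the two answer tokens. This is a minimal sketch of an assumed estimator, not the paper's exact procedure; the logit values are hypothetical.

```python
import math


def belief_from_logits(yes_logit: float, no_logit: float) -> float:
    """Softmax over only the 'yes'/'no' answer logits, returning p('yes')."""
    e_yes, e_no = math.exp(yes_logit), math.exp(no_logit)
    return e_yes / (e_yes + e_no)


# A model that leans "no" on both a conclusion and its negation
# stays coherent under the negation axiom without committing to anything.
p_phi = belief_from_logits(-2.0, 1.0)      # low belief in phi
p_not_phi = belief_from_logits(-1.5, 1.2)  # low belief in not-phi
print(p_phi + p_not_phi)  # well below 1, so no violation is possible
```

In practice one would read the two logits from the model's next-token distribution after an entailment prompt; the point of the sketch is that near-zero beliefs on both sides satisfy p(φ)+p(¬φ) ≤ 1 vacuously.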
Our results highlight a simple but important point: logical coherence
and reasoning competence are distinct properties. An
LLM can be perfectly consistent with a basic logical axiom
while still failing to make useful logical commitments. We
argue that axiomatic consistency metrics such as ours offer a
complementary lens on LLM behaviour, and we outline how
the same framework can be extended to richer logical constraints
and stronger models.
Submission Number: 81