Track: long paper (up to 10 pages)
Keywords: logical reasoning evaluation, commitment-aware coherence, negation-consistency violation, abstention and coverage, FOLIO benchmark
TL;DR: Coherence checks can be passed simply by abstaining; reporting a commitment metric alongside negation violations reveals an abstention–contradiction frontier on FOLIO v0.0 (204 examples).
Abstract: Large language models (LLMs) are increasingly used for logical tasks, yet they
frequently exhibit contradictions across closely related queries. A natural response
is to measure logical coherence by checking axioms such as negation
consistency. However, we show that coherence can be vacuous: a model can
appear consistent by refusing to commit to either a statement or its negation. We
propose commitment-aware axiomatic coherence, a lightweight evaluation protocol
that complements a standard negation-coherence check with a commitment
score measuring how much probability mass the model assigns to entailed vs.
refuted outcomes (as opposed to abstention/uncertainty). Using a deterministic
log-probability elicitation procedure (YES/NO) and a simple 3-way decision rule
(True/False/Uncertain), we evaluate four open LLMs on the public FOLIO v0.0
validation split. Results reveal a clear frontier: some models achieve low contradiction
rates primarily by abstaining (low coverage), while others achieve high
coverage at the cost of pervasive negation-coherence violations. Our findings
argue that reliable logical reasoning evaluation requires reporting both coherence
and non-vacuous commitment, not coherence alone. The project is available at
https://meherabb.github.io/Commitment/
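The protocol outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact decision rule or the commitment-score formula, so the threshold `tau`, the function names, and the use of coverage (the fraction of non-Uncertain decisions) as a stand-in for commitment are all assumptions made for illustration.

```python
import math

def p_yes(logp_yes: float, logp_no: float) -> float:
    """Renormalize YES/NO log-probabilities into P(YES) (assumed elicitation)."""
    m = max(logp_yes, logp_no)  # subtract the max for numerical stability
    e_yes = math.exp(logp_yes - m)
    e_no = math.exp(logp_no - m)
    return e_yes / (e_yes + e_no)

def decide(p_s: float, p_neg: float, tau: float = 0.5) -> str:
    """Hypothetical 3-way rule from P(YES | S) and P(YES | not S)."""
    affirm_s = p_s > tau
    affirm_neg = p_neg > tau
    if affirm_s and not affirm_neg:
        return "True"
    if affirm_neg and not affirm_s:
        return "False"
    return "Uncertain"  # both or neither affirmed: no usable commitment

def violates_negation(p_s: float, p_neg: float, tau: float = 0.5) -> bool:
    """Negation violation (assumed form): statement and negation both affirmed."""
    return p_s > tau and p_neg > tau

def summarize(pairs, tau: float = 0.5) -> dict:
    """Report coverage and violation rate together over (p_s, p_neg) pairs."""
    n = len(pairs)
    labels = [decide(p_s, p_neg, tau) for p_s, p_neg in pairs]
    coverage = sum(label != "Uncertain" for label in labels) / n
    violation_rate = sum(violates_negation(p_s, p_neg, tau) for p_s, p_neg in pairs) / n
    return {"coverage": coverage, "violation_rate": violation_rate}

# Toy inputs (made up, not FOLIO results): one pair per possible behaviour.
toy_pairs = [(0.9, 0.1), (0.8, 0.9), (0.2, 0.3), (0.1, 0.95)]
summary = summarize(toy_pairs)
```

In this toy setting, a model that affirms neither the statement nor its negation never triggers a violation but also contributes nothing to coverage, which is why the two numbers are reported side by side.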
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 140