Answer-Set Consistency of LLMs for Question Answering

ICLR 2026 Conference Submission 21668 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLMs, Question Answering, Consistency
TL;DR: We identify answer-set inconsistency in LLMs, provide a benchmark to evaluate this phenomenon, and propose mitigation strategies.
Abstract: Large Language Models (LLMs) sometimes contradict themselves when answering factual questions, especially when asked to enumerate all entities that satisfy the question. We formalize such self-contradiction as answer-set inconsistency: given two enumeration questions whose answers satisfy a set-theoretic relation (equivalence, disjointness, containment, etc.), the LLM generates responses that violate the relation. To diagnose this phenomenon, we create a benchmark dataset comprising tuples of enumeration questions over which a variety of set-theoretic relations hold, and propose metrics to quantify answer-set inconsistency. Our evaluation of several state-of-the-art LLMs reveals pervasive inconsistency across models, even in cases where the LLM can identify the correct relation. This leads us to analyze potential causes and to propose mitigation strategies in which the LLM is prompted to reason about such relations before answering, which improves answer-set consistency. This work thus provides both a benchmark and a systematic approach for evaluating, explaining, and addressing answer-set inconsistency in LLM question answering, yielding practical insights for improving the reliability of LLMs.
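To make the notion concrete, here is a minimal sketch (an assumption for illustration, not the authors' benchmark code) of how one might check whether two LLM-generated answer sets violate a declared set-theoretic relation; the function name, string normalization, and relation labels are hypothetical:

```python
# Minimal sketch (assumed, not the paper's implementation): test whether two
# answer sets produced by an LLM violate a declared set-theoretic relation.
from typing import Set


def violates_relation(a: Set[str], b: Set[str], relation: str) -> bool:
    """Return True if answer sets `a` and `b` violate `relation`.

    `relation` is one of the relations named in the abstract:
    "equivalence" (a == b), "disjointness" (a ∩ b == ∅),
    or "containment" (a ⊆ b).
    """
    # Naive normalization; a real benchmark would need proper answer
    # matching (aliases, entity linking), which is assumed away here.
    a = {x.strip().lower() for x in a}
    b = {x.strip().lower() for x in b}
    if relation == "equivalence":
        return a != b
    if relation == "disjointness":
        return bool(a & b)
    if relation == "containment":
        return not a <= b
    raise ValueError(f"unknown relation: {relation}")


# Hypothetical usage: the first question's answers should be contained in
# the second's; a missing entity signals answer-set inconsistency.
print(violates_relation({"France", "Italy"}, {"France", "Germany"}, "containment"))  # True
```

A per-model inconsistency metric could then be defined as the fraction of question pairs in the benchmark for which such a check fails, though the paper's actual metrics may differ.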
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21668