Track: tiny / short paper (up to 4 pages)
Keywords: logical consistency, configuration sensitivity, chain-of-thought, cross-query contradiction, evaluation methodology, abductive reasoning
TL;DR: Varying the system prompt or CoT setting between logically related queries roughly doubles contradiction rates versus a same-configuration baseline, with abductive reasoning most fragile and CoT amplifying the effect.
Abstract: Logical reasoning evaluations score responses independently under a single configuration. We ask whether answers to logically related questions remain mutually consistent when the system prompt or chain-of-thought elicitation varies between queries. We introduce a protocol that queries models on 120 question-pairs (deductive, inductive, abductive) under six configurations and checks answer-pairs for logical compatibility, reporting both a same-configuration baseline and a cross-configuration condition to isolate the perturbation effect. Across four models, cross-configuration per-check contradiction rates are roughly double same-configuration baselines ($p < 0.001$ pooled, $\chi^2$ test), confirming that configuration changes induce contradictions beyond those attributable to intrinsic model inconsistency. Abductive pairs are most fragile. Chain-of-thought prompting reduces deductive contradictions but increases abductive ones: a decomposition shows that CoT both worsens abductive consistency within a fixed configuration and makes it more sensitive to configuration changes. We argue that a model's logical commitments should not shift with surface-level configuration changes, and that cross-query consistency under perturbation is a missing axis in reasoning evaluation.
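For concreteness, the following is a minimal sketch of the pooled contradiction-rate comparison the abstract describes, assuming a standard scipy $\chi^2$ contingency test; the per-check counts below are invented purely for illustration and are not the paper's data:

    from scipy.stats import chi2_contingency

    # Hypothetical per-check counts, invented for illustration only
    # (the actual totals are not reported on this page).
    # Each row: [contradictory checks, consistent checks].
    same_config  = [180, 5820]   # baseline: same configuration for both queries
    cross_config = [360, 5640]   # perturbed: configuration varies between queries

    # 2x2 contingency test of contradiction rate vs. condition.
    chi2, p, dof, _ = chi2_contingency([same_config, cross_config])

    rate_same  = same_config[0] / sum(same_config)
    rate_cross = cross_config[0] / sum(cross_config)
    print(f"same-config contradiction rate:  {rate_same:.3f}")   # 0.030
    print(f"cross-config contradiction rate: {rate_cross:.3f}")  # 0.060 (roughly double)
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")        # p << 0.001

A two-condition design of this shape lets the cross-configuration rate be attributed to the perturbation itself rather than to intrinsic model inconsistency, which the same-configuration baseline already captures.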
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Submission Number: 199