Do Large Language Models Defend Their Beliefs Consistently?

Published: 06 Oct 2025, Last Modified: 04 Nov 2025 · MTI-LLM @ NeurIPS 2025 Poster · CC BY-ND 4.0
Keywords: LLM, Large Language Model, Calibration, Confidence, Consistency, Multi-Turn, Beliefs
TL;DR: We investigate whether LLMs defend their statements in line with their confidence in those statements. We find that LLMs do so to a moderate degree, but with significant variability across models and datasets.
Abstract: When large language models (LLMs) are challenged on a response, they may defer to the user or uphold that response. Some models may be more deferential, while others may be more stubborn in defense of their beliefs. The 'appropriate' level of belief defense depends on the task and user preferences, but it is nonetheless desirable that a model behave *consistently* in this respect. In particular, when a model has high confidence in its answer, it should not defer more often than when its confidence is lower; and this should hold regardless of the model's overall tendency towards deference. We refer to a model that acts in this manner as *belief-consistent*, and we carry out the first detailed study of belief-consistency in modern LLMs. We find that models are generally moderately belief-consistent, but with significant variability across tasks and models. We also show that belief-consistency is only weakly related to task performance and model calibration, indicating that it is a distinct aspect of model behavior. We build on this insight to investigate targeted approaches for improving belief-consistency through prompting and activation steering, finding that the latter in particular achieves significant improvements.
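The abstract does not specify how belief-consistency is measured. As a loose illustration only (not the paper's metric), one simple proxy is the rank correlation between a model's stated confidence in an answer and whether it upholds that answer when challenged; the variable names and toy numbers below are assumptions for the sketch.

```python
# Illustrative sketch of a belief-consistency proxy (assumed, not from the paper):
# rank correlation between per-question confidence and whether the model
# defended its original answer after a user challenge.
from scipy.stats import spearmanr

# Hypothetical per-question records (toy data):
confidences = [0.95, 0.80, 0.62, 0.55, 0.30, 0.20]  # model's stated confidence
defended    = [1,    1,    1,    0,    0,    1]      # 1 = upheld answer, 0 = deferred

rho, p_value = spearmanr(confidences, defended)
print(f"Belief-consistency proxy (Spearman rho): {rho:.2f} (p = {p_value:.2f})")
```

Under this reading, a higher correlation indicates that the model defers more when it is less confident, independently of how often it defers overall.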
Submission Number: 134