Do Vision-Language Models Revise Beliefs or Just Rationalize? Evidence Update Prompting for Non-Monotonic Visual Reasoning

Published: 25 Mar 2026, Last Modified: 28 May 2026
Venue: CVPR 2026 Workshop CogVL Poster
Readers: Everyone
License: CC BY 4.0
Track: Track 2: Papers without Workshop Proceedings
Keywords: vision-language models, belief revision, non-monotonic reasoning, defeasible reasoning, prompting strategies, VLM evaluation, anchoring bias, confidence calibration, abductive reasoning, cognitive science
TL;DR: VLMs fail to revise 37-62% of wrong initial answers when given contradicting evidence; Belief-State prompting with explicit hypothesis tracking reduces this stubbornness by 13-18 percentage points.
Abstract: When new visual evidence contradicts an initial interpretation, do vision-language models (VLMs) genuinely revise their beliefs, or do they merely rationalize their first guess? We introduce Evidence Update Prompting (EUP), a two-phase evaluation protocol inspired by defeasible and non-monotonic reasoning from cognitive science. In Phase A, a model receives limited pre-event evidence and forms an initial hypothesis; in Phase B, additional post-event evidence arrives that often requires the model to revise that hypothesis. We compare three prompting strategies: Baseline (standard answer), Belief-State (explicit hypothesis tracking with confidence), and Counterfactual Update ("would your answer differ without the new evidence?"). Each is evaluated across three frontier VLMs (GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet) on 52 BlackSwan-style scenarios requiring abductive reasoning about surprising events. Our findings reveal that (i) all models exhibit substantial stubbornness: 37-62% of initially incorrect answers are never revised despite conflicting evidence; (ii) Belief-State prompting reduces stubbornness by 13-18 percentage points and increases accuracy by 4-8 percentage points over baseline; (iii) Counterfactual prompting helps models recognize when evidence matters (59-63% say "yes, my answer would differ") but produces only modest behavioral change; and (iv) models display striking confidence inflation in Phase B, with high-confidence predictions rising 2-3× regardless of whether the answer actually changed. These results establish that current VLMs lack genuine belief revision mechanisms and instead engage in post-hoc rationalization, pointing toward architectures with explicit epistemic state tracking as a path forward.
Submission Number: 9
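The stubbornness and confidence-inflation measures described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The `Trial` record, its field names, the 0.8 high-confidence threshold, and the toy data are all hypothetical. The sketch also assumes that an answer counts as "revised" whenever the Phase B answer differs from the Phase A answer, even if the new answer is still wrong; the abstract does not specify this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    """One scenario under one prompting strategy (hypothetical record layout)."""
    gold: str       # correct answer
    phase_a: str    # answer given with pre-event evidence only
    phase_b: str    # answer given after post-event evidence arrives
    conf_a: float   # self-reported confidence in Phase A, in [0, 1]
    conf_b: float   # self-reported confidence in Phase B, in [0, 1]


def stubbornness_rate(trials: List[Trial]) -> float:
    """Fraction of initially incorrect answers left unchanged in Phase B."""
    wrong_first = [t for t in trials if t.phase_a != t.gold]
    if not wrong_first:
        return 0.0
    unrevised = sum(1 for t in wrong_first if t.phase_b == t.phase_a)
    return unrevised / len(wrong_first)


def phase_b_accuracy(trials: List[Trial]) -> float:
    """Fraction of trials whose final (Phase B) answer is correct."""
    return sum(1 for t in trials if t.phase_b == t.gold) / len(trials)


def high_conf_ratio(trials: List[Trial], threshold: float = 0.8) -> Optional[float]:
    """Ratio of high-confidence predictions in Phase B to Phase A.

    The threshold is an assumed cutoff. Returns None when Phase A has no
    high-confidence predictions, since the ratio is undefined.
    """
    high_a = sum(1 for t in trials if t.conf_a >= threshold)
    high_b = sum(1 for t in trials if t.conf_b >= threshold)
    return high_b / high_a if high_a else None


# Toy data for illustration only; these are not results from the paper.
trials = [
    Trial("B", "A", "A", 0.6, 0.9),   # wrong at first, never revised
    Trial("B", "A", "B", 0.5, 0.9),   # wrong at first, revised to correct
    Trial("C", "C", "C", 0.9, 0.95),  # correct throughout
    Trial("D", "A", "C", 0.4, 0.5),   # wrong at first, revised but still wrong
]

print(stubbornness_rate(trials))
print(phase_b_accuracy(trials))
print(high_conf_ratio(trials))
```

Computing each measure separately for Baseline, Belief-State, and Counterfactual Update trials would give the kind of per-strategy comparison the abstract reports, under the assumptions stated above.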