Motivation-Aware Question Decomposition: An Approach to Debugging Reasoning in Vision-Language Models

15 Sept 2025 (modified: 17 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Computer Vision, Visual Question Answering
Abstract: Vision-Language Models (VLMs) have achieved impressive results in Visual Question Answering (VQA), yet they remain prone to hallucination -- generating plausible but visually unsupported answers. Existing approaches mitigate hallucination by introducing multi-turn VQA, where the model answers intermediate sub-questions or follows step-by-step reasoning. In this work, we explore a fundamental and underexamined question: where should a model look during multi-turn VQA when a visual inventory is available? We propose Reflection with Visual Inventory (RVI), a cognitively inspired framework that structures visual reasoning through iterative question decomposition and localized image inspection. Rather than treating the image as a single static input, RVI builds and maintains a Visual Inventory -- a dynamic collection of semantically relevant image crops that directs attention and supports answer verification throughout the reasoning process. At each step, the system poses a binary sufficiency query to determine whether the current sub-question can be resolved using the existing inventory. If sufficiency fails, the model reflects by updating the inventory, emulating human-like visual reasoning and self-correction. By building visual grounding into each reasoning step, RVI moves beyond static or post-hoc grounding and provides clear, step-by-step supervision: errors such as poor decomposition or weak grounding become visible, yielding signals that help systematically debug VLM reasoning. We demonstrate that integrating RVI into multiple VLM architectures improves performance on VQA instances from the GQA and A-OKVQA datasets where baseline models fail, highlighting its effectiveness in reducing hallucinations and enhancing answer fidelity.
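The control flow the abstract describes -- decompose the question, pose a binary sufficiency query against the inventory, and reflect by grounding a new crop when sufficiency fails -- can be sketched as a simple loop. This is a minimal illustration only: the helper names (`decompose`, `sufficient`, `retrieve_crop`) are hypothetical stand-ins for VLM calls, not the authors' actual API, and the string-based placeholders below stand in for real model outputs and image crops.

```python
# Hedged sketch of the Reflection with Visual Inventory (RVI) loop from the
# abstract. All helpers are hypothetical placeholders, not the paper's code.

def decompose(question):
    # Placeholder decomposition: a VLM would generate sub-questions;
    # here we naively split on " and ".
    return [part.strip() for part in question.split(" and ")]

def sufficient(sub_q, inventory):
    # Placeholder binary sufficiency query: can sub_q be resolved from the
    # current inventory of crops? Here, crude keyword overlap stands in for
    # the model's yes/no judgment.
    return any(word in crop for crop in inventory for word in sub_q.split())

def retrieve_crop(sub_q, image):
    # Placeholder grounding step: return a crop of the image region
    # relevant to sub_q (a real system would localize and crop pixels).
    return f"crop({image}, {sub_q})"

def rvi_loop(question, image):
    """Run the RVI loop: decompose, check sufficiency, and reflect by
    growing the Visual Inventory when sufficiency fails. The returned
    trace exposes each step, which is what makes errors like poor
    decomposition or weak grounding visible for debugging."""
    inventory = []  # the dynamic Visual Inventory of image crops
    trace = []      # per-step record of (sub-question, inventory snapshot)
    for sub_q in decompose(question):
        if not sufficient(sub_q, inventory):
            # Reflection: sufficiency failed, so ground a new crop.
            inventory.append(retrieve_crop(sub_q, image))
        trace.append((sub_q, list(inventory)))
    return trace, inventory
```

The key design point mirrored here is that grounding happens inside the loop, per sub-question, rather than once up front or after the answer is produced; inspecting `trace` shows exactly which step forced an inventory update.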
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5372