StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

17 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: VQA, KVQA, Multimodal Large Language Model, MLLM, VLM, Visual Language Model, Knowledge-based Visual Question Answering, Reasoning, Self-Distillation, Implicit Knowledge, Visual Question Answering
Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, *IK-KVQA*, where a multimodal large language model (MLLM) is the sole knowledge source and no external retrieval is used. However, MLLMs lack explicit reasoning supervision, produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present **StaR-KVQA** (*Structured Reasoning Traces for IK-KVQA*), which supervises structured traces (dual symbolic relation paths plus path-grounded natural-language explanations) so that reasoning becomes transparent and verifiable. Using a single open-source MLLM, **StaR-KVQA** constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, **StaR-KVQA** improves both accuracy and interpretability, achieving up to **+11.3%** higher answer accuracy on OK-VQA than the strongest baseline while exhibiting robust cross-domain generalization.
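The training recipe summarized above (offline construction and selection of path-grounded traces, then structured self-distillation on the trace-enriched data) can be pictured with the minimal Python sketch below. It is an illustration of the described pipeline, not the authors' implementation; all names (`VQAExample`, `ReasoningTrace`, `generate_trace`, `build_trace_enriched_dataset`, `num_samples`) are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's code) of the StaR-KVQA recipe as described
# in the abstract: (1) a single MLLM generates candidate structured traces offline
# (symbolic relation paths plus a path-grounded explanation), (2) traces are kept
# only if their final answer matches the gold answer, and (3) the surviving traces
# become fine-tuning targets for the same MLLM (structured self-distillation).
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class VQAExample:
    image_path: str
    question: str
    gold_answer: str


@dataclass
class ReasoningTrace:
    relation_paths: List[str]   # symbolic paths, e.g. "Eiffel Tower -> located_in -> Paris"
    explanation: str            # natural-language rationale grounded in the paths
    predicted_answer: str


def build_trace_enriched_dataset(
    examples: List[VQAExample],
    generate_trace: Callable[[VQAExample], ReasoningTrace],
    num_samples: int = 4,
) -> List[dict]:
    """Offline stage: sample candidate traces from the MLLM, keep answer-consistent ones."""
    enriched = []
    for ex in examples:
        selected: Optional[ReasoningTrace] = None
        for _ in range(num_samples):
            trace = generate_trace(ex)
            # Selection criterion: the trace must terminate in the gold answer.
            if trace.predicted_answer.strip().lower() == ex.gold_answer.strip().lower():
                selected = trace
                break
        if selected is not None:
            # Serialize paths, explanation, and answer into one structured SFT target.
            target = (
                "Paths: " + " | ".join(selected.relation_paths)
                + "\nExplanation: " + selected.explanation
                + "\nAnswer: " + ex.gold_answer
            )
            enriched.append({"image": ex.image_path, "question": ex.question, "target": target})
    return enriched
```

Under these assumptions, the enriched records are used as ordinary SFT targets for the same MLLM, so inference remains a single autoregressive pass that emits paths, explanation, and answer, with no retriever or verifier in the loop.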
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8963