StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

17 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: VQA, KVQA, Multimodal Large Language Model, MLLM, VLM, Visual Language Model, Knowledge-based Visual Question Answering, Reasoning, Self-Distillation, Implicit Knowledge, Visual Question Answering
Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, *IK-KVQA*, where a multimodal large language model (MLLM) is the sole knowledge source and no external retrieval is used. However, MLLMs lack explicit reasoning supervision, produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present **StaR-KVQA** (*Structured Reasoning Traces for IK-KVQA*), which supervises structured traces (dual symbolic relation paths plus path-grounded natural-language explanations) so that reasoning becomes transparent and verifiable. Using a single open-source MLLM, **StaR-KVQA** constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, **StaR-KVQA** improves both accuracy and interpretability, achieving up to **+11.3%** higher answer accuracy on OK-VQA than the strongest baseline while exhibiting robust cross-domain generalization.
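The training recipe summarized above (offline construction and selection of path-grounded traces, then structured self-distillation on the trace-enriched data) can be pictured with the minimal Python sketch below. It is an illustration of the described pipeline, not the authors' implementation; all names (`VQAExample`, `ReasoningTrace`, `generate_trace`, `build_trace_enriched_dataset`, `num_samples`) are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's code) of the StaR-KVQA recipe as described
# in the abstract: (1) a single MLLM generates candidate structured traces offline
# (symbolic relation paths plus a path-grounded explanation), (2) traces are kept
# only if their final answer matches the gold answer, and (3) the surviving traces
# become fine-tuning targets for the same MLLM (structured self-distillation).
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class VQAExample:
    image_path: str
    question: str
    gold_answer: str


@dataclass
class ReasoningTrace:
    relation_paths: List[str]   # symbolic paths, e.g. "Eiffel Tower -> located_in -> Paris"
    explanation: str            # natural-language rationale grounded in the paths
    predicted_answer: str


def build_trace_enriched_dataset(
    examples: List[VQAExample],
    generate_trace: Callable[[VQAExample], ReasoningTrace],
    num_samples: int = 4,
) -> List[dict]:
    """Offline stage: sample candidate traces from the MLLM, keep answer-consistent ones."""
    enriched = []
    for ex in examples:
        selected: Optional[ReasoningTrace] = None
        for _ in range(num_samples):
            trace = generate_trace(ex)
            # Selection criterion: the trace must terminate in the gold answer.
            if trace.predicted_answer.strip().lower() == ex.gold_answer.strip().lower():
                selected = trace
                break
        if selected is not None:
            # Serialize paths, explanation, and answer into one structured SFT target.
            target = (
                "Paths: " + " | ".join(selected.relation_paths)
                + "\nExplanation: " + selected.explanation
                + "\nAnswer: " + ex.gold_answer
            )
            enriched.append({"image": ex.image_path, "question": ex.question, "target": target})
    return enriched
```

Under these assumptions, the enriched records are used as ordinary SFT targets for the same MLLM, so inference remains a single autoregressive pass that emits paths, explanation, and answer, with no retriever or verifier in the loop.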
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8963