Prompting visual dialog with implicit logical knowledge

Published: 2025, Last Modified: 11 Feb 2026 · Knowl. Inf. Syst. 2025 · CC BY-SA 4.0
Abstract: Visual dialog models that use additional knowledge can naturally leverage reasoning and generalization abilities to infer better answers. Traditional visual dialog models rely on extracting explicit, symbolic knowledge from entity-centric knowledge bases. In real-world scenarios, however, accurate answers also require implicit logical reasoning about the events, knowledge, and entity states involved in the questions, a capability that demands a large reserve of implicit knowledge. Thus, in this paper, we propose to prompt visual dialog with implicit logical knowledge from both the data and model perspectives. On the data side, we augment both implicit and explicit knowledge through a Chain-of-Thought strategy based on Large Language Models. Leveraging this knowledge-augmented data, we design a novel Dual-Stream Debiasing Network (DSDN) and apply contrastive decoding, which searches for strings that maximize a weighted difference in likelihood between the stronger knowledge-based modules and the weaker amateur modules, thereby mitigating the impact of undesirable language biases. Experimental results and analyses on the VisDial v1.0 dataset demonstrate the superiority of our proposed model. The code will be available soon.
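The contrastive-decoding step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the vocabulary, probabilities, and the `alpha` weight are hypothetical, and the expert/amateur distributions stand in for the knowledge-based and amateur modules of DSDN.

```python
import numpy as np

def contrastive_scores(expert_logprobs, amateur_logprobs, alpha=1.0):
    """Score each candidate token by a weighted difference of log-likelihoods:
    expert (knowledge-based) log-probability minus alpha times the amateur
    module's log-probability. Tokens the amateur model favors purely from
    language priors are penalized, mitigating language bias."""
    return expert_logprobs - alpha * amateur_logprobs

# Hypothetical next-token distributions over a 4-token vocabulary.
expert = np.log(np.array([0.50, 0.30, 0.15, 0.05]))   # knowledge-based module
amateur = np.log(np.array([0.60, 0.10, 0.15, 0.15]))  # bias-prone amateur module

# The amateur's top choice (index 0) is down-weighted; the token where the
# expert most outperforms the amateur (index 1) wins.
best = int(np.argmax(contrastive_scores(expert, amateur, alpha=1.0)))
```

In a full decoder this score would be applied at every generation step, typically restricted to tokens the expert already deems plausible.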