Abstract: In this paper, we propose a multimodal contrastive network with unbiased distillation (MCUD) for knowledge-based VQA. MCUD consists of three modules: contrastive sample construction (CSC), unbiased contrastive distillation (UCD), and hierarchical reasoning (HR). Specifically, CSC constructs contrastive samples by transforming the knowledge corpus and uses entropy-adjusted answer frequencies to identify the unbiased samples. UCD employs a dual-branch feature extractor to separately encode the knowledge corpus and the image-question pairs into a shared embedding space. Knowledge-driven contrastive learning then bridges the modality gap between the knowledge corpus and the cross-modal image-question pairs. Furthermore, the teacher model in UCD applies different distillation strategies to biased and unbiased samples, guiding the student model toward a generalized, unbiased representation. Finally, the HR module produces chain-of-thought outputs: it sequentially locates the contextual sentences in the knowledge corpus, generates the rationale, and infers the answer. Extensive experiments on two datasets, E-VQA and ScienceQA, demonstrate the effectiveness and superior performance of our method.
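To make the idea of aligning knowledge and image-question embeddings in a shared space concrete, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss. The abstract does not specify the loss used by UCD, so the InfoNCE form, the function names `l2_normalize` and `info_nce`, and the `temperature` value are all illustrative assumptions, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(knowledge, pairs, temperature=0.1):
    """Symmetric InfoNCE loss (assumed form, not taken from the paper).

    Row i of `knowledge` (knowledge-branch embedding) is treated as the
    positive for row i of `pairs` (image-question-branch embedding);
    all other rows in the batch act as negatives.
    """
    k = [l2_normalize(v) for v in knowledge]
    q = [l2_normalize(v) for v in pairs]
    n = len(k)
    # Temperature-scaled cosine similarities between the two branches.
    logits = [[sum(a * b for a, b in zip(k[i], q[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the diagonal entry as the target class.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the knowledge-to-pair and pair-to-knowledge directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy embeddings (hypothetical values): aligned pairs yield a lower loss.
knowledge = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(knowledge, aligned)
loss_shuffled = info_nce(knowledge, shuffled)
```

In this sketch the loss is small when each knowledge embedding is closest to its own image-question embedding and large when the pairing is mismatched, which is the behavior a modality-bridging contrastive objective is meant to reward.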