Keywords: Knowledge-based VQA, Curriculum Learning, Hard Negative Mining, Adversarial Training
TL;DR: We propose Adv-CL, a KBVQA framework whose reranker is trained in a minimax game against a modulator that dynamically spotlights the most confusing negatives, enabling fine-grained distinction among noisy candidates and achieving state-of-the-art results.
Abstract: Knowledge-based Visual Question Answering (KBVQA) aims to answer image-related questions by retrieving relevant facts from an external knowledge base, making the accuracy of knowledge retrieval crucial.
However, a dominant bottleneck in existing systems is that inaccurate facts are fed to the answer generator.
This issue stems from two key deficiencies: (i) an initial retrieval stage that relies on global visual features, often overlooking fine-grained evidence,
and (ii) a reranking stage that cannot reliably differentiate between confusing candidates, so the correct candidate is often ranked low.
To address this, we propose the **Adv**ersarial **C**urriculum **L**earning (**Adv-CL**) framework, which tackles these two challenges sequentially.
First, we design a Query-guided Multi-grained Recalling (QMR) strategy that leverages both global and query-guided local features to improve recall quality and to provide a diverse set of challenging negatives for reranker training.
Second, to enable exact distinction, we introduce an Adversarial Reranker Training (ART) paradigm, which compels the reranker to discern fine-grained differences among highly similar candidates.
ART employs a minimax game in which a modulator network acts as an adversary to the reranker, dynamically creating a curriculum of hard negatives by up-weighting the candidates that most confuse it. This forces the reranker to strengthen its discriminative capability.
In addition, we introduce a Guarded Answer Generation (GAG) mechanism to reduce the risk that retrieval failures exacerbate hallucination.
Extensive experiments on public knowledge-based VQA benchmarks show that our method achieves state-of-the-art performance, validating the effectiveness and synergy of broad recall and exact distinction.
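
As a rough illustration of how an adversary can up-weight the negatives that most confuse a reranker, the sketch below may help. It is not the paper's implementation: in place of a learned modulator network it substitutes a closed-form, entropy-regularized adversary (weights proportional to `exp(loss / tau)`), and the function names, the softplus pairwise loss, and the temperature `tau` are all assumptions made for illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def per_negative_losses(pos_score, neg_scores):
    # Assumed pairwise loss: large when a negative scores close to
    # or above the positive candidate.
    return [softplus(s - pos_score) for s in neg_scores]


def adversarial_weights(losses, tau=0.5):
    # Stand-in for the modulator (assumption): the maximizer of
    # sum_i w_i * loss_i + tau * entropy(w) over the simplex,
    # i.e. a softmax over the per-negative losses.
    m = max(losses)
    exps = [math.exp((l - m) / tau) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]


def weighted_loss(pos_score, neg_scores, tau=0.5):
    # Objective the reranker would minimize once the adversary
    # has chosen its weights.
    losses = per_negative_losses(pos_score, neg_scores)
    weights = adversarial_weights(losses, tau)
    total = sum(w * l for w, l in zip(weights, losses))
    return total, weights


if __name__ == "__main__":
    # One positive and three negatives of decreasing difficulty.
    loss, weights = weighted_loss(2.0, [1.9, 0.5, -1.0])
    print("weights:", weights)
    print("weighted loss:", loss)
```

Under these assumptions, the negative scored closest to the positive receives the largest weight, so the reranker's gradient concentrates on the hardest cases; lowering `tau` sharpens that focus, while raising it approaches uniform weighting.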
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8826