From Broad Recall to Exact Distinction: Adversarial Curriculum Learning for Knowledge-Based VQA

ICLR 2026 Conference Submission 8826 Authors

17 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Knowledge-based VQA, Curriculum Learning, Hard Negative Mining, Adversarial Training
TL;DR: We propose Adv-CL, a KBVQA framework whose adversarial reranker plays a minimax game with a modulator that dynamically spotlights the most confusing negatives, enabling fine-grained distinction among noisy candidates and achieving SOTA.
Abstract: Knowledge-based Visual Question Answering (KBVQA) aims to answer image-related questions by retrieving relevant facts from an external knowledge base, which makes retrieval accuracy crucial. A major bottleneck in existing systems is that inaccurate facts are fed to the answer generator. This issue stems from two deficiencies: (i) an initial retrieval stage that relies on global visual features and often overlooks fine-grained evidence, and (ii) a reranking stage that cannot differentiate between confusing candidates, so the correct candidate is ranked too low. To address this, we propose the **Adv**ersarial **C**urriculum **L**earning (**Adv-CL**) framework, which tackles these two challenges sequentially. First, we design a Query-guided Multi-grained Recalling (QMR) strategy that leverages both global and query-guided local features to improve recall quality and to provide a diverse set of challenging negatives for reranker training. Second, to enable exact distinction, we introduce an Adversarial Reranker Training (ART) paradigm that compels the reranker to discern fine-grained differences among highly similar candidates. ART employs a minimax game in which a modulator network acts as an adversary to the reranker, dynamically creating a curriculum of hard negatives by up-weighting the candidates that most confuse the reranker. This forces the reranker to strengthen its discriminative capability. Finally, we introduce a Guarded Answer Generation (GAG) mechanism to reduce the risk that retrieval failures exacerbate hallucination in the system. Extensive experiments on public knowledge-based VQA benchmarks show that our method achieves state-of-the-art performance, validating the effectiveness and synergy of broad recall and exact distinction.
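The up-weighting step in the minimax game described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract specifies a learned modulator network, whereas the sketch below swaps in a closed-form softmax over per-negative losses as a stand-in adversary. The pairwise logistic loss, the `temperature` parameter, and all function names are assumptions made for illustration only.

```python
import math

def pairwise_losses(pos_score, neg_scores):
    """Logistic pairwise loss log(1 + exp(s_neg - s_pos)) for each negative.

    Assumption: the reranker outputs one scalar score per candidate.
    """
    return [math.log1p(math.exp(s - pos_score)) for s in neg_scores]

def adversarial_weights(losses, temperature=1.0):
    """Stand-in for the modulator: softmax over the per-negative losses.

    Negatives that currently confuse the reranker most (highest loss)
    receive the largest weight. A lower temperature sharpens the focus.
    """
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_reranker_loss(pos_score, neg_scores, temperature=1.0):
    """Loss the reranker would minimize under the adversarial weighting."""
    losses = pairwise_losses(pos_score, neg_scores)
    weights = adversarial_weights(losses, temperature)
    loss = sum(w * l for w, l in zip(weights, losses))
    return loss, weights

if __name__ == "__main__":
    # One positive candidate and three negatives; the first negative
    # scores almost as high as the positive, so it is the hardest one.
    loss, weights = weighted_reranker_loss(2.0, [1.9, 0.5, -1.0])
    print("weights:", [round(w, 3) for w in weights])
    print("weighted loss:", round(loss, 3))
```

In this simplified setting, the adversary's weighting never yields a lower loss than uniform averaging over the negatives, so training effort shifts toward the candidates the reranker currently confuses with the positive. As the reranker improves on them, their losses fall and the weights move to other negatives, which is one way to read the "curriculum" in the abstract.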
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8826