UniSONAR: Unified Source-Conditioned Attentive Retrieval for Knowledge-Based Visual Question Answering

ACL ARR 2026 January Submission 9707 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: knowledge-based visual question answering, retrieval-augmented generation, multimodal reranking, dual-source retrieval, cross-modal fusion, source-conditioned attention
Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires retrieving entity knowledge from external sources to answer questions that cannot be resolved from visual content alone. However, existing retrieval-augmented generation (RAG) systems rely on a single retrieval source and therefore suffer from the Single-Source Retrieval Bottleneck and Source-Specific Reranker Degradation. To address these challenges, we propose UniSONAR, a unified, lightweight framework that effectively processes candidates from heterogeneous retrieval sources. By combining dual-source coarse retrieval with a novel Source-Conditioned Attentive Fusion, UniSONAR achieves robust cross-source generalization and supports both entity-level and section-level retrieval. Furthermore, we introduce a hybrid training strategy that pairs contrastive learning with an auxiliary loss to enhance discriminative feature learning. Extensive experiments on E-VQA and InfoSeek demonstrate that UniSONAR achieves state-of-the-art performance. Code will be released.
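The paper's code has not yet been released, so the exact mechanism behind "Source-Conditioned Attentive Fusion" is not public. As a rough, hypothetical illustration of what conditioning candidates on their retrieval source could look like, the NumPy sketch below adds a learned per-source embedding to each candidate embedding before query-based attention. All function names, shapes, and the mechanism itself are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def source_conditioned_fusion(query, candidates, source_ids, source_emb):
    """Fuse candidate embeddings retrieved from heterogeneous sources.

    query:       (d,)   query embedding
    candidates:  (n, d) candidate embeddings from any mix of sources
    source_ids:  (n,)   integer source tag per candidate
                        (e.g. 0 = entity-level, 1 = section-level)
    source_emb:  (s, d) learned embedding per source

    Each candidate is shifted by its source embedding, so the attention
    scores can depend on where the candidate came from, then the query
    attends over the conditioned candidates.
    """
    conditioned = candidates + source_emb[source_ids]   # (n, d)
    scores = conditioned @ query                        # (n,)
    weights = softmax(scores / np.sqrt(len(query)))     # (n,) sums to 1
    fused = weights @ conditioned                       # (d,)
    return fused, weights

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)
candidates = rng.normal(size=(5, d))
source_ids = np.array([0, 0, 1, 1, 1])
source_emb = 0.1 * rng.normal(size=(2, d))
fused, weights = source_conditioned_fusion(query, candidates, source_ids, source_emb)
```

In a trained model the source embeddings would be learned jointly with the reranker, letting one scoring head stay calibrated across entity-level and section-level candidates instead of degrading on sources it was not tuned for.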
Paper Type: Long
Research Area: Retrieval-Augmented Language Models
Research Area Keywords: retrieval-augmented generation, knowledge base QA, re-ranking, multimodal QA, vision question answering, passage retrieval
Languages Studied: English
Submission Number: 9707