UniSONAR: Unified Source-Conditioned Attentive Retrieval for Knowledge-Based Visual Question Answering

ACL ARR 2026 January Submission 9707 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: knowledge-based visual question answering, retrieval-augmented generation, multimodal reranking, dual-source retrieval, cross-modal fusion, source-conditioned attention
Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires retrieving entity knowledge from external sources to answer questions that cannot be resolved from visual content alone. However, existing retrieval-augmented generation (RAG) systems rely on a single retrieval source and therefore suffer from the Single-Source Retrieval Bottleneck and Source-Specific Reranker Degradation. To address these challenges, we propose UniSONAR, a unified, lightweight framework that effectively processes candidates from heterogeneous retrieval sources. By combining dual-source coarse retrieval with a novel Source-Conditioned Attentive Fusion, UniSONAR achieves robust cross-source generalization and supports both entity-level and section-level retrieval. Furthermore, we introduce a hybrid training strategy that pairs contrastive learning with an auxiliary loss to enhance discriminative feature learning. Extensive experiments on E-VQA and InfoSeek demonstrate that UniSONAR achieves state-of-the-art performance. Code will be released.
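The paper's code has not yet been released, so the exact mechanism behind "Source-Conditioned Attentive Fusion" is not public. As a rough, hypothetical illustration of what conditioning candidates on their retrieval source could look like, the NumPy sketch below adds a learned per-source embedding to each candidate embedding before query-based attention. All function names, shapes, and the mechanism itself are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def source_conditioned_fusion(query, candidates, source_ids, source_emb):
    """Fuse candidate embeddings retrieved from heterogeneous sources.

    query:       (d,)   query embedding
    candidates:  (n, d) candidate embeddings from any mix of sources
    source_ids:  (n,)   integer source tag per candidate
                        (e.g. 0 = entity-level, 1 = section-level)
    source_emb:  (s, d) learned embedding per source

    Each candidate is shifted by its source embedding, so the attention
    scores can depend on where the candidate came from, then the query
    attends over the conditioned candidates.
    """
    conditioned = candidates + source_emb[source_ids]   # (n, d)
    scores = conditioned @ query                        # (n,)
    weights = softmax(scores / np.sqrt(len(query)))     # (n,) sums to 1
    fused = weights @ conditioned                       # (d,)
    return fused, weights

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)
candidates = rng.normal(size=(5, d))
source_ids = np.array([0, 0, 1, 1, 1])
source_emb = 0.1 * rng.normal(size=(2, d))
fused, weights = source_conditioned_fusion(query, candidates, source_ids, source_emb)
```

In a trained model the source embeddings would be learned jointly with the reranker, letting one scoring head stay calibrated across entity-level and section-level candidates instead of degrading on sources it was not tuned for.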
Paper Type: Long
Research Area: Retrieval-Augmented Language Models
Research Area Keywords: retrieval-augmented generation, knowledge base QA, re-ranking, multimodal QA, vision question answering, passage retrieval
Languages Studied: English
Submission Number: 9707