Retrieving Visual Facts For Few-Shot Visual Question Answering

Anonymous

08 Mar 2022 (modified: 05 May 2023) · NAACL 2022 Conference Blind Submission
Paper Link: https://openreview.net/forum?id=nQuO_vNMGlf
Paper Type: Short paper (up to four pages of content + unlimited references and appendices)
Abstract: We introduce the Retrieving Visual Facts (RVF) framework for few-shot visual question answering (VQA). The RVF framework represents an image as a set of natural language facts; in practice, for example, these could be tags from an object detector. Critically, the question is used to retrieve $\textit{relevant}$ facts: an image may contain numerous details, and one should attend to the few that may be useful for the question. Finally, one predicts the answer from the retrieved facts and the question, e.g., by prompting a language model as we do here. Compared to PICa (Yang et al., 2021), the previous state of the art in few-shot VQA, a proof-of-concept RVF implementation improves absolute performance by 2.6% and 1.5% on the VQAv2 (Goyal et al., 2017) and OK-VQA (Marino et al., 2019) datasets, respectively. We also analyze our implementation's strengths and weaknesses on various question types, highlighting directions for further study.
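
The abstract outlines a three-step pipeline: represent the image as natural language facts, retrieve the facts relevant to the question, and predict the answer by prompting a language model with both. The following is a minimal, self-contained sketch of that pipeline under stated assumptions; the bag-of-words retriever, the top-k cutoff, and the prompt format are illustrative choices, not the paper's implementation, and the actual language-model call is left as a final step.

```python
# Minimal sketch of the RVF pipeline described in the abstract.
# The fact source, the retrieval scorer, and the prompt format are
# all assumptions for illustration; the paper does not fix them here.

from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_facts(question: str, facts: list[str], k: int = 3) -> list[str]:
    """Score each fact against the question and keep the top-k.

    A stand-in retriever: an RVF implementation could instead use
    dense embeddings or any other question-fact relevance model.
    """
    q = Counter(question.lower().split())
    ranked = sorted(
        facts,
        key=lambda f: cosine(q, Counter(f.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, facts: list[str]) -> str:
    """Assemble the retrieved facts and the question into an LM prompt."""
    context = "\n".join(f"- {f}" for f in facts)
    return f"Facts about the image:\n{context}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Facts such as these could come from an object detector's tags.
    facts = [
        "a brown dog is lying on a couch",
        "a red frisbee is on the floor",
        "the walls are painted blue",
        "a window shows a sunny day outside",
    ]
    question = "What color is the frisbee?"
    prompt = build_prompt(question, retrieve_facts(question, facts))
    print(prompt)  # feed this prompt to a language model for the answer
```

Retrieval before prompting matters because an image may yield far more facts than fit in a few-shot prompt; filtering by the question keeps only the handful likely to bear on the answer.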