Retrieving Visual Facts For Few-Shot Visual Question Answering

Anonymous

16 Jan 2022 (modified: 05 May 2023) · ACL ARR 2022 January Blind Submission
Abstract: We introduce the Retrieving Visual Facts (RVF) framework for few-shot visual question answering (VQA). The RVF framework represents an image as a set of natural language facts; in practice, for example, these could be tags from an object detector. Critically, the question is used to retrieve $\textit{relevant}$ facts: an image may contain numerous details, and one should attend to the few that may be useful for the question. Finally, one predicts the answer from the retrieved facts and the question, e.g., by prompting a language model as we do here. Compared to PICa (Yang et al., 2021), the previous state of the art in few-shot VQA, a proof-of-concept RVF implementation improves absolute accuracy by 2.6% on the VQAv2 (Goyal et al., 2017) dataset and by 1.5% on the OK-VQA (Marino et al., 2019) dataset. We also analyze our implementation's strengths and weaknesses on various question types, highlighting directions for further study.
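To make the three-stage pipeline in the abstract concrete, below is a minimal sketch of the RVF idea: represent the image as natural-language facts, retrieve the facts most relevant to the question, and format them into a language-model prompt. The function names (`score_fact`, `retrieve_facts`, `build_prompt`) and the word-overlap retriever are illustrative assumptions, not the paper's actual method; a real implementation would use detector outputs for the facts, a learned retriever, and a language model such as GPT-3 for the final prediction.

```python
# Toy sketch of the RVF pipeline (assumed structure, not the paper's code).

def score_fact(question: str, fact: str) -> float:
    """Score a fact by word overlap with the question (stand-in retriever)."""
    q_words = set(question.lower().split())
    f_words = set(fact.lower().split())
    return len(q_words & f_words) / max(len(f_words), 1)

def retrieve_facts(question: str, facts: list[str], k: int = 3) -> list[str]:
    """Keep only the k facts most relevant to the question."""
    return sorted(facts, key=lambda f: score_fact(question, f), reverse=True)[:k]

def build_prompt(question: str, facts: list[str]) -> str:
    """Format retrieved facts plus the question as a language-model prompt."""
    context = "\n".join(f"- {f}" for f in facts)
    return f"Facts about the image:\n{context}\nQ: {question}\nA:"

# Facts might come from object-detector tags, as the abstract notes.
facts = ["a red bicycle leaning on a fence", "two people walking", "a cloudy sky"]
question = "What color is the bicycle?"
print(build_prompt(question, retrieve_facts(question, facts)))
```

The key design point is the middle step: because an image can yield many facts, retrieval filters them down to the few worth putting in the prompt, rather than feeding everything to the language model.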
Paper Type: short