GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

Anonymous

16 Oct 2022 (modified: 05 May 2023)
ACL ARR 2022 October Blind Submission
Readers: Everyone
Keywords: vision and language, few-shot learning, grounding
Abstract: Cross-task generalization is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. However, such generalization to vision-language (VL) tasks, including grounding and generation tasks, has been under-explored. Furthermore, existing few-shot VL models struggle to handle tasks that involve object grounding and multiple images, such as visual commonsense reasoning or NLVR2. In this paper, we introduce GRILL (GRounded vIsion Language aLigning), a novel VL model that learns object grounding and localization in pre-training and can adapt to diverse grounding tasks with no or very few training instances. Specifically, GRILL exploits object-text alignments and learns to ground objects in pre-training, which enables it to transfer to tasks such as referring expression comprehension and visual commonsense reasoning in a zero-/few-shot fashion. We evaluate our model on various zero-/few-shot VL tasks and show that it consistently surpasses the state-of-the-art few-shot methods.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond