Zero-Shot Visual Grounding of Referring Utterances in Dialogue

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: This work explores whether current pretrained multimodal models, which are optimized to align images and captions, can be applied to the rather different domain of referring expressions. In particular, we test whether one such model, CLIP, is effective in capturing two main trends observed for referential chains uttered within a multimodal dialogue, i.e., that utterances become less descriptive over time while their discriminativeness remains unchanged. We show that CLIP captures both, which opens up the possibility of using these models for reference resolution and generation. Moreover, our analysis indicates a possible role for these architectures in discovering the mechanisms employed by humans when referring to visual entities.
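No code accompanies this abstract; the sketch below only illustrates how CLIP image-text similarity could be scored for the utterances of a referential chain against a target image and some distractors, using the Hugging Face transformers CLIP API with the openai/clip-vit-base-patch32 checkpoint. The utterance strings, image paths, and the "descriptiveness"/"discriminativeness" proxies are illustrative assumptions, not the authors' data or metrics.

# Hedged sketch: scoring referential-chain utterances with CLIP.
# Assumptions (not from the paper): Hugging Face transformers CLIP,
# one target image plus distractor crops, and toy utterance strings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# A toy referential chain: later utterances tend to be shorter and less descriptive.
chain = [
    "the man in the red jacket standing next to the blue bicycle",
    "the guy with the red jacket",
    "red jacket guy",
]

# Target referent plus distractor candidates from the same scene (hypothetical paths).
images = [Image.open(p) for p in ["target.jpg", "distractor_1.jpg", "distractor_2.jpg"]]

inputs = processor(text=chain, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text has shape (num_utterances, num_images): temperature-scaled
# cosine similarities between each utterance and each candidate image.
sims = out.logits_per_text

# Proxy for descriptiveness: raw similarity between an utterance and the target (column 0).
descriptiveness = sims[:, 0]

# Proxy for discriminativeness: probability mass the target receives against the distractors.
discriminativeness = sims.softmax(dim=-1)[:, 0]

for utt, d, disc in zip(chain, descriptiveness, discriminativeness):
    print(f"{utt!r}: target similarity={d:.2f}, target probability={disc:.2f}")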