Less Descriptive yet Discriminative: Quantifying the Properties of Multimodal Referring Utterances via CLIP

Anonymous

Published: 29 Mar 2022, Last Modified: 05 May 2023, CMCL 2022
Keywords: referring expressions, visual grounding, multimodal models, multimodal dialogue
TL;DR: We use a transformer-based pre-trained multimodal model, CLIP, to shed light on the strategies employed by human speakers when referring to visual entities in multimodal dialogue.
Abstract: In this work, we use a transformer-based pre-trained multimodal model, CLIP, to shed light on the mechanisms employed by human speakers when referring to visual entities. In particular, we use CLIP to quantify the degree of descriptiveness (how well an utterance describes an image in isolation) and discriminativeness (to what extent an utterance is effective in picking out a single image among similar images) of human referring utterances within multimodal dialogues. Overall, our results show that utterances become less descriptive over time while their discriminativeness remains unchanged. Through analysis, we propose that this trend could be due to participants relying on previous mentions in the dialogue history, as well as being able to distill the most discriminative information from the visual context. In general, our study opens up the possibility of using this and similar models to quantify patterns in human data and shed light on the underlying cognitive mechanisms.
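As an illustrative sketch (not the paper's exact metrics), the two notions in the abstract can be operationalized with off-the-shelf CLIP: a descriptiveness-style score as the CLIP image-text similarity between an utterance and its target image in isolation, and a discriminativeness-style score as the probability mass CLIP assigns to the target image when the same utterance is scored against the target plus distractor images. The model checkpoint, file names, and scoring details below are assumptions for illustration only.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical choice of checkpoint; the paper does not specify this exact setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def descriptiveness(utterance, image):
    """CLIP similarity between one utterance and one image (higher = more descriptive)."""
    inputs = processor(text=[utterance], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image is the scaled image-text cosine similarity (shape 1x1 here).
    return outputs.logits_per_image.item()

def discriminativeness(utterance, target_image, distractor_images):
    """Probability CLIP assigns to the target image among target + distractors."""
    images = [target_image] + distractor_images
    inputs = processor(text=[utterance], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over images for the single utterance; index 0 is the target image.
    probs = outputs.logits_per_text.softmax(dim=-1)
    return probs[0, 0].item()

# Hypothetical usage with placeholder image paths.
target = Image.open("target.jpg")
distractors = [Image.open(p) for p in ["distractor_1.jpg", "distractor_2.jpg"]]
print(descriptiveness("the dog on the left", target))
print(discriminativeness("the dog on the left", target, distractors))
```

Under this reading, descriptiveness ignores the visual context entirely, while discriminativeness is inherently contrastive: an utterance can score low in isolation yet still single out its target among similar candidates, which is the pattern the abstract reports for later turns in the dialogues.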
