Attending to Visual Differences for Situated Language Generation in Changing Scenes

Anonymous

16 Oct 2021 (modified: 05 May 2023) · ACL ARR 2021 October Blind Submission
Abstract: We investigate the problem of generating utterances from pairs of images showing a before and an after state of a change in a visual scene. We present a transformer model with difference attention heads that learns to attend to visual changes in consecutive images via a difference key. We test our approach in instruction generation, change captioning, and difference spotting and compare these tasks in terms of their linguistic phenomena and reasoning abilities. Our model outperforms the state-of-the-art for instruction generation on the BLOCKS and difference spotting on the Spot-the-diff dataset and generates accurate referential and compositional spatial expressions. Finally, we identify linguistic phenomena that pose challenges for generation in changing scenes.
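The abstract describes attention heads that attend to visual changes "via a difference key". The paper's exact formulation is not given here, but a minimal sketch of one plausible reading — keys computed as the feature difference between the after and before images, so attention weights concentrate on changed regions — could look as follows (all names and the single-head, NumPy formulation are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def difference_attention(q, k_before, k_after, v):
    """Single-head scaled dot-product attention whose keys are the
    per-region feature differences between the after and before images
    (a hypothetical 'difference key'; illustrative only)."""
    k_diff = k_after - k_before                # difference key per region
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = q @ k_diff.T * scale              # (n_queries, n_regions)
    weights = softmax(scores, axis=-1)         # attention over regions
    return weights @ v                         # attended value vectors

# Toy example: 2 decoder queries over 4 image regions, feature dim 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8))
k_before = rng.normal(size=(4, 8))
k_after = k_before.copy()
k_after[1] += 1.0                              # only region 1 changed
v = rng.normal(size=(4, 8))
out = difference_attention(q, k_before, k_after, v)
print(out.shape)                               # (2, 8)
```

Because unchanged regions yield a near-zero difference key, their dot products with any query are small, so the softmax mass shifts toward regions that actually changed — the behavior the abstract attributes to the difference attention heads.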