There’s a Time and Place for Reasoning Beyond the Image

Anonymous

16 Nov 2021 (modified: 05 May 2023), ACL ARR 2021 November Blind Submission, Readers: Everyone
Abstract: To human eyes, images often convey more than their pixels alone, because we can infer, associate, and reason with contextual information from other sources to establish a more complete picture. For example, in Figure 1, we can identify the news articles related to the picture through segment-wise understanding of the signs, the buildings, the crowds, and more. This tells us when and where the image was taken, which helps in subsequent tasks such as evidence retrieval for criminal activities, automatic storyline construction, and upper-stream processing such as image clustering. In this work, we formulate this problem and introduce TARA: a dataset of 16k images with their associated news, time, and location, automatically extracted from the New York Times (NYT), plus an additional 61k examples from WIT as distant supervision. On top of the extractions, we present a crowdsourced subset, intended for evaluation, of images whose spatio-temporal information annotators judged feasible to find. We show that there is a $70\%$ gap between a state-of-the-art joint model and human performance, which our proposed model, using segment-wise reasoning, narrows slightly. This motivates higher-level vision-language joint models that can conduct open-ended reasoning with world knowledge.