Predicting Visual Futures with Image Captioning and Pre-Trained Language Models

Anonymous

16 Jun 2021 (modified: 05 May 2023) · ACL ARR 2021 Jun Blind Submission
Abstract: Visual forecasting is the task of predicting future events from a sequence of input images. Purely pixel-based approaches find this challenging because of abstract concepts and temporal events unfolding at different timescales. In this paper, we present an approach that combines image captioning with pre-trained language models to predict visual futures. By using language as an intermediate medium, our model performs more effective temporal reasoning on two different tasks -- visual story cloze and action forecasting. Despite making its final predictions from the generated captions alone, our approach outperforms state-of-the-art systems by $4\%$ and $6\%$, respectively, on the two tasks. We find that our model consistently picks images/actions that are semantically relevant to the given image sequence rather than simply relying on visual similarity.
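The pipeline the abstract describes can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the captioner is a stand-in that passes strings through, and the "language model" score is a simple word-overlap heuristic in place of a real pre-trained LM.

```python
# Hedged sketch of the caption-then-rank idea: caption the image sequence,
# then score each candidate future with a language model and pick the best.
# caption() and lm_score() below are hypothetical stand-ins, not the paper's models.

def caption(image):
    # Stand-in for an image captioning model; here "images" are already captions.
    return image

def lm_score(context_captions, candidate_caption):
    # Toy LM plausibility score: word overlap between context and candidate.
    # A real system would use a pre-trained language model's likelihood instead.
    ctx_words = set(" ".join(context_captions).lower().split())
    cand_words = set(candidate_caption.lower().split())
    return len(ctx_words & cand_words) / max(len(cand_words), 1)

def predict_future(image_sequence, candidate_images):
    # 1) Caption the observed sequence; 2) rank candidate endings by LM score.
    context = [caption(img) for img in image_sequence]
    scored = [(lm_score(context, caption(c)), c) for c in candidate_images]
    return max(scored)[1]

sequence = ["a man opens the fridge", "the man takes out eggs"]
candidates = ["the man cracks the eggs into a pan", "a dog runs in a park"]
print(predict_future(sequence, candidates))  # picks the semantically related ending
```

Ranking candidates in caption space rather than pixel space is what lets the model favor semantic relevance over visual similarity, as the abstract claims.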
Software: zip