Seeing the Unseen: Visual Metaphor Captioning for Videos

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission
TL;DR: We introduce a new Vision-Language task, along with datasets and models, for understanding metaphors in videos.
Abstract: Metaphors are a common communication tool in day-to-day life. The detection and generation of metaphors in textual form have been studied extensively, but metaphors in other modalities remain under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. As no prior work addresses the understanding of metaphors in videos, we introduce a new VL task of describing the metaphors present in videos. To facilitate this novel task, we construct and release two datasets: a manually created dataset of 741 videos with 1,142 human-written captions, and a synthetic dataset of 90,886 MSCOCO images paired with synthetically generated metaphor captions. We propose GIT-LLaVA, a novel video metaphor captioning system that augments a frozen video captioning model with a Large Language Model (LLM) to generate captions. We build our model on top of LLaVA, using GIT as the encoder and mapping its decoder outputs to the LLM (Vicuna) through a lightweight mapping network. We show that this enables the video captioning model to understand video metaphors. We publish our datasets and benchmark results for our new task to enable further research.
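Since the abstract only sketches the GIT-LLaVA architecture, the following is a minimal, hypothetical PyTorch sketch of the design it describes: a frozen GIT captioner whose decoder states are projected into the embedding space of a frozen Vicuna LLM through a small trainable mapping network. The class names, the two-layer MLP mapper, and the hidden sizes (768 for a GIT-base-style model, 4096 for Vicuna-7B) are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the GIT-LLaVA design described in the abstract.
# The frozen video captioner (GIT) and the LLM (Vicuna) are stand-in modules;
# only the lightweight mapping network is trained.

import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """Projects GIT decoder states into the LLM embedding space.
    A two-layer MLP is an assumption; the paper only says 'lightweight'."""

    def __init__(self, git_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(git_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, git_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, git_dim) -> (batch, seq_len, llm_dim)
        return self.proj(git_states)


class GitLLaVASketch(nn.Module):
    def __init__(self, git_captioner: nn.Module, llm: nn.Module,
                 git_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.git = git_captioner   # frozen video captioning model
        self.llm = llm             # frozen LLM (e.g., Vicuna)
        for p in self.git.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        self.mapper = MappingNetwork(git_dim, llm_dim)  # only trainable part

    def forward(self, video_frames: torch.Tensor,
                prompt_embeds: torch.Tensor) -> torch.Tensor:
        # 1. Encode the video with the frozen captioner; take its decoder states.
        with torch.no_grad():
            git_states = self.git(video_frames)          # (B, T, git_dim)
        # 2. Project them into the LLM's embedding space.
        visual_tokens = self.mapper(git_states)          # (B, T, llm_dim)
        # 3. Prepend the projected visual tokens to the text prompt embeddings
        #    and let the LLM generate the metaphor caption.
        llm_inputs = torch.cat([visual_tokens, prompt_embeds], dim=1)
        return self.llm(llm_inputs)


if __name__ == "__main__":
    # Smoke test with dummy stand-ins for GIT and Vicuna.
    class Dummy(nn.Module):
        def __init__(self, out_dim: int):
            super().__init__()
            self.out_dim = out_dim

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.zeros(x.shape[0], 8, self.out_dim)

    model = GitLLaVASketch(Dummy(768), Dummy(4096))
    out = model(torch.zeros(2, 16, 3, 224, 224), torch.zeros(2, 4, 4096))
    print(out.shape)  # shape produced by the dummy LLM stand-in
```

Training only the mapping network while keeping both pretrained models frozen is consistent with the "low compute settings" contribution the submission claims, since the trainable parameter count stays small relative to the LLM.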
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English