Abstract: Metaphors are a common communication tool in day-to-day life. The detection and generation of metaphors in textual form have been studied extensively, but metaphors in other modalities remain under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. To date, no probing studies have examined complex language phenomena such as metaphor in videos. We therefore introduce a new VL task: describing the metaphors present in videos. To facilitate this novel task, we construct and release two datasets: a manually created dataset of 705 videos with 2,115 human-written captions, and a synthetic dataset of 90,886 MSCOCO images with synthetically generated metaphoric captions. We propose GIT-LLaVA, a novel video metaphor captioning system that augments a frozen video captioning model with a Large Language Model (LLM) to generate metaphors, and use it as a strong baseline. We perform a comprehensive analysis of state-of-the-art (SOTA) video-language models on this task, and we publish our datasets and benchmark results to enable further research.
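To make the two-stage design in the abstract concrete, below is a minimal sketch of a frozen-captioner-plus-LLM pipeline. The checkpoint names, the prompt, and the helper functions are illustrative assumptions, not the paper's exact setup; only the overall structure (a frozen video captioner whose output an LLM rewrites as a metaphor) follows the abstract.

```python
import numpy as np
import torch
from transformers import AutoProcessor, AutoModelForCausalLM, pipeline

# Stage 1: a frozen video captioning model produces a literal description.
# The GIT checkpoint name is an assumption; any HF video captioner would do.
processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex")
captioner = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vatex")
captioner.eval()  # kept frozen, as described in the abstract

def literal_caption(frames: list[np.ndarray]) -> str:
    """frames: sampled video frames as HxWx3 uint8 arrays."""
    pixel_values = processor(images=frames, return_tensors="pt").pixel_values
    with torch.no_grad():
        # GIT video checkpoints expect (batch, num_frames, C, H, W).
        ids = captioner.generate(pixel_values=pixel_values.unsqueeze(0),
                                 max_length=50)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

# Stage 2: an instruction-tuned LLM rewrites the literal caption as a
# metaphor. Checkpoint and prompt are illustrative placeholders.
llm = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

def metaphor_caption(frames: list[np.ndarray]) -> str:
    caption = literal_caption(frames)
    prompt = (f"Rewrite this video description as a metaphor: {caption}\n"
              "Metaphoric caption:")
    out = llm(prompt, max_new_tokens=40, return_full_text=False)
    return out[0]["generated_text"].strip()
```

Keeping the captioner frozen means only the language-side component needs metaphor supervision, which matches the low-resource framing of the task.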
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English