Abstract: Metaphors are a common communication tool in day-to-day life. The detection and generation of metaphors in textual form have been studied extensively, but metaphors in other modalities remain under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. To date, no probing studies have examined complex language phenomena such as metaphor in videos. We therefore introduce a new VL task: describing the metaphors present in videos. To facilitate this novel task, we construct and release two datasets: a manually created dataset of 705 videos with 2,115 human-written captions, and a synthetic dataset of 90,886 MSCOCO images with synthetically generated metaphoric captions. We propose GIT-LLaVA, a novel video metaphor captioning system that augments a frozen video captioning model with a Large Language Model (LLM) to generate metaphors, and use it as a strong baseline. We perform a comprehensive analysis of state-of-the-art (SOTA) video-language models on this task, and we publish our datasets and benchmark results to enable further research.
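To make the two-stage design in the abstract concrete, below is a minimal sketch of a frozen-captioner-plus-LLM pipeline. The checkpoint names, the prompt, and the helper functions are illustrative assumptions, not the paper's exact setup; only the overall structure (a frozen video captioner whose output an LLM rewrites as a metaphor) follows the abstract.

```python
import numpy as np
import torch
from transformers import AutoProcessor, AutoModelForCausalLM, pipeline

# Stage 1: a frozen video captioning model produces a literal description.
# The GIT checkpoint name is an assumption; any HF video captioner would do.
processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex")
captioner = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vatex")
captioner.eval()  # kept frozen, as described in the abstract

def literal_caption(frames: list[np.ndarray]) -> str:
    """frames: sampled video frames as HxWx3 uint8 arrays."""
    pixel_values = processor(images=frames, return_tensors="pt").pixel_values
    with torch.no_grad():
        # GIT video checkpoints expect (batch, num_frames, C, H, W).
        ids = captioner.generate(pixel_values=pixel_values.unsqueeze(0),
                                 max_length=50)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

# Stage 2: an instruction-tuned LLM rewrites the literal caption as a
# metaphor. Checkpoint and prompt are illustrative placeholders.
llm = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

def metaphor_caption(frames: list[np.ndarray]) -> str:
    caption = literal_caption(frames)
    prompt = (f"Rewrite this video description as a metaphor: {caption}\n"
              "Metaphoric caption:")
    out = llm(prompt, max_new_tokens=40, return_full_text=False)
    return out[0]["generated_text"].strip()
```

Keeping the captioner frozen means only the language-side component needs metaphor supervision, which matches the low-resource framing of the task.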
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English