MetaphorVU: Towards Metaphorical Video Understanding

Published: 30 Apr 2026, Last Modified: 24 Jun 2026
Venue: ICML 2026 spotlight
License: CC BY 4.0
TL;DR: A systematic benchmark, extensive experiments and analysis, and an effective enhancement method for metaphorical video understanding.
Abstract: Metaphorical videos are prevalent across various real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of MLLMs but also impedes the thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Through experiments, we find current MLLMs struggle with accurate metaphorical video understanding, lagging far behind human level, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework achieving consistent performance improvement. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs. Code: https://github.com/icip-cas/MetaphorVU.
Lay Summary: People often use metaphors in videos to express complex ideas in creative ways—for example, showing a wilting flower to represent sadness, or a soaring bird to symbolize freedom. Understanding such metaphors requires more than just recognizing what's on the screen; it demands deeper thinking, like connecting unrelated concepts and grasping hidden meanings. While today's AI systems that process videos (called multimodal large language models, or MLLMs) are becoming impressively capable, it remains unclear whether they can truly understand these kinds of figurative messages the way humans do. In this work, we introduce MetaphorVU-Bench, the first comprehensive test designed to measure how well AI models understand metaphors in videos. We find that current AI systems lag far behind humans, mainly because they struggle to connect concepts across different domains. To address this, we build a metaphor knowledge graph and develop a method called MetaphorBoost that helps AI better interpret video metaphors without retraining. We hope our work supports future progress toward AI that understands the creative ways humans communicate.
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/icip-cas/MetaphorVU
Primary Area: Deep Learning->Large Language Models
Keywords: metaphorical video understanding
Originally Submitted PDF: pdf
Submission Number: 18905