Dual-path Collaborative Generation Network for Emotional Video Captioning

Published: 20 Jul 2024, Last Modified: 30 Jul 2024, MM 2024 Oral, CC BY 4.0
Abstract: Emotional Video Captioning (EVC) is an emerging task that aims to describe factual content together with the intrinsic emotions expressed in videos. The essence of the EVC task is to effectively perceive subtle and ambiguous visual emotional cues during caption generation, which is neglected in traditional video captioning. Existing emotional video captioning methods first perceive global visual emotional cues and then combine them with video features to guide emotional caption generation, which neglects two characteristics of the EVC task. First, these methods ignore the dynamic, subtle changes in the intrinsic emotions of a video, making it difficult to handle common scenes with diverse and changeable emotions. Second, because these methods incorporate emotional cues into every step, the guiding role of emotion is overemphasized and the factual content is more or less ignored during generation. To this end, we propose a dual-path collaborative generation network that dynamically perceives the evolution of visual emotional cues while generating emotional captions through collaborative learning; the two paths promote each other and significantly improve generation performance. Specifically, in the dynamic emotion perception path, we propose a dynamic emotion evolution module that first aggregates visual features and historical caption features to summarize global visual emotional cues, and then dynamically selects the emotional cues to be re-composed at each stage and re-composes them to achieve emotion evolution by dynamically enhancing or suppressing the semantics of subspaces at different granularities. Moreover, in the adaptive caption generation path, to balance the description of factual content and emotional cues, we propose an emotion-adaptive decoder that first estimates the emotion intensity at each generation step via the alignment of emotional features and historical caption features, and then adaptively incorporates emotional guidance into caption generation based on this intensity. Thus, our method can generate emotion-related words at the necessary time steps, and our caption generation balances the guidance of factual content and emotional cues well. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and of each proposed module.
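To make the emotion-adaptive decoding step described in the abstract concrete, below is a minimal PyTorch sketch of one generation step: an intensity score is estimated by aligning the current emotional cue with the historical caption state, and the emotional guidance is then gated into the decoder context in proportion to that intensity. The bilinear alignment, sigmoid gate, module names, and tensor shapes are all illustrative assumptions, not the authors' actual implementation.

```python
# A hypothetical sketch of emotion-adaptive decoding (not the paper's code):
# emotion intensity = alignment(emotional cue, historical caption state),
# then emotional guidance is gated into the step context by that intensity.
import torch
import torch.nn as nn


class EmotionAdaptiveDecoderStep(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.align = nn.Bilinear(dim, dim, 1)  # alignment score (assumed form)
        self.fuse = nn.Linear(2 * dim, dim)    # fuse factual + gated emotional context

    def forward(self, caption_state, fact_feat, emo_feat):
        # caption_state: (B, D) summary of the words generated so far
        # fact_feat:     (B, D) factual visual context at this step
        # emo_feat:      (B, D) current (evolved) emotional cue
        intensity = torch.sigmoid(self.align(emo_feat, caption_state))  # (B, 1)
        # Scale the emotional guidance by the estimated intensity, so
        # emotion-related words are encouraged only when needed.
        guided = torch.cat([fact_feat, intensity * emo_feat], dim=-1)
        return self.fuse(guided), intensity


# Usage on random features, just to check the shapes:
step = EmotionAdaptiveDecoderStep(dim=512)
h = torch.randn(2, 512)  # historical caption features
v = torch.randn(2, 512)  # factual visual features
e = torch.randn(2, 512)  # emotional cue features
ctx, intensity = step(h, v, e)
print(ctx.shape, intensity.shape)  # torch.Size([2, 512]) torch.Size([2, 1])
```

The gate is the key design choice under this reading of the abstract: when the intensity is near zero the step reduces to purely factual decoding, which is how the method avoids overemphasizing emotion at every time step.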
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: In this paper, we investigate the Emotional Video Captioning (EVC) task, an emerging task that aims to describe factual content together with the intrinsic emotions expressed in videos. The EVC task extends traditional video captioning and involves three modalities: vision, language, and emotion. Thus, we believe that our work is strongly relevant to the ACM Multimedia conference.
Supplementary Material: zip
Submission Number: 4952