Abstract: Emotional Video Captioning (EVC) is an emerging task that aims to describe factual content together with the intrinsic emotions expressed in videos. Existing EVC methods first perceive global emotional cues from visual features and then combine them with the video features to guide emotional caption generation. This ignores a critical characteristic of the EVC task: emotional cues have intrinsic motivating causes reflected in the video content, and these visual causes facilitate both emotion perception and emotion-attributed caption generation. To this end, we propose a multi-round mutual emotion-cause pair extraction network (MM-ECPE) that jointly extracts emotional cues and their visual causes through iterative mutual refinement. Specifically, in the first round of mutual learning, we propose a spatio-temporal disentangled visual adaptive refinement (ST-DVAR) module and a multi-level video-guided emotion affine transformation (MV-EAT) module to preliminarily refine the video features and the emotion lexicon, eliminating the noise caused by emotion-irrelevant visual information and video-irrelevant emotional information. In the second round of mutual learning, we apply cross-attention between the preliminarily refined features and the original features to obtain the final emotional cues and visual causes, and couple them into pairs via a contrastive loss. Overall, our approach improves complex semantic understanding and emotion perception of videos, leading to promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and of each proposed module, e.g., improving the latest records by +97.5% and +76.2% w.r.t. CIDEr and CFS, respectively, on the EVC-MSVD dataset.
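To make the second-round mutual learning step concrete, the following is a minimal sketch, not the authors' implementation: it assumes PyTorch, a shared feature dimension, multi-head cross-attention between refined and original feature sequences, and an InfoNCE-style symmetric contrastive loss to couple emotional cues with their visual causes. All class, tensor, and parameter names (e.g., SecondRoundMutualRefinement, refined_visual, temperature) are hypothetical and chosen only for illustration.

```python
# Minimal sketch (assumptions, not the paper's code) of the second-round step:
# cross-attention between preliminarily refined features and the original
# features, plus a contrastive loss that couples emotional cues with causes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SecondRoundMutualRefinement(nn.Module):
    def __init__(self, dim=512, num_heads=8, temperature=0.07):
        super().__init__()
        # Refined features attend to the original features to recover
        # information lost during the first-round preliminary refinement.
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.emotion_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temperature = temperature

    def forward(self, refined_visual, original_visual, refined_emotion, original_emotion):
        # All inputs: (batch, seq_len, dim) feature sequences.
        visual_causes, _ = self.visual_attn(
            query=refined_visual, key=original_visual, value=original_visual)
        emotional_cues, _ = self.emotion_attn(
            query=refined_emotion, key=original_emotion, value=original_emotion)
        return emotional_cues, visual_causes

    def pairwise_contrastive_loss(self, emotional_cues, visual_causes):
        # Pool each sequence to one vector; treat the (cue, cause) pair from
        # the same video as the positive and all other pairs as negatives.
        cue = F.normalize(emotional_cues.mean(dim=1), dim=-1)    # (batch, dim)
        cause = F.normalize(visual_causes.mean(dim=1), dim=-1)   # (batch, dim)
        logits = cue @ cause.t() / self.temperature              # (batch, batch)
        targets = torch.arange(cue.size(0), device=cue.device)
        # Symmetric cross-entropy over both matching directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


# Usage with random tensors standing in for real video/emotion features.
model = SecondRoundMutualRefinement()
rv, ov = torch.randn(4, 20, 512), torch.randn(4, 20, 512)
re, oe = torch.randn(4, 10, 512), torch.randn(4, 10, 512)
cues, causes = model(rv, ov, re, oe)
loss = model.pairwise_contrastive_loss(cues, causes)
```

The symmetric contrastive term is one plausible way to realize the "pair-wise extraction" described above; the paper's actual loss formulation and pooling strategy may differ.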