Abstract: Video paragraph captioning aims to describe multiple
events in untrimmed videos with descriptive paragraphs.
Existing approaches mainly solve the problem in two steps:
event detection followed by event captioning. This two-step
manner makes the quality of the generated paragraphs highly
dependent on the accuracy of event proposal detection,
which is itself a challenging task. In this paper, we
propose a paragraph captioning model which eschews the
problematic event detection stage and directly generates
paragraphs for untrimmed videos. To describe coherent
and diverse events, we propose to enhance the conventional
temporal attention with dynamic video memories, which
progressively expose new video features and suppress
over-accessed video content to control the visual focus of
the model. In addition, a diversity-driven training strategy is proposed to improve the diversity of the generated paragraphs from the
language perspective. Considering that untrimmed videos
generally contain numerous but highly redundant frames, we further augment the video encoder with keyframe awareness to
improve efficiency. Experimental results on the ActivityNet
and Charades datasets show that our proposed model significantly outperforms state-of-the-art methods on
both accuracy and diversity metrics without using any event
boundary annotations. Code will be released at
https://github.com/syuqings/video-paragraph.
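The following is a minimal, illustrative sketch of what the dynamic-memory temporal attention described in the abstract could look like; it is not the authors' released implementation, and the class name, the linear suppression penalty, and the exposure-frontier masking are assumptions made purely for illustration. It demonstrates the stated idea: attention logits are penalized for video features that have already received much attention mass (suppressing over-accessed content), while features beyond an exposure frontier are masked until the frontier advances (progressively exposing new content).

import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicMemoryAttention(nn.Module):
    """Temporal attention with a dynamic video memory (illustrative sketch)."""

    def __init__(self, dim, suppress_weight=1.0):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)
        self.suppress_weight = suppress_weight

    def forward(self, query, video_feats, access_counts, exposed_len):
        # query:         (batch, dim)    decoder state at the current step
        # video_feats:   (batch, T, dim) frame/segment features of the video
        # access_counts: (batch, T)      accumulated attention mass per feature
        # exposed_len:   int (>= 1)      number of features currently exposed
        q = self.query_proj(query).unsqueeze(1)           # (batch, 1, dim)
        k = self.key_proj(video_feats)                    # (batch, T, dim)
        logits = (q * k).sum(-1) / k.size(-1) ** 0.5      # (batch, T)

        # Suppress over-accessed video content: the more a feature has been
        # attended to so far, the lower its logit at the current step.
        logits = logits - self.suppress_weight * access_counts

        # Progressively expose new video features: mask out features beyond
        # the current exposure frontier.
        positions = torch.arange(video_feats.size(1), device=logits.device)
        logits = logits.masked_fill(positions.unsqueeze(0) >= exposed_len,
                                    float("-inf"))

        attn = F.softmax(logits, dim=-1)                  # (batch, T)
        context = torch.bmm(attn.unsqueeze(1), video_feats).squeeze(1)
        new_counts = access_counts + attn                 # update the memory
        return context, new_counts

Under this reading, the decoder would call such a module once per generation step and advance exposed_len as sentences complete, so that later sentences in the paragraph are encouraged to attend to previously unseen video segments.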