Abstract: Dense video captioning aims to generate text descriptions for a series of events in an untrimmed video, and it can be divided into two sub-tasks: event detection and event captioning. Unlike previous works that tackle the two sub-tasks separately, recent works have focused on enhancing the inter-task association between them. However, designing inter-task interactions for event detection and captioning is not trivial due to the large differences in their task-specific solutions. Besides, previous event detection methods normally ignore the temporal dependencies between events, leading to event redundancy or inconsistency problems. To tackle the above two defects, in this paper we define event detection as a sequence generation task and propose a unified pre-training and fine-tuning framework that naturally enhances the inter-task association between event detection and captioning. Since the model predicts each event with previous events as context, the inter-dependency between events is fully exploited, and thus our model can detect more diverse and consistent events in the video. Experiments on the ActivityNet dataset show that our model outperforms state-of-the-art methods and can be further boosted when pre-trained on extra large-scale video-text data. Code is available at https://github.com/QiQAng/UEDVC.
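
Below is a minimal sketch of the core idea of casting event detection as sequence generation, in which each event boundary is decoded with previously generated events as context. This is not the authors' implementation: the module structure, the discretization of timestamps into tokens, and all names (EventSequenceDecoder, num_bins, etc.) are illustrative assumptions.

import torch
import torch.nn as nn

class EventSequenceDecoder(nn.Module):
    # Illustrative decoder: event boundaries are discretized into tokens and
    # generated autoregressively, so each new event is conditioned on the
    # video features and on all previously predicted events.
    def __init__(self, d_model=512, num_bins=100, num_layers=2):
        super().__init__()
        self.token_embed = nn.Embedding(num_bins + 1, d_model)  # +1 for an end-of-sequence token
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_bins + 1)

    def forward(self, video_feats, prev_event_tokens):
        # video_feats: (B, T, d_model) encoded clip features used as decoder memory
        # prev_event_tokens: (B, L) boundary tokens of the events generated so far
        tgt = self.token_embed(prev_event_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, video_feats, tgt_mask=mask)
        return self.head(hidden)  # logits over the next boundary token

At inference, one would repeatedly feed the tokens decoded so far back into prev_event_tokens until the end-of-sequence token is produced, which is how the previously detected events act as context for the next one.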