Abstract: Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has attracted significant research attention. Several studies formulate dense video captioning as a multi-task problem of event localization and event captioning in order to exploit inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by human cognitive information processing. Our model utilizes external memory to incorporate prior knowledge, and a memory retrieval method based on cross-modal video-to-text matching is proposed. To effectively incorporate the retrieved text features, a versatile encoder and a decoder with visual and textual cross-attention modules are designed. Comparative experiments demonstrate the effectiveness of the proposed method on the ActivityNet Captions and YouCook2 datasets. Experimental results show that our model achieves promising performance without extensive pretraining on a large video dataset. Our code is available at https://github.com/ailab-kyunghee/CM2_DVC.
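
To illustrate the cross-modal memory retrieval described above, the following minimal sketch retrieves the text features most similar to each video segment by cosine similarity in a shared embedding space. This is not the authors' implementation: the function name `retrieve_text_features`, the feature dimensions, and the top-k value are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' implementation): retrieving text
# features from an external memory bank via cross-modal video-to-text
# matching. Dimensions and top-k are assumed values.
import torch
import torch.nn.functional as F


def retrieve_text_features(video_feats: torch.Tensor,
                           memory_text_feats: torch.Tensor,
                           top_k: int = 5) -> torch.Tensor:
    """Return the top-k memory text features most similar to each video segment.

    video_feats:        (num_segments, d) segment-level video embeddings
    memory_text_feats:  (memory_size, d)  sentence embeddings stored in memory
    """
    # L2-normalize both modalities so the dot product equals cosine similarity.
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(memory_text_feats, dim=-1)

    # Cross-modal similarity matrix: (num_segments, memory_size)
    sim = v @ t.T

    # Indices of the top-k most similar memory entries per segment.
    topk_idx = sim.topk(top_k, dim=-1).indices        # (num_segments, top_k)

    # Gather the retrieved text features for each segment.
    return memory_text_feats[topk_idx]                # (num_segments, top_k, d)


# Example: 8 video segments against a memory bank of 1,000 sentence embeddings.
if __name__ == "__main__":
    video = torch.randn(8, 512)
    memory = torch.randn(1000, 512)
    print(retrieve_text_features(video, memory).shape)  # torch.Size([8, 5, 512])
```

The retrieved text features would then be fed, alongside the visual features, to the encoder and to the decoder's visual and textual cross-attention modules mentioned in the abstract.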