Abstract: Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has attracted significant research attention. Several studies formulate dense video captioning as a multi-task problem of event localization and event captioning in order to exploit inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by human cognitive information processing. Our model utilizes external memory to incorporate prior knowledge, and a memory retrieval method based on cross-modal video-to-text matching is proposed. To effectively incorporate the retrieved text features, a versatile encoder and a decoder with visual and textual cross-attention modules are designed. Comparative experiments demonstrate the effectiveness of the proposed method on the ActivityNet Captions and YouCook2 datasets. Experimental results show that our model achieves promising performance without extensive pretraining on a large video dataset. Our code is available at https://github.com/ailab-kyunghee/CM2_DVC.
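
To illustrate the cross-modal memory retrieval described above, the following minimal sketch retrieves the text features most similar to each video segment by cosine similarity in a shared embedding space. This is not the authors' implementation: the function name `retrieve_text_features`, the feature dimensions, and the top-k value are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' implementation): retrieving text
# features from an external memory bank via cross-modal video-to-text
# matching. Dimensions and top-k are assumed values.
import torch
import torch.nn.functional as F


def retrieve_text_features(video_feats: torch.Tensor,
                           memory_text_feats: torch.Tensor,
                           top_k: int = 5) -> torch.Tensor:
    """Return the top-k memory text features most similar to each video segment.

    video_feats:        (num_segments, d) segment-level video embeddings
    memory_text_feats:  (memory_size, d)  sentence embeddings stored in memory
    """
    # L2-normalize both modalities so the dot product equals cosine similarity.
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(memory_text_feats, dim=-1)

    # Cross-modal similarity matrix: (num_segments, memory_size)
    sim = v @ t.T

    # Indices of the top-k most similar memory entries per segment.
    topk_idx = sim.topk(top_k, dim=-1).indices        # (num_segments, top_k)

    # Gather the retrieved text features for each segment.
    return memory_text_feats[topk_idx]                # (num_segments, top_k, d)


# Example: 8 video segments against a memory bank of 1,000 sentence embeddings.
if __name__ == "__main__":
    video = torch.randn(8, 512)
    memory = torch.randn(1000, 512)
    print(retrieve_text_features(video, memory).shape)  # torch.Size([8, 5, 512])
```

The retrieved text features would then be fed, alongside the visual features, to the encoder and to the decoder's visual and textual cross-attention modules mentioned in the abstract.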