Semantic-Driven Saliency-Context Separation for Video Captioning

ICME 2022 (modified: 16 Nov 2022)
Abstract: Video captioning aims at generating a natural language description for a given video clip, covering not only salient scenarios but also contextual scenarios. The former reveal the highlights of a video and are the focus of most existing captioning methods. The latter, however, are not well explored and are easily ignored, even though they may provide detailed and latent information that helps with a better understanding of the video. To effectively exploit the information contained in both, a novel video captioning network is proposed. It has two key modules: Cross-Modality Selection (CMS) and Saliency-Context Adaptive Decoder (SCAD). Specifically, CMS focuses on utilizing semantic information to distinguish saliency from context, while SCAD adaptively attends to both the saliency and the context to generate more detailed and precise captions. Experiments on two benchmark datasets, i.e., MSVD and MSR-VTT, demonstrate the effectiveness of our model through comparison with state-of-the-art methods.