Hierarchical Multimodal Attention Network Based on Semantically Textual Guidance for Video Captioning

Published: 01 Jan 2022, Last Modified: 06 Jun 2025 · ICONIP (3) 2022 · CC BY-SA 4.0
Abstract: Drawing on multiple modalities to understand video semantics is natural: when humans watch a video, they combine diverse cues to describe its contents in natural language. In this paper, we propose a hierarchical multimodal attention network for video captioning that promotes visual-textual and visual-visual information interaction; it is composed of two types of attention modules that learn multimodal visual representations in a hierarchical manner. Specifically, the visual-textual attention modules align semantic textual guidance with global and local visual representations, leading to a comprehensive understanding of the video-language correspondence. The visual-visual attention modules then jointly model these diverse visual representations, producing compact and powerful video representations for the captioning model. Extensive experiments on two public benchmark datasets demonstrate that our approach is highly competitive with state-of-the-art methods.
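To make the two-level design concrete, below is a minimal PyTorch sketch of the hierarchical attention idea described in the abstract: a visual-textual attention step aligns a textual guidance vector with global and local visual features, and a visual-visual attention step then fuses the two attended streams into one video representation. All module names, dimensions, and the single-query additive-attention formulation are illustrative assumptions, not the authors' implementation.

```
# Illustrative sketch only: assumed module names, dimensions, and attention form.
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Single-query additive attention: a query vector attends over a feature set."""

    def __init__(self, query_dim, feat_dim, hidden_dim):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, hidden_dim)
        self.k_proj = nn.Linear(feat_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, query, feats):
        # query: (B, query_dim); feats: (B, N, feat_dim)
        energy = self.score(torch.tanh(self.q_proj(query).unsqueeze(1) + self.k_proj(feats)))
        weights = torch.softmax(energy, dim=1)   # (B, N, 1)
        return (weights * feats).sum(dim=1)      # (B, feat_dim)


class HierarchicalMultimodalAttention(nn.Module):
    """Visual-textual attention per stream, then visual-visual fusion of the streams."""

    def __init__(self, text_dim, vis_dim, hidden_dim):
        super().__init__()
        self.text_to_global = CrossAttention(text_dim, vis_dim, hidden_dim)
        self.text_to_local = CrossAttention(text_dim, vis_dim, hidden_dim)
        self.visual_fusion = CrossAttention(vis_dim, vis_dim, hidden_dim)

    def forward(self, text_guidance, global_feats, local_feats):
        # text_guidance: (B, text_dim); global_feats/local_feats: (B, N, vis_dim)
        g = self.text_to_global(text_guidance, global_feats)  # text-guided global feature
        l = self.text_to_local(text_guidance, local_feats)    # text-guided local feature
        streams = torch.stack([g, l], dim=1)                  # (B, 2, vis_dim)
        return self.visual_fusion(g, streams)                 # fused video representation


if __name__ == "__main__":
    model = HierarchicalMultimodalAttention(text_dim=300, vis_dim=512, hidden_dim=256)
    out = model(torch.randn(2, 300), torch.randn(2, 20, 512), torch.randn(2, 36, 512))
    print(out.shape)  # torch.Size([2, 512])
```

The fused output would then be fed to the caption decoder at each generation step; the actual guidance extraction, feature backbones, and decoder are not shown here.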