CMGNet: Collaborative multi-modal graph network for video captioning

Qi Rao, Xin Yu, Guang Li, Linchao Zhu

Published: 2024, Last Modified: 24 Oct 2024. Comput. Vis. Image Underst. 2024. CC BY-SA 4.0

Abstract: Highlights
• We present a novel collaborative multi-modal graph network (CMGNet) for video captioning. CMGNet aims to exploit complementary information among multi-modality features in caption generation.
• We propose a Compression-driven Intra-inter Attentive Graph (CIAG) encoder that uses a Basis Vector Compression (BVC) module to compress nodes into a compact representation and then captures the relationships within and between modality features, enhancing video representation.
• We propose an adaptive multi-modal selection (AMS) module that dynamically emphasizes the relevant modality features when generating different types of words, thus improving video captioning performance.
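
The highlights do not say how the AMS module weights modalities, so the following is only a minimal sketch of one common way to do dynamic modality selection: softmax gating over scaled dot-product scores between a decoder state and one feature vector per modality. The function name `select_modalities`, the modality names (`appearance`, `motion`, `object`), and the scoring rule are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_modalities(decoder_state, modality_features):
    """Hypothetical gating step (assumed, not taken from the paper).

    Scores each modality by its scaled dot product with the current
    decoder state, turns the scores into weights with a softmax, and
    returns the weights plus the weighted sum of the modality features.
    """
    names = list(modality_features)
    scale = math.sqrt(len(decoder_state))
    scores = [dot(decoder_state, modality_features[n]) / scale for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * modality_features[n][i] for w, n in zip(weights, names))
        for i in range(len(decoder_state))
    ]
    return dict(zip(names, weights)), fused


# Toy inputs: the decoder state aligns most strongly with the "motion" feature,
# as might happen when the next word to generate is a verb.
state = [1.0, 0.0, 0.0, 0.0]
features = {
    "appearance": [0.2, 0.1, 0.0, 0.0],  # hypothetical modality names
    "motion": [2.0, 0.0, 0.0, 0.0],
    "object": [0.5, 0.5, 0.0, 0.0],
}
weights, fused = select_modalities(state, features)
```

In this toy setup the weights sum to one and the largest weight goes to `motion`, which illustrates the general idea of emphasizing a different modality depending on what the decoder is about to generate.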