Abstract: Highlights
• We present a novel collaborative multi-modal graph network (CMGNet) for video captioning. CMGNet exploits complementary information among multi-modality features during caption generation.
• We propose a Compression-driven Intra-inter Attentive Graph (CIAG) encoder. It uses a Basis Vector Compression (BVC) module to compress nodes into a compact set and then captures the relationships within and between modality features, enhancing the video representation.
• We propose an adaptive multi-modal selection (AMS) module that dynamically accentuates the relevant modality features when generating different types of words, thus improving video captioning performance.
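The adaptive selection idea in the last highlight can be illustrated with a minimal sketch. This is not the paper's AMS module, whose details are not given here. It assumes a simple scaled dot-product score between a decoder state and one feature vector per modality, followed by a softmax; the function name `select_modalities`, the modality names, and all values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_modalities(decoder_state, modality_features):
    """Weight each modality by its relevance to the current decoding step.

    decoder_state: list of floats (assumed hidden state at one step).
    modality_features: dict mapping a modality name to a vector with the
        same length as decoder_state (assumed, one vector per modality).
    Returns (weights, fused), where weights sums to 1 and fused is the
    weighted sum of the modality vectors.
    """
    names = list(modality_features)
    dim = len(decoder_state)
    scale = math.sqrt(dim)
    # Assumed scoring rule: scaled dot product with the decoder state.
    scores = [dot(decoder_state, modality_features[n]) / scale for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * modality_features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Hypothetical two-dimensional features for three modalities.
features = {
    "appearance": [1.0, 0.0],
    "motion": [0.0, 1.0],
    "object": [0.5, 0.5],
}
weights, fused = select_modalities([2.0, 0.0], features)
```

In this toy setup the decoder state aligns with the hypothetical "appearance" vector, so that modality receives the largest weight. A learned module would replace the fixed dot product with trained parameters.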