MCT-VHD: Multi-modal contrastive transformer for video highlight detection

Yinhui Jiang, Sihui Luo, Lijun Guo, Rong Zhang

Published: 2024, Last Modified: 08 Oct 2024, J. Vis. Commun. Image Represent. 2024, CC BY-SA 4.0

Abstract: Highlights
• Video highlight detection performance depends on how adequately the modalities are fused.
• The key to multimodal fusion is to mine latent semantic information and complementary features between modalities.
• Contrastive learning can better reduce the semantic gap between different modalities.
• A video consists of consecutive audiovisual segments, so it is crucial to extract per-modality context information and segment timing information.
• The effectiveness of attention mechanisms in capturing long-term dependencies has been demonstrated in the video domain.
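To make the contrastive-learning highlight concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between paired visual and audio segment embeddings. It is a generic illustration, not the loss used in MCT-VHD, which this abstract does not specify. The function names (`l2_normalize`, `info_nce`), the `temperature` value, and the toy embeddings are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(visual, audio, temperature=0.1):
    """Symmetric InfoNCE loss over paired segment embeddings.

    visual[i] and audio[i] are assumed to come from the same video segment
    (a positive pair); every other pairing in the batch acts as a negative.
    This is an illustrative sketch, not the paper's formulation.
    """
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    n = len(v)

    # Cosine similarity between every visual and audio segment, scaled by temperature.
    sim = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the logit of the positive pair.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        return log_z - logits[target]

    # Visual-to-audio uses rows of sim; audio-to-visual uses columns.
    v2a = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    a2v = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (v2a + a2v)


# Toy data (hypothetical): two segments with 2-dimensional embeddings.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(visual, audio_aligned)
loss_swapped = info_nce(visual, audio_swapped)
```

In this toy setup the loss is small when each audio embedding matches its own visual segment and large when the pairs are swapped, which is the sense in which a contrastive objective pulls the two modalities toward a shared semantic space.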