Abstract: Video is a natural source of multi-modal data with intrinsic correlations between different modalities, such as objects, motions, and captions. Though intuitive, such inherent supervision has not been well explored in previous video-audio retrieval work. In addition, existing methods exploit the video stream and the audio stream separately, ignoring the mutual interactions between them. In this paper, we propose a two-stream model named Multi-modal Aggregation and Co-attention network (MAC), which processes video and audio inputs with co-attentional interactions. Specifically, our method takes raw videos as input and extracts aggregated features from multiple modalities to benefit video representation learning. We then introduce a self-attention mechanism that lets videos adaptively assign higher weights to the most representative modalities. Moreover, we introduce a co-attention transformer module to better capture the relations between the video and audio streams. By exchanging key-value pairs in the multi-headed attention, this module enables video-attended audio features to be incorporated into video representations and vice versa. Experiments show that our method significantly outperforms other state-of-the-art approaches.
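The key-value exchange described above follows the general pattern of co-attentional transformer layers: each stream forms the queries while the other stream supplies the keys and values. The following is a minimal illustrative sketch of that idea in PyTorch; the module name, dimensions, and layer layout are assumptions for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Hypothetical co-attention layer: each stream queries the other
    stream's keys/values (a sketch of the general technique, not MAC itself)."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Video queries attend over audio keys/values, and vice versa.
        self.video_attends_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_attends_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # Exchange key-value pairs: video supplies the queries, audio supplies K/V.
        v_out, _ = self.video_attends_audio(video_tokens, audio_tokens, audio_tokens)
        # Symmetric direction: audio supplies the queries, video supplies K/V.
        a_out, _ = self.audio_attends_video(audio_tokens, video_tokens, video_tokens)
        # Residual connections followed by layer normalization.
        video_tokens = self.norm_v(video_tokens + v_out)
        audio_tokens = self.norm_a(audio_tokens + a_out)
        return video_tokens, audio_tokens

# Usage sketch with toy tensor shapes (batch, sequence length, feature dim).
video = torch.randn(2, 16, 512)   # e.g. 16 aggregated multi-modal video tokens
audio = torch.randn(2, 32, 512)   # e.g. 32 audio frame features
block = CoAttentionBlock()
video_attended, audio_attended = block(video, audio)
```

In this sketch, the two attention sub-layers implement the "video-attended audio features" and "audio-attended video features" directions respectively; stacking several such blocks would let the two streams iteratively refine each other.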