Enhancing semantic audio-visual representation learning with supervised multi-scale attention

Published: 01 Jan 2025, Last Modified: 07 May 2025 · Pattern Anal. Appl. 2025 · License: CC BY-SA 4.0
Abstract: In recent years, breakthroughs in large models such as GPT and other Transformer-based architectures have demonstrated extraordinary versatility and power across a wide range of fields and tasks. However, despite significant progress in areas such as natural language processing, these models still face unique challenges when processing multimodal data. Data from different modalities often exhibit markedly different characteristics, and this heterogeneity gap makes it difficult to fuse them and extract valuable information. To integrate and align semantic meanings between the audio and visual modalities, this paper proposes a novel supervised multi-scale attention mechanism for enhancing semantic audio-visual representation learning from multimedia data, using audio-visual attention to combine multi-scale features. Specifically, we design a multi-scale feature extraction and audio-visual attention architecture that computes cross-attention weights from the correlation between joint feature representations and single-modal representations. In addition, the model is guided to learn strong discriminative features by minimizing intra-modal and inter-modal discriminative losses while maximizing cross-modal correlations. On the widely used VEGAS and AVE benchmark datasets, our model achieves competitive results, and extensive experiments verify that the proposed method significantly outperforms state-of-the-art cross-modal retrieval methods.
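To make the abstract's description more concrete, the following is a minimal, hypothetical PyTorch sketch of the ingredients it names: per-scale audio and visual embeddings, cross-attention weights derived from the correlation between a joint representation and each single-modal representation, and a combined objective with discriminative (classification) losses plus a cross-modal correlation term. All module names, dimensions, fusion choices, and loss weights here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch only; dimensions, fusion, and loss weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttentionSketch(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, embed_dim=256,
                 num_classes=10, num_scales=3):
        super().__init__()
        # Per-scale projections of each modality into a shared embedding space
        self.audio_proj = nn.ModuleList(
            [nn.Linear(audio_dim, embed_dim) for _ in range(num_scales)])
        self.visual_proj = nn.ModuleList(
            [nn.Linear(visual_dim, embed_dim) for _ in range(num_scales)])
        self.joint_proj = nn.Linear(2 * embed_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, audio_scales, visual_scales):
        # audio_scales / visual_scales: lists of (batch, dim) tensors, one per scale
        fused_all, za_all, zv_all = [], [], []
        for a, v, pa, pv in zip(audio_scales, visual_scales,
                                self.audio_proj, self.visual_proj):
            za, zv = pa(a), pv(v)                                  # single-modal embeddings
            joint = self.joint_proj(torch.cat([za, zv], dim=-1))   # joint representation
            # Attention weights from the correlation (cosine similarity)
            # between the joint representation and each single-modal one
            corr_a = F.cosine_similarity(joint, za, dim=-1)
            corr_v = F.cosine_similarity(joint, zv, dim=-1)
            w = torch.softmax(torch.stack([corr_a, corr_v], dim=-1), dim=-1)
            fused_all.append(w[..., 0:1] * za + w[..., 1:2] * zv)
            za_all.append(za)
            zv_all.append(zv)
        # Combine multi-scale features (simple averaging as a placeholder)
        fused = torch.stack(fused_all).mean(dim=0)
        za = torch.stack(za_all).mean(dim=0)
        zv = torch.stack(zv_all).mean(dim=0)
        return fused, za, zv

    def loss(self, fused, za, zv, labels, lambda_corr=0.1):
        # Intra-/inter-modal discriminative losses: classify every representation
        ce = (F.cross_entropy(self.classifier(za), labels)
              + F.cross_entropy(self.classifier(zv), labels)
              + F.cross_entropy(self.classifier(fused), labels))
        # Maximize cross-modal correlation by minimizing its negative
        corr = F.cosine_similarity(za, zv, dim=-1).mean()
        return ce - lambda_corr * corr


if __name__ == "__main__":
    model = CrossModalAttentionSketch()
    audio = [torch.randn(4, 128) for _ in range(3)]    # toy multi-scale audio features
    visual = [torch.randn(4, 512) for _ in range(3)]   # toy multi-scale visual features
    labels = torch.randint(0, 10, (4,))
    fused, za, zv = model(audio, visual)
    print(model.loss(fused, za, zv, labels).item())
```

The sketch averages the per-scale fused features for brevity; the paper's actual multi-scale combination and loss formulation should be taken from the full text rather than from this illustration.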