Enhancing Audio Retrieval with Attention-based Encoder for Audio Feature Representation

Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Wenwu Wang

Published: 2023, Last Modified: 16 May 2025, EUSIPCO 2023, CC BY-SA 4.0

Abstract: Pretrained audio neural networks (PANNs) have been successful in a range of machine audition applications. However, their limited ability to recognise relationships between acoustic scenes and events reduces their performance in language-based audio retrieval, the task of retrieving audio signals from a dataset based on natural language textual queries. This paper proposes an attention-based audio encoder that exploits contextual associations between acoustic scenes and events, using either self-attention or graph attention, trained with different loss functions for language-based audio retrieval. Our experimental results show that the proposed attention-based method outperforms most state-of-the-art methods, with self-attention performing better than graph attention. In addition, the choice of loss function (i.e., NT-Xent loss or supervised contrastive loss) has a less significant impact on the results than the choice of attention strategy.
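
As an illustration of the NT-Xent loss mentioned in the abstract, the following is a minimal, dependency-free sketch of a symmetric NT-Xent objective over a batch of paired audio and text embeddings. It is not the authors' implementation: the function name `nt_xent_loss`, the helper names, the temperature value of 0.07, and the averaging of the audio-to-text and text-to-audio directions are all assumptions made for illustration, and the paper's exact formulation may differ.

```python
import math


def _normalise(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean over rows of -log softmax(row)[i], i.e. the matched pair is the target."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def nt_xent_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric NT-Xent loss; row i of audio_emb is paired with row i of text_emb.

    The temperature default is an assumed value, not one taken from the paper.
    """
    a = [_normalise(v) for v in audio_emb]
    t = [_normalise(v) for v in text_emb]
    # logits[i][j] = cosine similarity of audio i and text j, scaled by temperature
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))
```

In this sketch, every other item in the batch serves as a negative for a given audio-text pair, so the loss is small when each audio embedding is most similar to its own caption and large when the pairing is scrambled.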