Language-based Audio Retrieval with GPT-Augmented Captions and Self-Attended Audio Clips

Published: 01 Jan 2024, Last Modified: 07 Oct 2025. CSCWD 2024. License: CC BY-SA 4.0
Abstract: With the explosion of user-generated content in recent years, efficient methods for organizing multimedia databases by content and retrieving relevant items have become essential. Language-based audio retrieval seeks to find relevant audio clips given natural language queries. However, datasets developed specifically for this task are scarce, and the language annotations often carry biases, leading to unsatisfactory retrieval accuracy. In this work, we propose a novel framework for language-based audio retrieval that aims to: 1) utilize GPT-generated text to augment audio captions, thereby improving language diversity; 2) employ audio self-attention mechanisms to capture intricate acoustic features and temporal dependencies. Experiments conducted on two public datasets, containing both short- and long-duration audio clips, demonstrate that our framework achieves significant performance improvements over other methods. Specifically, the proposed framework achieves a 27% increase in mean average precision (mAP) on the Clotho dataset and a 31% improvement in mAP on the AudioCaps dataset, compared with the baseline.
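To make the second component of the abstract concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of how self-attention pooling over audio frame embeddings and cosine-similarity text-to-audio retrieval could look; all function names, dimensions, and the mean-pooling step are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attend_pool(frames):
    """Pool (T, d) audio frame embeddings into one clip vector.

    Scaled dot-product self-attention lets each frame aggregate
    context from all other frames (capturing temporal dependencies),
    then mean pooling produces a single clip embedding. This is a
    simplified, parameter-free stand-in for a learned attention layer.
    """
    d = frames.shape[-1]
    scores = frames @ frames.T / np.sqrt(d)        # (T, T) frame-to-frame scores
    attended = softmax(scores, axis=-1) @ frames   # (T, d) context-enriched frames
    return attended.mean(axis=0)                   # (d,) clip embedding


def retrieve(text_emb, clip_embs):
    """Rank audio clips by cosine similarity to a text query embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    sims = c @ t
    return np.argsort(-sims)  # clip indices, best match first
```

In a full system, `text_emb` would come from a language encoder trained on the GPT-augmented captions, and the attention weights would be learned jointly with a retrieval loss rather than computed directly from raw frame similarities as above.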