LLM-Augmented Soft-Label Distillation and Cluster-Guided Alignment for Language-Based Audio Retrieval
Keywords: Audio-text retrieval, contrastive learning, knowledge distillation, topic modeling
Abstract: Language-based audio retrieval involves fetching the audio recordings from a database that most closely match a provided text query. In this paper, we study language-based audio retrieval with a dual encoder and show that (i) soft-label distillation from an ensemble of retrieval teachers, (ii) LLM-driven caption augmentation (back-translation and caption mixing for mixed audio), and (iii) cluster-guided auxiliary classification jointly improve robustness to non-binary audio–caption correspondences. On the Clotho dataset, our best single model reaches an mAP@16 of 46.6, and a weighted ensemble attains 48.8 on the development test split. While cluster guidance yields mixed gains across backbones, ablations indicate consistent improvements under high correspondence ambiguity.
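The soft-label distillation idea in (i) can be illustrated with a minimal sketch: an ensemble of retrieval teachers produces a soft target distribution over the batch's audio–caption similarity matrix, and the student is trained to match it via KL divergence. This is not the authors' code; the temperatures, shapes, and averaging scheme are illustrative assumptions.

```python
# Sketch (assumed, not the paper's implementation): soft-label distillation
# for a dual-encoder retrieval model. Teacher similarity matrices are
# averaged into soft targets; the student minimizes KL(teacher || student)
# over its own audio-text similarity matrix.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_sim, teacher_sims, tau_s=0.05, tau_t=0.05):
    """Batch-averaged KL(teacher || student).

    student_sim : (B, B) audio-text similarities from the student.
    teacher_sims: list of (B, B) similarity matrices from teacher models.
    tau_s, tau_t: softmax temperatures (hypothetical values).
    """
    p_teacher = softmax(np.mean(teacher_sims, axis=0) / tau_t, axis=1)
    log_p_student = np.log(softmax(student_sim / tau_s, axis=1) + 1e-12)
    kl = (p_teacher * (np.log(p_teacher + 1e-12) - log_p_student)).sum(axis=1)
    return float(kl.mean())

# Usage: when the teachers agree with the student (same similarities,
# same temperature), the distillation loss is zero.
rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 4))
loss = distill_loss(sim, [sim, sim], tau_s=0.1, tau_t=0.1)
```

In practice this term would be combined with the standard contrastive (e.g. InfoNCE) objective, letting the soft teacher targets encode graded, non-binary audio–caption correspondences rather than one-hot matches.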
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10533