LLM-Augmented Soft-Label Distillation and Cluster-Guided Alignment for Language-Based Audio Retrieval
Keywords: Audio-text retrieval, contrastive learning, knowledge distillation, topic modeling
Abstract: Language-based audio retrieval involves fetching the audio recordings from a database that most closely match a provided text query. In this paper, we study language-based audio retrieval with a dual encoder and show that (i) soft-label distillation from an ensemble of retrieval teachers, (ii) LLM-driven caption augmentation (back-translation and caption mixing for mixed audio), and (iii) cluster-guided auxiliary classification jointly improve robustness to non-binary audio–caption correspondences. On the Clotho dataset, our best single model reaches an mAP@16 of 46.6, and a weighted ensemble attains 48.8 on the development test split. While cluster guidance yields mixed gains across backbones, ablations indicate consistent improvements under high correspondence ambiguity.
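The soft-label distillation idea in (i) can be illustrated with a minimal sketch: an ensemble of retrieval teachers produces a soft target distribution over the batch's audio–caption similarity matrix, and the student is trained to match it via KL divergence. This is not the authors' code; the temperatures, shapes, and averaging scheme are illustrative assumptions.

```python
# Sketch (assumed, not the paper's implementation): soft-label distillation
# for a dual-encoder retrieval model. Teacher similarity matrices are
# averaged into soft targets; the student minimizes KL(teacher || student)
# over its own audio-text similarity matrix.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_sim, teacher_sims, tau_s=0.05, tau_t=0.05):
    """Batch-averaged KL(teacher || student).

    student_sim : (B, B) audio-text similarities from the student.
    teacher_sims: list of (B, B) similarity matrices from teacher models.
    tau_s, tau_t: softmax temperatures (hypothetical values).
    """
    p_teacher = softmax(np.mean(teacher_sims, axis=0) / tau_t, axis=1)
    log_p_student = np.log(softmax(student_sim / tau_s, axis=1) + 1e-12)
    kl = (p_teacher * (np.log(p_teacher + 1e-12) - log_p_student)).sum(axis=1)
    return float(kl.mean())

# Usage: when the teachers agree with the student (same similarities,
# same temperature), the distillation loss is zero.
rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 4))
loss = distill_loss(sim, [sim, sim], tau_s=0.1, tau_t=0.1)
```

In practice this term would be combined with the standard contrastive (e.g. InfoNCE) objective, letting the soft teacher targets encode graded, non-binary audio–caption correspondences rather than one-hot matches.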
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10533