Keywords: Audio-Text Retrieval, Multimodal Learning, Contrastive Learning, Knowledge Augmentation, Representation Learning
Abstract: The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch contrastive learning. This process is constrained by what we term the Gradient Locality Bottleneck (GLB): optimization is limited to in-batch contrasts, which prevents the model from leveraging out-of-batch knowledge and consequently hinders long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), in which a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability weighting scheme that ensures only consistent knowledge contributes to optimization. State-of-the-art performance on established benchmarks demonstrates the efficacy of the proposed ASK framework.
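To make the Gradient Locality Bottleneck concrete, the sketch below (not the authors' code; a generic NumPy illustration under standard assumptions) implements the symmetric in-batch InfoNCE objective that the abstract refers to: each audio embedding is contrasted only against the B text embeddings of its own mini-batch, so gradients never involve out-of-batch items.

```python
import numpy as np

def in_batch_infonce(audio_emb, text_emb, temperature=0.07):
    """Symmetric in-batch contrastive (InfoNCE) loss.

    audio_emb, text_emb: (B, D) arrays; row i of each is a matched pair.
    Only the B x B in-batch similarity matrix enters the loss, which is
    exactly the locality the GLB names: no out-of-batch negatives exist.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (B, B); matched pairs on the diagonal

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; targets are the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio->text and text->audio retrieval directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Example: a mini-batch of 4 pairs with 16-dim embeddings.
rng = np.random.default_rng(0)
loss = in_batch_infonce(rng.standard_normal((4, 16)),
                        rng.standard_normal((4, 16)))
```

Knowledge-enhanced methods sidestep this locality by injecting extra comparisons from an external store; RDM arises when that store's embeddings go stale relative to the training encoders.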
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: contrastive learning; knowledge base construction; dense retrieval
Languages Studied: English
Submission Number: 595