ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

ACL ARR 2026 January Submission 595 Authors

23 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Audio-Text Retrieval, Multimodal Learning, Contrastive Learning, Knowledge Augmentation, Representation Learning
Abstract: The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch contrastive learning. This process is constrained by what we define as the Gradient Locality Bottleneck (GLB), in which optimization is limited to in-batch contrasts. This prevents the model from leveraging out-of-batch knowledge and consequently hinders long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability weighting scheme to ensure that consistent knowledge contributes to optimization. State-of-the-art performance on established benchmarks demonstrates the efficacy of the proposed ASK framework.
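To make the Gradient Locality Bottleneck concrete, the sketch below implements the standard symmetric in-batch InfoNCE objective that mini-batch ATR training typically uses. It is an illustrative NumPy sketch (all function and variable names are our own, not from the paper): for a batch of B audio-text pairs, each pair (i, i) is the positive and the other B-1 in-batch items are the only negatives, so the gradient can never involve out-of-batch examples.

```python
import numpy as np

def in_batch_info_nce(audio_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over one mini-batch of paired embeddings.

    Positives lie on the diagonal of the (B, B) similarity matrix;
    every other in-batch item acts as a negative. The loss (and thus
    the gradient) touches only these B items -- the in-batch locality
    the GLB refers to.
    """
    # L2-normalize so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau                   # (B, B) scaled similarities
    idx = np.arange(logits.shape[0])         # positive index = diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)             # stability shift
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()                    # diagonal targets

    # Average the audio->text and text->audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
B, d = 8, 16
audio = rng.normal(size=(B, d))
loss_random = in_batch_info_nce(audio, rng.normal(size=(B, d)))
loss_aligned = in_batch_info_nce(audio, audio)  # perfectly matched pairs
```

Under this objective, perfectly aligned pairs yield a much lower loss than random ones; the knowledge-injection step described in the abstract can be read as supplying contrast targets beyond this (B, B) matrix.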
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: contrastive learning; knowledge base construction; dense retrieval
Languages Studied: English
Submission Number: 595