Cooperative Game Modeling With Weighted Token-Level Alignment for Audio-Text Retrieval

Yifei Xin, Baojun Wang, Lifeng Shang

Published: 2023, Last Modified: 17 Mar 2026. IEEE Signal Process. Lett. 2023. License: CC BY-SA 4.0

Abstract: Previous audio-text retrieval (ATR) methods primarily construct contrastive pairs between entire audio clips and full caption sentences, neglecting fine-grained cross-modal relationships. In this letter, we first introduce a weighted token-level alignment (WTA) module for ATR to learn fine-grained semantic interactions. In addition, because manual labels for the fine-grained sequential correspondence between audio-text pairs are unavailable, we model ATR as a cooperative game to flexibly handle the uncertainty in audio-text semantic interactions. Specifically, we treat audio frames and text words as players and present a game-theoretic interaction (GTI) method to assess the potential correspondence between audio frames and text words, which also serves as an additional learning signal to improve pure audio-text contrastive learning. Furthermore, to implement multi-level WTA and GTI, we develop a token cluster module that clusters the frames/words and calculates the interaction scores between the clustered tokens. Experiments show that WTA significantly improves ATR performance on multiple datasets, and that combining it with GTI boosts retrieval performance further by a large margin.
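
To make the idea of weighted token-level alignment more concrete, the following is a minimal sketch of one common way such a score can be computed. It is not taken from the letter: the function name `weighted_token_alignment`, the linear scoring vectors `w_audio` and `w_text`, the max-over-tokens matching, and the equal averaging of the two directions are all assumptions made for illustration. The abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def weighted_token_alignment(audio, text, w_audio, w_text):
    """Hypothetical weighted token-level similarity between one audio clip and one caption.

    audio:   (F, D) array of audio frame embeddings
    text:    (W, D) array of word embeddings
    w_audio: (D,) assumed scoring vector that produces per-frame weights
    w_text:  (D,) assumed scoring vector that produces per-word weights
    """
    # Cosine similarity between every frame and every word: shape (F, W).
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    sim = a @ t.T

    # Token importance weights (assumption: softmax over a linear score).
    frame_w = softmax(audio @ w_audio)  # (F,), sums to 1
    word_w = softmax(text @ w_text)     # (W,), sums to 1

    # Each frame is matched to its best word and vice versa,
    # then the matches are combined using the token weights.
    audio_to_text = np.sum(frame_w * sim.max(axis=1))
    text_to_audio = np.sum(word_w * sim.max(axis=0))
    return 0.5 * (audio_to_text + text_to_audio)


# Toy usage with random embeddings (illustrative data only).
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 8))
words = rng.normal(size=(4, 8))
score = weighted_token_alignment(
    frames, words, rng.normal(size=8), rng.normal(size=8)
)
```

In a contrastive setup, a score of this kind would replace the single clip-to-sentence similarity, so that matched pairs are pulled together based on their best-aligned frames and words rather than on one pooled vector per modality.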