Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Self-supervised learning, Contrastive learning, Post-training, Audio, Zero-shot evaluation, Audio retrieval
TL;DR: The paper introduces TeminAL, a two-stage training method that instills temporal understanding in ALMs, improving temporal performance by 5.28% on ESC-50. ZSTE, a zero-shot evaluation strategy, shows TeminAL outperforms existing models on downstream tasks.
Abstract: Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs), such as CLAP, which establish a unified representation across audio and language modalities, have improved performance on various downstream tasks by providing audio encoders that are well aligned with text, and vice versa. These improvements are evident in areas such as zero-shot audio classification and audio retrieval. However, the ability of these models to understand natural language and temporal relations remains a largely unexplored and open field of research. In this paper, we propose $\textbf{TeminAL}$, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their inherent prior capabilities on audio-language tasks. We implement a two-stage training scheme, TeminAL A & B: the model first learns to differentiate between multiple sounds in TeminAL A, followed by a phase in TeminAL B that instills a sense of time and thereby enhances its temporal understanding. This approach yields an average performance gain of $5.28\%$ in temporal understanding on the benchmark ESC-50 dataset, while the model remains competitive in zero-shot retrieval and classification tasks on the AudioCaps and Clotho datasets. We also note the lack of proper evaluation techniques for contrastive ALMs and propose a general-purpose Zero-Shot Temporal Evaluation $\textbf{(ZSTE)}$ strategy for evaluating ALMs in zero-shot settings; ZSTE is applied to various prior models and offers a general recipe for evaluating any zero-shot contrastive model. The model trained with TeminAL outperforms current models on most downstream tasks.
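As a rough illustration of the zero-shot setting ZSTE targets, the sketch below scores an audio clip of "event A followed by event B" against the correct-order caption and its order-swapped counterpart in a CLAP-style shared embedding space. This is a minimal sketch under stated assumptions, not the paper's implementation: `embed_text` is an assumed encoder hook, the "followed by" caption template is hypothetical, and the paper's actual training loss and evaluation protocol may differ.

```python
import torch
import torch.nn.functional as F

def temporal_zero_shot_accuracy(audio_embs: torch.Tensor, event_pairs, embed_text) -> float:
    """Fraction of clips for which the correct-order caption scores highest.

    audio_embs:  (N, d) tensor of clip embeddings where event A precedes event B.
    event_pairs: list of (event_a, event_b) caption strings, one pair per clip.
    embed_text:  assumed hook mapping a list of strings to a (len, d) tensor.
    """
    correct = 0
    for emb, (a, b) in zip(audio_embs, event_pairs):
        captions = [f"{a} followed by {b}",   # caption in the true temporal order
                    f"{b} followed by {a}"]   # order-swapped hard negative
        t = F.normalize(embed_text(captions), dim=-1)  # (2, d) text embeddings
        e = F.normalize(emb, dim=-1)                   # (d,) audio embedding
        scores = t @ e                                 # cosine similarity per caption
        correct += int(scores.argmax().item() == 0)    # true order should win
    return correct / len(event_pairs)
```

With the encoders of any contrastively trained ALM plugged into `embed_text` and the audio embeddings precomputed, the returned fraction directly measures whether the model prefers the correct event order, which is the kind of temporal zero-shot probe the paper's evaluation strategy generalizes.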
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12416