Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Oral, CC BY 4.0
Abstract: Recent advances in audio-language joint learning, such as CLAP, have shown considerable success in multi-modal understanding tasks. These models typically aggregate uni-modal local representations, namely frame or word features, into global ones, on which a contrastive loss is applied to achieve coarse-grained cross-modal alignment. However, this paradigm may overlook frame-level correspondence with text, weakening explainability and performance on fine-grained text-audio tasks (e.g., text-to-audio grounding), and it may also undermine performance on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of the two modalities, we adopt a shared codebook that represents multi-modal global features over common bases, and regularize each codeword to encode modality-shared semantics, bridging the gap between frame and word features. On top of this framework, a locality-aware block purifies local patterns, and a hard-negative guided loss strengthens alignment. Extensive experiments on eleven zero-shot coarse- and fine-grained evaluation protocols show that our model not only surpasses the CLAP baseline significantly but also yields superior or competitive results compared to current state-of-the-art methods. The code and model will be released upon paper acceptance.
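To give a rough sense of the kind of hard-negative guided contrastive objective mentioned in the abstract, the sketch below shows a symmetric audio-text InfoNCE loss with a simple hinge term that up-weights the hardest in-batch negative. This is a minimal illustration under assumptions, not the paper's actual formulation: the function name, the hinge-style hard-negative term, and the `beta` coefficient are all placeholders for exposition.

```python
# Illustrative sketch only (assumed formulation, not the authors' method):
# symmetric audio-text contrastive loss plus a hard-negative hinge term.
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(audio_emb, text_emb, temperature=0.07, beta=0.5):
    """audio_emb, text_emb: (B, D) L2-normalized global embeddings of paired clips/captions."""
    logits = audio_emb @ text_emb.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)      # matched pairs on the diagonal

    # Standard symmetric InfoNCE terms (audio->text and text->audio).
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)

    # Assumed hard-negative term: hinge on the most confusing negative per audio clip.
    neg_mask = ~torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    hardest_neg = logits.masked_fill(~neg_mask, float("-inf")).max(dim=1).values
    pos = logits.diagonal()
    hard_term = F.relu(hardest_neg - pos).mean()

    return 0.5 * (loss_a2t + loss_t2a) + beta * hard_term
```

In this sketch, `beta` trades off the plain contrastive objective against the extra penalty on hard negatives; the paper's actual loss may weight or select negatives differently.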
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Generation] Multimedia Foundation Models, [Content] Multimodal Fusion
Relevance To Conference: Our work focuses on language-audio joint pre-training, aiming to achieve both coarse-grained and fine-grained alignment between the audio and text modalities, which is a key concern and challenge in multimedia/multimodal interpretation. The resulting language-audio foundation model can then be applied to a broad range of cross-modal understanding and generation tasks related to sound, such as audio-text retrieval, text-to-audio grounding, and text-guided audio generation, demonstrating better performance, potential, and scalability than prior works.
Submission Number: 2736