Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition

Published: 20 Jul 2024, Last Modified: 03 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Large-scale pre-trained audio-language models excel at general multi-modal representation, facilitating their adaptation to downstream audio recognition tasks in a data-efficient manner. However, existing few-shot audio recognition methods based on audio-language models primarily learn coarse-grained correlations, which are insufficient to capture the intricate matching patterns between the multi-level information of audio and the diverse characteristics of category concepts. To address this gap, we propose multi-grained correspondence learning for bootstrapping audio-language models to improve audio recognition with few training samples. Our approach leverages generative models to enrich multi-modal representation learning, mining the multi-level information of audio alongside the diverse characteristics of category concepts. Multi-grained matching patterns are then established through a multi-grained key-value cache and multi-grained cross-modal contrast, enhancing the alignment between audio and category concepts. Additionally, we incorporate optimal transport to tackle temporal misalignment and semantic intersection issues in fine-grained correspondence learning, enabling flexible fine-grained matching. Our method achieves state-of-the-art results on multiple benchmark datasets for few-shot audio recognition, and comprehensive ablation experiments validate its effectiveness.
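The abstract does not spell out how the key-value cache is built. Below is a minimal sketch of a Tip-Adapter-style cache classifier at a single granularity, which we assume is the kind of mechanism the "multi-grained key-value cache" generalizes: few-shot audio embeddings act as keys, their one-hot labels as values, and the retrieved label evidence is blended with zero-shot audio-text similarity. All names and the hyperparameters alpha and beta are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def cache_logits(audio_emb, cache_keys, cache_values, text_emb,
                 alpha=1.0, beta=5.0):
    """Blend zero-shot audio-text similarity with a few-shot key-value cache.

    audio_emb    -- (B, D) L2-normalised query audio embeddings
    cache_keys   -- (N, D) L2-normalised few-shot audio embeddings (keys)
    cache_values -- (N, C) one-hot labels of the few-shot samples (values)
    text_emb     -- (C, D) L2-normalised class-prompt text embeddings
    """
    zero_shot = audio_emb @ text_emb.T                              # (B, C)
    # Affinity to each cached key: sharpened exponential of cosine similarity.
    affinity = torch.exp(-beta * (1.0 - audio_emb @ cache_keys.T))  # (B, N)
    # Retrieve label evidence from the values and add the zero-shot prior.
    return zero_shot + alpha * (affinity @ cache_values)            # (B, C)

# Toy usage with random (hence meaningless) embeddings:
B, N, C, D = 2, 8, 4, 512
norm = lambda x: x / x.norm(dim=-1, keepdim=True)
logits = cache_logits(norm(torch.randn(B, D)), norm(torch.randn(N, D)),
                      torch.eye(C).repeat(N // C, 1), norm(torch.randn(C, D)))
```

A multi-grained variant would presumably maintain one such cache per granularity (e.g., clip-level and segment-level embeddings) and fuse their logits.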
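For the "multi-grained cross-modal contrast" component, a natural reading is a symmetric InfoNCE objective between paired audio and text embeddings, applied at each granularity. The sketch below shows the standard single-granularity form only; the function name and temperature are our assumptions, and how the paper weights losses across granularities is not stated in the abstract.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrast(audio_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE between paired audio and text embeddings.

    audio_emb, text_emb -- (B, D); row i of each forms a matching pair.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / tau                      # (B, B) pairwise similarities
    targets = torch.arange(a.size(0))           # diagonal pairs are positives
    # Average the audio-to-text and text-to-audio classification losses.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```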
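Finally, the optimal-transport step for fine-grained matching can be illustrated with entropic OT solved by Sinkhorn iterations: audio frames and textual concept tokens are treated as two discrete distributions, and the transport plan softly aligns frames to tokens despite temporal misalignment. This is a generic Sinkhorn sketch under assumed uniform marginals; epsilon, the iteration count, and the cost choice (1 minus cosine similarity) are illustrative, not taken from the paper.

```python
import torch

def sinkhorn(cost, eps=0.1, n_iters=50):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations.

    cost -- (T, K) pairwise cost, e.g. 1 - cosine similarity between
            T audio frames and K textual concept tokens.
    Returns the (T, K) transport plan.
    """
    T, K = cost.shape
    mu = torch.full((T,), 1.0 / T)        # uniform mass over audio frames
    nu = torch.full((K,), 1.0 / K)        # uniform mass over text tokens
    Kmat = torch.exp(-cost / eps)         # Gibbs kernel
    u = torch.ones(T)
    for _ in range(n_iters):
        # Alternate scaling updates: v = nu / (K^T u), then u = mu / (K v).
        u = mu / (Kmat @ (nu / (Kmat.T @ u)))
    v = nu / (Kmat.T @ u)
    return u.unsqueeze(1) * Kmat * v.unsqueeze(0)

# A fine-grained matching score could then be the negative transport cost:
# score = -(sinkhorn(cost) * cost).sum()
```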
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our work contributes to multimedia/multimodal processing by adapting pre-trained audio-language models to downstream audio recognition tasks with few training samples. The proposed multi-grained correspondence learning goes beyond coarse-grained adaptation methods, integrating the multi-level information of audio with textual descriptions of category concepts and thereby enriching cross-modal representation learning. The method substantially improves few-shot audio recognition accuracy, with direct applications in audio content understanding and intelligent audio interaction. In summary, our work advances multimedia/multimodal processing for audio recognition in data-scarce settings.
Submission Number: 3695