TL;DR: We propose MATS, an audio-language multimodal LLM designed to handle Multiple Audio tasks using solely Text-only Supervision.
Abstract: Large audio-language models (LALMs), built upon powerful Large Language Models (LLMs), have exhibited remarkable audio comprehension and reasoning capabilities.
However, training LALMs demands a large corpus of audio-language pairs, which incurs substantial costs in both data collection and training resources. In this paper, we propose **MATS**, an audio-language multimodal LLM designed to handle **M**ultiple **A**udio tasks using solely **T**ext-only **S**upervision. By leveraging pre-trained audio-language alignment models such as CLAP, we develop a text-only training strategy that projects the shared audio-language latent space into the LLM latent space, endowing the LLM with audio comprehension capabilities without relying on audio data during training. To further bridge the modality gap between audio and language embeddings within CLAP, we propose the **S**trongly-rel**a**ted **n**oisy **t**ext with **a**udio (**Santa**) mechanism. Santa maps audio embeddings into the CLAP language embedding space while preserving essential information from the audio input. Extensive experiments demonstrate that MATS, despite being trained exclusively on text data, achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs. The code is publicly available at [https://github.com/wangwen-banban/MATS](https://github.com/wangwen-banban/MATS).
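To make the text-only training strategy concrete, below is a minimal, self-contained sketch of the general idea (our illustration, not the paper's implementation): a projector is trained to map CLAP text embeddings into the LLM latent space as a soft prompt, and at inference the same projector is applied to CLAP audio embeddings, which live in the shared CLAP space. The module sizes, the toy frozen LM, and the function names are hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: CLAP embeddings are typically 512-d; the LLM hidden
# size depends on the backbone (a small value is used here for brevity).
CLAP_DIM, LLM_DIM, VOCAB = 512, 1024, 32000

# Trainable projector from the shared CLAP space into the LLM latent space.
projector = nn.Sequential(
    nn.Linear(CLAP_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

# Stand-in frozen "LLM": token embedding, one transformer layer, LM head
# (causal masking omitted for brevity).
tok_emb = nn.Embedding(VOCAB, LLM_DIM)
block = nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8, batch_first=True)
lm_head = nn.Linear(LLM_DIM, VOCAB)
for m in (tok_emb, block, lm_head):
    m.requires_grad_(False)

opt = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def text_only_step(clap_text_emb, target_ids):
    """One training step that never touches audio: a CLAP *text* embedding
    is projected into the LLM space and prepended as a soft prompt, and the
    frozen LM is supervised to generate the paired caption/answer tokens."""
    prefix = projector(clap_text_emb).unsqueeze(1)           # (B, 1, LLM_DIM)
    inputs = torch.cat([prefix, tok_emb(target_ids[:, :-1])], dim=1)
    logits = lm_head(block(inputs))                          # (B, T, VOCAB)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), target_ids.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random stand-ins for a CLAP text embedding and token ids:
clap_text_emb = torch.randn(4, CLAP_DIM)
target_ids = torch.randint(0, VOCAB, (4, 16))
print(text_only_step(clap_text_emb, target_ids))

# At inference, CLAP encodes the *audio* into (approximately) the same
# shared space, so the projector trained on text embeddings carries over:
#   prefix = projector(clap_audio_emb).unsqueeze(1)
```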
Lay Summary: Training large audio-language models (LALMs) is expensive, as it requires large-scale paired audio-text data that is costly to collect and process. To address this limitation, we ask: can a model learn to understand audio without ever being exposed to audio during training?\
We present **MATS**, a multimodal large language model designed to perform a wide range of audio tasks using only text supervision. Our core idea is to leverage pre-trained audio-language alignment models such as CLAP, which embed audio and text into a shared latent space. By projecting this shared space into the language model’s embedding space, we enable MATS to acquire audio comprehension abilities without any audio data during training. To further enhance the audio-language alignment, we propose **Santa**, a mechanism that balances similarity-weighted language representations of audio embeddings against the original audio features, preserving essential audio information. \
Remarkably, despite never seeing audio during training, MATS achieves competitive performance compared to models trained on large-scale audio-language datasets. Our approach offers a promising direction for building cost-efficient, scalable multimodal systems.
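One way to picture the balancing performed by the Santa mechanism described above (our reading of this summary, not the exact formulation in the paper) is sketched below: a bank of CLAP text embeddings is softmax-weighted by its similarity to the audio embedding, and the weighted text representation is mixed back with the original audio embedding. The function name, temperature, and mixing coefficient `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def santa_like_mapping(audio_emb, text_bank, temperature=0.07, alpha=0.5):
    """Map a CLAP audio embedding toward the CLAP text space.

    audio_emb : (D,)   CLAP audio embedding
    text_bank : (N, D) CLAP text embeddings of candidate captions
    alpha     : mixing weight between the similarity-weighted text
                representation and the original audio embedding
    (temperature and alpha are illustrative values, not from the paper)
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_bank, dim=-1)
    sims = t @ a                                  # (N,) cosine similarities
    weights = F.softmax(sims / temperature, dim=-1)
    text_side = weights @ text_bank               # similarity-weighted text emb
    return alpha * text_side + (1 - alpha) * audio_emb

# Example with random stand-ins for CLAP outputs:
audio_emb = torch.randn(512)
text_bank = torch.randn(1000, 512)
mapped = santa_like_mapping(audio_emb, text_bank)
```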
Link To Code: https://github.com/wangwen-banban/MATS
Primary Area: Deep Learning->Large Language Models
Keywords: Large Audio-Language Models, Large Language Models, Audio-Language Contrastive Models
Submission Number: 8803