Extending Large Language Models for Speech and Audio Captioning

Published: 2024, Last Modified: 20 May 2025ICASSP 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Multimodal large language models (LLMs) have shown promising visual perception abilities by connecting with image encoders, but their performance on auditory tasks has not yet been widely investigated. Meanwhile, automatic speech recognition (ASR) and automatic audio captioning (AAC) are often achieved with separate systems, resulting in incomplete auditory perception abilities. To fill in these gaps, in this paper, we present the first study that achieves both ASR and AAC by connecting an LLM with auditory encoders. A dual auditory encoder structure is proposed, integrating the Whisper encoder for speech and the BEATs encoder for audio events with a high temporal resolution by using a Q-Former at the window level. Experiments for ASR and AAC are performed correspondingly on the widely used LibriSpeech, GigaSpeech, WavCaps, AudioCaps, and Clotho datasets and yield promising results. In particular, state-of-the-art results are achieved on GigaSpeech, AudioCaps and Clotho. Our model is also able to caption speech and audio events simultaneously from clips with mixed speech and background audio events, which is a step towards more complete machine auditory perception.
Loading