Keywords: Large Audio Language Models (LALMs), Audio Large Language Models (Audio-LLMs)
TL;DR: We propose LAL, a lightweight audio-LLM integration scheme that improves efficiency while preserving performance, and show that different audio encoders benefit from different integration strategies, motivating PAL, an encoder-aware integrated audio-LLM.
Abstract: Integrating audio perception into large language models (LLMs) is an emerging research direction for enabling machine listening applications, yet the efficient transfer of rich audio semantics from audio encoders to LLMs remains underexplored. The most widely used integration paradigm projects the audio encoder’s output tokens into the LLM input space (e.g., via an MLP or a Q-Former) and then \emph{prepends or inserts} them into the text token sequence. We refer to this generic scheme as \emph{Prepend to the LLM’s input token space (PLITS)} integration. We propose an efficient alternative, \underline{L}ightweight \underline{A}udio \underline{L}LM Integration \textbf{(LAL)}. LAL introduces audio representations solely through the attention mechanism within different layers of the LLM, bypassing its feedforward module, and encodes rich audio semantics at a level of abstraction appropriate to each LLM block. This design significantly reduces computational overhead compared to existing integration approaches. Observing that Whisper-style speech encoders benefit from PLITS integration, we further propose an audio-encoder-aware approach for efficiently \underline{P}robing \underline{A}udio encoders via \underline{L}LM (\textbf{PAL}). In its multi-encoder form, PAL employs PLITS for the Whisper speech encoder and LAL for general audio encoders; in its unified-encoder form, it uses a single audio encoder but applies PLITS only to a compact set of speech summary tokens, integrating the full audio token sequence via LAL to preserve speech decoding capacity at low computational cost. Under an identical training curriculum, \textbf{LAL} consistently matches or outperforms existing integration approaches across multiple base LLMs and tasks. On general audio tasks, LAL achieves improvements of up to 30\% over a strong PLITS baseline, while reducing memory usage by about 60\% and increasing throughput by about 190\%.
Furthermore, as a general audio-music-speech LLM, \textbf{PAL} performs on par with a fully PLITS-integrated system while offering substantially improved computational and memory efficiency.
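The attention-only integration the abstract describes can be sketched as follows. This is a minimal PyTorch illustration of the general idea (audio tokens participate as extra keys/values in attention but never pass through the feedforward module), not the authors' implementation; the class and variable names are hypothetical.

```python
# Hypothetical sketch of LAL-style attention-only audio integration:
# audio tokens are injected as extra keys/values inside the attention
# layer, so they add no feedforward cost and no extra output positions.
import torch
import torch.nn as nn

class AttnOnlyAudioBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text, audio):
        # Queries come from text only; keys/values are [audio; text],
        # so text tokens can attend to audio without emitting audio outputs.
        kv = torch.cat([audio, text], dim=1)
        h, _ = self.attn(self.norm1(text), self.norm1(kv), self.norm1(kv))
        text = text + h
        # The feedforward module processes text tokens only (audio bypasses it).
        text = text + self.ffn(self.norm2(text))
        return text

block = AttnOnlyAudioBlock()
text = torch.randn(2, 10, 64)   # 10 text tokens per sequence
audio = torch.randn(2, 50, 64)  # 50 audio tokens, attended to but not emitted
out = block(text, audio)
print(out.shape)  # output length matches the text sequence: (2, 10, 64)
```

Because the audio tokens never occupy output positions or feedforward compute, the per-layer cost grows only in the attention key/value dimension, which is the source of the memory and throughput savings the abstract reports.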
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20556