A Survey of Audio Language Models: Data, Architecture and Training Strategies

ACL ARR 2025 May Submission 497 Authors

13 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Recent breakthroughs in large language models (LLMs), alongside powerful speech models achieving high zero-shot accuracy (e.g., Whisper), have catalyzed the emergence of Audio LLMs---unified models bridging acoustic and linguistic modalities. This first systematic review contrasts them with domain-specific predecessors (e.g., Wav2Vec 2.0 for speech, BERT for text). We analyze audio's dual nature through HuBERT units and expose data biases (e.g., 82% English in Common Voice vs. <3% Swahili). Architecturally, block-sparse attention (BSA) cuts memory use by 40% for 1-hour audio. Alignment strategies like multimodal prompting achieve 90% voice cloning similarity with 3-second references. However, challenges remain: 40-60% higher WER in low-resource languages, ~50t CO₂ emissions per 1B-parameter model, and 300% annual rise in voice spoofing. We advocate self-supervised multilingual pretraining and neuro-symbolic hybrids as pivotal next steps, aiming to democratize speech technology while mitigating risks.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Audio, Speech, Spoken Language Understanding, Audio LLMs, Multimodal
Contribution Types: Surveys
Languages Studied: English
Keywords: Audio LLMs, Speech Processing, Large Language Models, Multimodal Learning, Self-supervised Learning, Post-training, Data Bias, Safety
Submission Number: 497