LLaSO: A Reproducible Foundation for Large Speech-Language Models

ICLR 2026 Conference Submission 9566 Authors

17 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Speech-Language Models, speech-text alignment, instruction tuning, multimodal evaluation, paralinguistics, modality robustness, open-source benchmark, reproducibility
TL;DR: We present LLaSO, an open LSLM stack: a 12M-instance alignment corpus (LLaSO-Align), a 13.5M-instance instruction-tuning set (LLaSO-Instruct), a 15K-instance evaluation benchmark (LLaSO-Eval), and a 3.8B reference model. Across models, broader task coverage helps, but generalization to unseen modality configurations, especially pure audio, lags; interleaved and parallel decoding improve robustness.
Abstract: The development of Large Speech-Language Models (LSLMs) has been limited by fragmented architectures and poor transparency, making reproducibility and fair comparison difficult. In contrast to the vision–language domain, where open resources have driven rapid progress, LSLMs are often released only as model weights without their training data or configurations, leaving the field without common baselines. We present LLaSO, the first fully open, end-to-end framework for large-scale speech–language modeling. LLaSO consists of three key components: (1) LLaSO-Align, a 12M-instance speech–text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset for speech–text understanding; and (3) LLaSO-Eval, a standardized, reproducible benchmark for cross-modal evaluation. To demonstrate its utility, we train LLaSO-Base, a 3.8B-parameter reference model built entirely on public data. LLaSO-Base achieves a normalized score of 0.72, outperforming comparable models and providing a strong, reproducible baseline. Our analysis further shows that while broader training coverage improves performance, significant generalization gaps remain, especially in speech-only scenarios. By releasing datasets, benchmarks, and models together, LLaSO establishes an open standard for LSLMs, enabling unified research and faster community progress.
Primary Area: applications to computer vision, audio, language, and other modalities
Supplementary Material: zip
Submission Number: 9566