openLISTEN: Freestyle Cross-Modal Instruction Compliance for Large Speech-Language Models with Limited Resources

ACL ARR 2026 January Submission 5860 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Speech-Language Models, Freestyle Instruction, Modality Alignment, Gated Cross-Attention, Conversational AI
Abstract: Recent advancements in Large Language Models (LLMs) have revolutionized text comprehension, yet bridging the gap to speech-native understanding remains a challenge due to the loss of paralinguistic features in cascaded systems and the high computational costs of end-to-end vocabulary expansion. To address these limitations, we propose openLISTEN, a resource-efficient Large Speech-Language Model (LSLM) trained entirely on consumer-grade GPUs. openLISTEN integrates Gated Cross-Attention (GCA) with Open-Domain Cross-Modal Instruction Tuning to learn robust audio–text alignment from only 500+ hours of paired data. Extensive evaluations on URO-Bench show that openLISTEN performs strongly under resource-efficient training, and controlled ablations consistently favor GCA over alternative fusion designs. Furthermore, empirical results on cross-modal instruction compliance benchmarks indicate that our approach effectively mitigates the rigid response patterns and modality bias typically exacerbated by limited training data, thereby significantly enhancing instruction adherence and generalization in open-domain scenarios. The code will be available at https://anonymous.4open.science/r/openLISTEN-8D11
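To make the fusion mechanism named in the abstract concrete, below is a minimal, hedged sketch of a Flamingo-style gated cross-attention block in PyTorch. It is not the openLISTEN implementation from the submission: the module layout, dimensions, and the zero-initialized tanh gate are illustrative assumptions only.

```python
# Hypothetical sketch of a gated cross-attention (GCA) block, NOT the authors'
# implementation: module names, dimensions, and the zero-initialized tanh gate
# are assumptions chosen for illustration.
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Text hidden states attend over audio features; a learnable gate
    (initialized to zero) blends the fused signal in gradually, so the
    underlying LLM's behavior is unchanged at the start of training."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # tanh(0) = 0, so the block starts as an identity mapping.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_h: torch.Tensor, audio_h: torch.Tensor) -> torch.Tensor:
        # text_h:  (batch, text_len, d_model)  queries from the language model
        # audio_h: (batch, audio_len, d_model) keys/values from the speech encoder
        attn_out, _ = self.cross_attn(self.norm(text_h), audio_h, audio_h)
        return text_h + torch.tanh(self.gate) * attn_out


if __name__ == "__main__":
    block = GatedCrossAttentionBlock()
    text_h = torch.randn(2, 16, 1024)    # dummy text hidden states
    audio_h = torch.randn(2, 50, 1024)   # dummy projected audio features
    print(block(text_h, audio_h).shape)  # torch.Size([2, 16, 1024])
```

The zero-initialized gate is one common way to inject a new modality into a pretrained LLM without disturbing its text-only behavior early in training; whether openLISTEN uses exactly this gating scheme is detailed in the paper itself.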
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: spoken language understanding, spoken dialog, speech technologies, multimodality, model architectures, LLM Efficiency, data-efficient training
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 5860