WavLLM: Towards Robust and Adaptive Speech Large Language Model

ACL ARR 2024 June Submission 2703 Authors

15 Jun 2024 (modified: 12 Aug 2024) · CC BY 4.0
Abstract: Recent advancements in large language models (LLMs) have expanded their scope in natural language processing (NLP) to encompass multimodal functions. However, effectively integrating listening capabilities remains a significant challenge for generalization and complex auditory task execution. In this work, we introduce WavLLM, a robust and adaptive speech large language model featuring dual encoders: a Whisper encoder for semantic content and a WavLM encoder for speaker characteristics. Within a two-stage curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks, such as combinations of the elementary tasks. To enhance flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second, advanced multi-task training stage. We validate the proposed model on universal speech benchmarks and also apply it to a specialized speech question-answering (SQA) dataset and a speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks at the same model size, exhibiting robust generalization capabilities in executing complex tasks using a CoT approach. The code, models, audio samples, and SQA evaluation set can be accessed at \url{https://github.com/wavllm/wavllm-anonymous}.
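To make the abstract's architecture concrete, below is a minimal sketch (not the authors' released code) of the two components it names: a dual-encoder front end fusing a semantic stream (Whisper-like) with a speaker stream (WavLM-like), and a prompt-aware LoRA adapter. All module names, feature dimensions, and the scalar-gate realization of "prompt-aware" are illustrative assumptions; the paper's actual fusion and adaptation mechanisms may differ.

```python
# Illustrative sketch of WavLLM's dual-encoder fusion and a prompt-aware
# LoRA adapter. Dimensions and the gating scheme are assumptions.
import torch
import torch.nn as nn


class PromptAwareLoRALinear(nn.Module):
    """Frozen linear layer plus a LoRA branch scaled by a prompt-dependent gate.

    Assumption: "prompt-aware" is realized here as a scalar gate g(prompt) in
    [0, 1] multiplying the low-rank update; the paper may use another mechanism.
    """

    def __init__(self, d_in, d_out, rank=8, d_prompt=512):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)  # stands in for a pretrained weight
        for p in self.base.parameters():
            p.requires_grad_(False)          # base weights stay frozen
        self.lora_a = nn.Linear(d_in, rank, bias=False)
        self.lora_b = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # standard LoRA init: zero update
        self.gate = nn.Sequential(nn.Linear(d_prompt, 1), nn.Sigmoid())

    def forward(self, x, prompt_emb):
        g = self.gate(prompt_emb)            # (batch, 1), one scale per prompt
        return self.base(x) + g.unsqueeze(1) * self.lora_b(self.lora_a(x))


class DualEncoderFrontEnd(nn.Module):
    """Concatenate semantic and speaker feature streams, project to LLM space."""

    def __init__(self, d_semantic=1024, d_speaker=768, d_llm=4096):
        super().__init__()
        self.proj = nn.Linear(d_semantic + d_speaker, d_llm)

    def forward(self, semantic_feats, speaker_feats):
        # Both streams: (batch, frames, dim); assumes pre-aligned frame rates.
        fused = torch.cat([semantic_feats, speaker_feats], dim=-1)
        return self.proj(fused)


if __name__ == "__main__":
    b, t = 2, 50
    front = DualEncoderFrontEnd()
    adapter = PromptAwareLoRALinear(4096, 4096)
    speech = front(torch.randn(b, t, 1024), torch.randn(b, t, 768))
    prompt = torch.randn(b, 512)             # pooled text-prompt embedding
    print(adapter(speech, prompt).shape)     # torch.Size([2, 50, 4096])
```

Because the LoRA `B` matrix starts at zero and the base weights are frozen, the adapter initially reproduces the first-stage model exactly; the prompt-dependent gate then lets training modulate how strongly the low-rank update applies per instruction, which matches the abstract's goal of adapting to different tasks and prompts.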
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition; speech technologies; spoken language translation; spoken language understanding; QA via spoken queries
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English; German
Submission Number: 2703