Abstract: The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions can be categorized into cascaded approaches, which limit the interaction between speech and LLMs, and end-to-end approaches, which rely on scarce speech instruction data. In this paper, we propose the BLSP approach, which Bootstraps Language-Speech Pre-training via behavior alignment, leveraging existing ASR training data. We achieve this by developing a lightweight modality adapter between a frozen speech encoder and an LLM, optimized to ensure that the LLM exhibits the same generation behavior irrespective of the input modality: a speech segment or its transcript. We primarily focus on the continuation-writing behavior, as it closely resembles next-token prediction in a broad sense, but also find that introducing other behaviors can lead to improved performance. We demonstrate that this simple process can extend the capabilities of LLMs to speech and achieve competitive performance compared to cascaded systems, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
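To make the behavior-alignment idea concrete, below is a minimal sketch of one training step. It assumes a Hugging Face causal LM, a frozen speech encoder, and an illustrative `ModalityAdapter` module; all names and the continuation prompt are hypothetical, not the authors' released code.

```python
# Hedged sketch of BLSP-style behavior alignment (illustrative names only).
# Step 1: the frozen LLM continues the ASR transcript, defining target behavior.
# Step 2: the adapter is trained so the LLM produces the same continuation
#         when conditioned on speech embeddings instead of the transcript.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Lightweight trainable module mapping frozen speech-encoder features
    into the LLM's embedding space (a single projection here for brevity)."""

    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(speech_feats)


def behavior_alignment_loss(llm, tokenizer, adapter, speech_feats, transcript,
                            device="cpu"):
    """One training step: distill the LLM's text-conditioned continuation
    into the speech-conditioned pathway via next-token prediction."""
    prompt = f"Continue the following text: {transcript}"  # hypothetical prompt
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():  # LLM stays frozen; generation defines the target
        continuation_ids = llm.generate(
            prompt_ids, max_new_tokens=64
        )[:, prompt_ids.size(1):]

    # Build the speech-conditioned input: instruction embeddings,
    # adapted speech embeddings, then the target continuation tokens.
    instr_ids = tokenizer("Continue the following text: ",
                          return_tensors="pt").input_ids.to(device)
    embed = llm.get_input_embeddings()
    speech_embeds = adapter(speech_feats)  # (1, T_speech, llm_dim)
    inputs_embeds = torch.cat(
        [embed(instr_ids), speech_embeds, embed(continuation_ids)], dim=1
    )

    # Supervise only the continuation span; mask prompt and speech positions.
    ignore = torch.full(
        (1, instr_ids.size(1) + speech_embeds.size(1)),
        -100, dtype=torch.long, device=device
    )
    labels = torch.cat([ignore, continuation_ids], dim=1)
    return llm(inputs_embeds=inputs_embeds, labels=labels).loss
```

In such a setup, only `adapter.parameters()` would be handed to the optimizer, keeping both the speech encoder and the LLM frozen, consistent with the lightweight-adapter design described in the abstract.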
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Languages Studied: English