BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions can be categorized into cascaded approaches, which limit the interaction between speech and LLMs, and end-to-end approaches, which rely on scarce speech instruction data. In this paper, we propose the BLSP approach, which Bootstraps Language-Speech Pre-training via behavior alignment, leveraging existing ASR training data. We achieve this by developing a lightweight modality adapter between a frozen speech encoder and an LLM, optimized to ensure that the LLM exhibits the same generation behavior irrespective of the modality of input: a speech segment or its transcript. We primarily focus on the continuation-writing behavior, as it broadly resembles next-token prediction, but also find that introducing other behaviors can lead to improved performance. We demonstrate that this simple process can extend the capabilities of LLMs to speech and achieve competitive performance compared to cascaded systems, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Languages Studied: English
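
As a rough illustration of the behavior-alignment idea described in the abstract, the sketch below shows what one training step might look like, assuming a HuggingFace-style frozen causal LM and paired ASR data. The ModalityAdapter architecture, the function names, and the generate-then-align loop are hypothetical simplifications for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAdapter(nn.Module):
    """Lightweight trainable adapter projecting frozen speech-encoder
    features into the LLM embedding space. A two-layer MLP is a
    hypothetical stand-in; the paper's exact design may differ."""
    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def behavior_alignment_step(llm, embed_tokens, speech_encoder, adapter,
                            speech, transcript_ids):
    """One behavior-alignment step on a (speech, transcript) ASR pair.
    `llm` is a frozen HuggingFace-style causal LM and `embed_tokens` its
    frozen token-embedding layer; only `adapter` receives gradients."""
    with torch.no_grad():
        # 1. Let the frozen LLM continue the *text* transcript; this
        #    continuation becomes the supervision target. Slice off the
        #    prompt, since generate() returns prompt + new tokens.
        full = llm.generate(input_ids=transcript_ids)
        target_ids = full[:, transcript_ids.size(1):]      # (B, T_cont)

        # 2. Encode the paired speech with the frozen speech encoder.
        speech_feats = speech_encoder(speech)              # (B, T_sp, D_sp)

    # 3. Project speech features into the LLM embedding space (trainable).
    speech_embeds = adapter(speech_feats)                  # (B, T_sp, D_llm)

    # 4. Teacher-force the continuation behind the speech prefix and apply
    #    next-token prediction only on the continuation positions, pushing
    #    the LLM to behave identically for speech and transcript inputs.
    inputs = torch.cat([speech_embeds, embed_tokens(target_ids)], dim=1)
    logits = llm(inputs_embeds=inputs).logits              # (B, T_sp+T_cont, V)
    pred = logits[:, speech_embeds.size(1) - 1 : -1, :]    # aligns with target_ids
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           target_ids.reshape(-1))
    return loss
```

In this framing only the adapter's parameters are trainable, which is what makes the approach lightweight: the speech encoder and the LLM stay frozen, and ordinary ASR data alone supplies the paired speech-text inputs.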