Cross-Attention Reprogramming for ASR: Bridging Discrete Speech Units and Pretrained Language Models

Published: 2026 · Last Modified: 10 Feb 2026 · IEEE Access 2026 · CC BY-SA 4.0
Abstract: In automatic speech recognition (ASR), an emerging trend involves converting continuous speech features into sequences of discrete speech units (DSUs) via quantization. A key advantage of DSU representations is their compatibility with pretrained language models (PLMs): DSUs are directly mapped to PLM token indices and the embedding layer is fine-tuned. However, this conventional strategy often relies heavily on large-scale training data to mitigate the inherent modality mismatch. Motivated by this, we explore a more effective way to exploit the PLM embedding dictionary. Drawing inspiration from Time-LLM, a recent time-series forecasting model, we propose a cross-attention reprogramming mechanism that incorporates codebook information from the DSU quantizer to better align the DSUs with the PLM embeddings. Compared to direct fine-tuning of PLM embeddings, our method consistently achieves improvements on the Discrete Audio and Speech Benchmark (DASB), reaching state-of-the-art performance across most DASB-style settings. We also evaluate our method on LibriSpeech-960, LibriLight-10, and the Swedish, Czech, and Hungarian subsets of Common Voice, and observe similar trends. Notably, the proposed reprogramming method demonstrates significant gains over the fine-tuning baseline, particularly in cross-lingual and low-resource scenarios. This study proposes a new approach to using PLM embedding dictionaries in DSU-based ASR, and lays a foundation for combining speech representations with large language models in other discriminative speech-processing tasks such as speech emotion recognition and spoken question answering. The inference recipe is available at https://github.com/d06921018/CATT-RPG_for_ASR
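To make the core idea concrete, here is a minimal NumPy sketch of cross-attention reprogramming in the Time-LLM spirit: DSU codebook vectors act as queries that attend over the (frozen) PLM embedding dictionary, producing reprogrammed embeddings in the PLM space. All dimensions, projection matrices, and the single-head formulation are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reprogram(dsu_codebook, plm_embeddings, Wq, Wk, Wv):
    """Cross-attention reprogramming sketch.

    dsu_codebook:   (K, d_dsu) quantizer codebook vectors -> queries
    plm_embeddings: (V, d_plm) frozen PLM embedding dictionary -> keys/values
    Returns (K, d) reprogrammed embeddings aligned with the PLM space.
    """
    Q = dsu_codebook @ Wq                                     # (K, d)
    Km = plm_embeddings @ Wk                                  # (V, d)
    Vm = plm_embeddings @ Wv                                  # (V, d)
    attn = softmax(Q @ Km.T / np.sqrt(Q.shape[-1]), axis=-1)  # (K, V)
    return attn @ Vm                                          # (K, d)

# Toy dimensions (hypothetical): 8 DSU codes, 16-token PLM vocabulary.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))
plm_emb = rng.standard_normal((16, 6))
Wq = rng.standard_normal((4, 6))
Wk = rng.standard_normal((6, 6))
Wv = rng.standard_normal((6, 6))
out = reprogram(codebook, plm_emb, Wq, Wk, Wv)
print(out.shape)  # each DSU code now has an embedding in the PLM space
```

In an ASR pipeline, a DSU index sequence would then be embedded by looking up rows of `out` instead of fine-tuned PLM token embeddings; only the projection matrices need training, which is one plausible reason the approach helps in low-resource settings.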