Abstract: This paper investigates cascade approaches that bridge speech and text foundation models (FMs) for speech translation (ST). We address two limitations of cascade systems: the propagation of speech recognition errors and the lack of access to acoustic information. We propose ReShape Attention (RSA), which bridges the speech embeddings of Whisper, a speech FM, to LLaMA2, a text FM. Speech and text embeddings differ in both sequence length and feature dimension, which makes merging them challenging. RSA reshapes the speech and text embeddings into sequences of subvectors that share the same feature dimension. It then performs cross-attention between these two sequences in the LLaMA2 layers, which allows the two embeddings to be combined. RSA lets the text FM directly access the speech FM embeddings and allows the entire ST system to be optimized for the input speech. RSA achieves an 8.5% relative improvement in BLEU score over the baseline ST system, which cascades Whisper and LLaMA2. Moreover, our analyses show that the proposed method can improve performance even with ground-truth transcriptions, which suggests that our bridging approach is not limited to mitigating the effect of recognition errors but can also exploit acoustic information.
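The reshape-then-attend idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes text subvectors act as queries and speech subvectors as keys and values, it omits learned projections and multiple heads, and it uses a residual sum as a stand-in for how the two embeddings are combined. The function name `reshape_attention`, the parameter `sub_dim`, and the toy dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reshape_attention(text_emb, speech_emb, sub_dim):
    """Toy cross-attention between reshaped text and speech embeddings.

    text_emb:   (text_len, text_dim)
    speech_emb: (speech_len, speech_dim)
    sub_dim:    shared subvector size (assumption: must divide both dims)
    """
    t_len, t_dim = text_emb.shape
    s_len, s_dim = speech_emb.shape
    if t_dim % sub_dim or s_dim % sub_dim:
        raise ValueError("sub_dim must divide both feature dimensions")

    # Reshape each embedding into a longer sequence of sub_dim-sized vectors,
    # so both sequences now share the same feature dimension.
    queries = text_emb.reshape(-1, sub_dim)        # (t_len * t_dim / sub_dim, sub_dim)
    keys_values = speech_emb.reshape(-1, sub_dim)  # (s_len * s_dim / sub_dim, sub_dim)

    # Scaled dot-product cross-attention (no learned projections in this sketch).
    scores = queries @ keys_values.T / np.sqrt(sub_dim)
    weights = softmax(scores, axis=-1)
    attended = weights @ keys_values

    # Restore the text shape and combine with a residual sum (an assumption).
    return text_emb + attended.reshape(t_len, t_dim)


rng = np.random.default_rng(0)
text = rng.standard_normal((3, 8))    # toy stand-in for text FM hidden states
speech = rng.standard_normal((5, 6))  # toy stand-in for speech FM embeddings
fused = reshape_attention(text, speech, sub_dim=2)
print(fused.shape)  # (3, 8)
```

In this sketch the output keeps the text embedding's shape, so it could in principle be passed on to the next layer unchanged. The toy dimensions (8 and 6) were chosen only so that a common divisor exists.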
External IDs: dblp:conf/icassp/KanoODCFMA025