Whisper-UT: A Unified Translation Framework for Speech and Text

ACL ARR 2025 February Submission 7987 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and efficient framework that leverages lightweight adapters to enable seamless adaptation across tasks, including a multi-modal machine translation (MMT) task that explicitly conditions translation on both speech and source-language text inputs. By incorporating ASR hypotheses or ground-truth transcripts as prompts, this approach not only enables the system to process both modalities simultaneously but also significantly enhances speech translation (ST) performance through a 2-stage decoding strategy. While demonstrated using the Whisper model, our methods generalize to other similar multi-task systems. Experiments on multiple conversational speech translation corpora show that our approach achieves strong performance, surpassing multiple baselines. Additionally, we highlight the effectiveness of cross-modal and cross-task fine-tuning, which improves performance without requiring 3-way parallel data.
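The 2-stage decoding strategy can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the open-source openai-whisper package with Whisper's generic initial_prompt argument as a stand-in for the paper's prompting scheme, omits the lightweight adapters entirely, and the audio path and language code are placeholders.

```python
# Hypothetical sketch of 2-stage decoding with openai-whisper.
# Stage 1 produces an ASR hypothesis; stage 2 translates the same audio
# while conditioning the decoder on that hypothesis as a text prompt.
import whisper

model = whisper.load_model("large-v2")

# Stage 1: decode an ASR hypothesis in the source language.
asr = model.transcribe("audio.wav", task="transcribe", language="es")
hypothesis = asr["text"]

# Stage 2: translate, feeding the stage-1 hypothesis (or a ground-truth
# transcript, if available) so decoding sees both speech and source text.
st = model.transcribe(
    "audio.wav",
    task="translate",
    language="es",
    initial_prompt=hypothesis,
)
print(st["text"])
```

In this sketch the prompt is simply prepended as decoder context; the paper's MMT formulation conditions on the transcript more explicitly via fine-tuned adapters.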
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: speech translation
Languages Studied: Spanish, Chinese, English
Submission Number: 7987