Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy but cannot preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data makes it difficult to learn style transfer during translation. We propose an S2ST framework with style-transfer capability built on discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style-transfer ability without any speaker-parallel data and thereby overcoming the data-scarcity issue. Trained on extensive data, our model achieves zero-shot cross-lingual style transfer on source languages unseen during training. Experiments show that our model generates translated speech with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/.
Paper Type: short
Research Area: Speech recognition, text-to-speech and spoken language understanding
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, French, Spanish
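
The abstract outlines a three-stage pipeline: source speech is quantized into discrete semantic units, those units are translated into the target language, and an acoustic language model renders them as codec units while using the source speech itself as an in-context style prompt, so the speaker timbre is preserved without speaker-parallel data. The sketch below illustrates this flow under those assumptions; every function name, unit format, and value is a hypothetical placeholder, not the authors' actual code or API.

def speech_to_semantic_units(waveform):
    # Placeholder for a self-supervised encoder (HuBERT-style) plus quantization
    # into discrete semantic units; returns dummy unit ids here.
    return [int(abs(x) * 100) % 50 for x in waveform]

def translate_units(source_units):
    # Placeholder for the speech-to-unit translation model mapping source-language
    # semantic units to target-language semantic units.
    return list(reversed(source_units))

def encode_codec_units(waveform):
    # Placeholder for a neural codec encoder producing acoustic (codec) units that
    # carry speaker timbre; these serve as the in-context style prompt.
    return [int(abs(x) * 1000) % 1024 for x in waveform]

def acoustic_lm(target_semantic_units, style_prompt_codec_units):
    # Placeholder for the acoustic language model: prompted with codec units of the
    # source speech, it generates target codec units carrying the translated content
    # in the source speaker's style (in-context style transfer).
    return style_prompt_codec_units[:2] + [u * 3 % 1024 for u in target_semantic_units]

def decode_codec_units(codec_units):
    # Placeholder for the codec decoder turning codec units back into a waveform.
    return [u / 1024.0 for u in codec_units]

source_waveform = [0.01, -0.02, 0.03, -0.04]   # dummy source-language audio
semantic_units = speech_to_semantic_units(source_waveform)
translated_units = translate_units(semantic_units)
style_prompt = encode_codec_units(source_waveform)
translated_waveform = decode_codec_units(acoustic_lm(translated_units, style_prompt))
print(translated_waveform)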