Abstract: We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input
speaker’s voice zero-shot while translating from multiple source
languages into English. We experiment with the synthesizer
component of the architecture, comparing a Tacotron-based
synthesizer to a novel diffusion-based synthesizer. We find the
diffusion-based synthesizer to improve MOS and PESQ audio
quality metrics by 23% each and speaker similarity by 5% while
maintaining comparable BLEU scores. Despite having more
than double the parameter count, the diffusion synthesizer has
lower latency, allowing the entire model to run more than 5×
faster than real-time.