Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Byoung Jin Choi, Myeonghun Jeong, Minchan Kim, Nam Soo Kim

Published: 01 Jan 2024, Last Modified: 09 Jan 2026, Venue: IEEE Signal Processing Letters, License: CC BY-SA 4.0

Abstract: In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for flow-based text-to-speech (TTS) architectures in the context of zero-shot multi-speaker text-to-speech (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach extracts a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function used in flow-based TTS architectures by introducing attentive speaker conditioning, which allows the speaker conditioning to vary locally. Our experiments demonstrate the effectiveness of the proposed method, showing improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
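The abstract gives no implementation details, so the following is only a minimal illustrative sketch of what an affine coupling step with attention over a variable-length reference sequence could look like. It is not the authors' architecture. The function names (`attend`, `scale_and_shift`, `coupling_forward`, `coupling_inverse`), the use of scaled dot-product attention, the fixed toy scale-and-shift function standing in for a learned network, and the assumption that all feature dimensions are equal are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, reference):
    """Scaled dot-product attention of one frame over the reference sequence.

    `reference` may have any length; the returned context vector always has
    the reference feature dimension. Keys and values are both the raw
    reference embeddings here (assumption: no learned projections).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * r for q, r in zip(query, ref)) / scale for ref in reference]
    weights = softmax(scores)
    dim = len(reference[0])
    return [sum(w * ref[d] for w, ref in zip(weights, reference)) for d in range(dim)]


def scale_and_shift(frame, context):
    """Toy stand-in for the learned network that outputs (log_scale, shift)."""
    hidden = [a + c for a, c in zip(frame, context)]
    log_scale = [math.tanh(h) for h in hidden]  # bounded for stability
    shift = [0.5 * h for h in hidden]
    return log_scale, shift


def coupling_forward(x_a, x_b, reference):
    """x_a passes through unchanged; x_b is transformed frame by frame."""
    y_b = []
    for a_t, b_t in zip(x_a, x_b):
        log_s, shift = scale_and_shift(a_t, attend(a_t, reference))
        y_b.append([b * math.exp(s) + t for b, s, t in zip(b_t, log_s, shift)])
    return x_a, y_b


def coupling_inverse(y_a, y_b, reference):
    """Exact inverse: the same context is recomputable from the untouched half."""
    x_b = []
    for a_t, b_t in zip(y_a, y_b):
        log_s, shift = scale_and_shift(a_t, attend(a_t, reference))
        x_b.append([(b - t) * math.exp(-s) for b, s, t in zip(b_t, log_s, shift)])
    return y_a, x_b


rng = random.Random(0)


def rand_seq(length, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


x_a, x_b = rand_seq(5, 4), rand_seq(5, 4)
for ref_len in (3, 11):  # reference sequences of different lengths
    reference = rand_seq(ref_len, 4)
    _, y_b = coupling_forward(x_a, x_b, reference)
    _, x_rec = coupling_inverse(x_a, y_b, reference)
    max_err = max(abs(u - v) for ru, rv in zip(x_b, x_rec) for u, v in zip(ru, rv))
    print(f"reference length {ref_len}: max reconstruction error {max_err:.2e}")
```

In this sketch each frame attends to the reference sequence separately, so the conditioning can differ from frame to frame, and the reference length can change without changing the coupling function. Invertibility is preserved because the attention query depends only on the half of the input that the coupling step leaves unchanged.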

External IDs: doi:10.1109/lsp.2024.3377588