Bayesian Speech Synthesisers Can Learn from Multiple Teachers

Ziyang Zhang; Yifan Gao; Xuenan Xu; Baoxiangli; Wen Wu; Chao Zhang

15 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: TTS, Bayesian Evidential Learning, Multi-teacher Distillation

Abstract: Codec-based text-to-speech (TTS) models have recently gained traction for their efficiency and strong performance in voice cloning. However, codec-based TTS is limited by the difficulty of pretraining robust speech codecs and by the quality degradation that quantization errors introduce. Emerging evidence suggests that continuous-valued generative models can alleviate these issues and serve as a promising alternative. Yet effectively modelling diverse speech patterns and developing reliable sampling strategies for continuous-valued autoregressive (AR) TTS remain underexplored. In this work, we propose BELLE (Bayesian evidential learning with language modelling for TTS), a novel continuous-valued AR framework that directly predicts mel-spectrograms from textual input. BELLE treats each mel-spectrogram frame as a Gaussian distribution sampled from a learned hyper-distribution, enabling principled uncertainty estimation, particularly in scenarios with parallel data (i.e., one text-audio prompt paired with multiple speech samples). To obtain such data, diverse speech samples are synthesized using multiple pre-trained TTS models given the same text-audio prompts, and these samples are distilled into BELLE via Bayesian evidential learning. Experimental results indicate that BELLE performs competitively with the current best open-source TTS models, even though BELLE is trained on a large amount of synthetic data and uses only approximately one-tenth of their training data. Audio samples generated by BELLE are available at https://belletts.github.io/Belle/. The code, checkpoints, and synthetic data will be released after the paper is accepted.
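
The abstract does not name the hyper-distribution, so the following is only a minimal sketch of how "a Gaussian sampled from a learned hyper-distribution" might look for a single mel bin. It assumes a Normal-Inverse-Gamma (NIG) prior over the Gaussian's mean and variance, as in standard deep evidential regression; that choice, the parameter names (gamma, nu, alpha, beta), and all function names are assumptions for illustration, not the authors' implementation. Averaging the loss over several teacher outputs for the same prompt is one plausible reading of the multi-teacher setup.

```python
import math
import random


def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of y under the Student-t marginal of a
    Normal-Inverse-Gamma prior (assumed hyper-distribution).

    gamma: predicted mean, nu > 0: evidence for the mean,
    alpha > 0 and beta > 0: shape and scale of the variance prior.
    """
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))


def multi_teacher_nll(teacher_values, gamma, nu, alpha, beta):
    """Average the NLL over the values that several teacher TTS models
    produced for the same mel bin (hypothetical distillation target)."""
    losses = [nig_nll(y, gamma, nu, alpha, beta) for y in teacher_values]
    return sum(losses) / len(losses)


def sample_mel_value(gamma, nu, alpha, beta, rng):
    """Draw a Gaussian (mu, sigma^2) from the NIG prior, then draw one
    mel value from that Gaussian."""
    # sigma^2 ~ InvGamma(alpha, beta): reciprocal of Gamma(shape=alpha, scale=1/beta)
    sigma2 = 1.0 / rng.gammavariate(alpha, 1.0 / beta)
    mu = rng.gauss(gamma, math.sqrt(sigma2 / nu))
    return rng.gauss(mu, math.sqrt(sigma2))


if __name__ == "__main__":
    rng = random.Random(0)
    teachers = [0.9, 1.1, 1.3]  # illustrative values, not real data
    print(multi_teacher_nll(teachers, gamma=1.0, nu=1.0, alpha=2.0, beta=1.0))
    print(sample_mel_value(1.0, 1.0, 2.0, 1.0, rng))
```

Under this assumption, teacher outputs that disagree widely on a bin push the fitted prior toward higher variance, which is one way the model's uncertainty could reflect teacher disagreement.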

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5620
