When Fine-Tuning Fails and When It Generalises: The Role of Data Diversity and Mixed Training in LLM-based TTS

06 Mar 2026 (modified: 16 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Large language models are increasingly adopted as semantic backbones for neural text-to-speech (TTS) systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments on fine-tuning the language-model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in voice-cloning tasks. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned Qwen-0.5B base model along three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to +0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity, indicating that LoRA effectively adapts speaker-identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with SNR increasing by as much as 34 percent. Crucially, these improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. In contrast, speakers trained on acoustically homogeneous data experience limited gains or perceptual degradation, even when voice similarity improves. This reveals that LoRA can faithfully clone speaker identity while also amplifying noise characteristics and recording artifacts present in narrow training distributions. We further identify a loss-quality divergence phenomenon in which training and validation loss continue to improve during fine-tuning while perceptual quality degrades for low-variability speakers. Additionally, we show that the optimal inference temperature of the language-model backbone depends on training-data variability, with conservative sampling benefiting low-variability speakers but degrading quality for high-variability ones. Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but an effective mechanism for speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, while achieving low latency when served as a quantized GGUF model.
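To make the adaptation mechanism concrete, below is a minimal sketch of how a LoRA adapter could be attached to a Qwen-0.5B causal-LM backbone using the Hugging Face `peft` library. The checkpoint name, rank, scaling factor, and target modules are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch: attaching a LoRA adapter to a Qwen-0.5B backbone with the
# Hugging Face `peft` library. All hyperparameters below are assumptions for
# illustration, not the paper's reported configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed checkpoint; the paper only states "Qwen-0.5B".
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # LoRA scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only adapter weights train; the frozen
                                    # base still provides the semantic backbone
```

Because only the low-rank adapter weights are updated, the frozen backbone's linguistic representations are preserved while speaker-specific acoustic behavior is learned, which is consistent with the paper's observation that voice similarity improves without degrading linguistic modeling.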
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=rre558DdVn
Changes Since Last Submission: The previous submission was rejected because it did not use the TMLR template. The template has now been updated to the one available at https://jmlr.org/tmlr/author-guide.html.
Assigned Action Editor: ~Hongyang_R._Zhang1
Submission Number: 7800