When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

Anupam Purwar; Aditya Choudhary

Published: 28 Apr 2026, Last Modified: 28 Apr 2026. Venue: MSLD 2026 Poster. Readers: Everyone. License: CC BY 4.0

Keywords: LLM-based Text-to-Speech (TTS), LoRA fine-tuning, Qwen-0.5B backbone, acoustic token prediction, speaker similarity, DNS-MOS (OVRL), WADA-SNR, acoustic diversity and energy variability, GGUF quantization, multi-speaker fine-tuning

TL;DR: We fine-tune a Qwen-0.5B TTS backbone with LoRA to improve MOS, speaker similarity, and SNR. Gains depend on diverse training audio; with uniform data, artifacts can grow. Tuning the decoding settings and serving a GGUF-quantized model help achieve low-latency, stable voice cloning.

Abstract: Large language models are increasingly adopted as semantic backbones for neural text-to-speech (TTS) systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments on fine-tuning the language model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning improves on the non-fine-tuned base Qwen-0.5B model along three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to +0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity, indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with SNR increasing by as much as 34 percent. Crucially, these improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. In contrast, speakers trained on acoustically homogeneous data see limited gains or even perceptual degradation, even when voice similarity improves. This reveals that LoRA can faithfully clone speaker identity while also amplifying the noise characteristics and recording artifacts present in narrow training distributions. We further identify a loss-quality divergence phenomenon in which training and validation loss continue to improve during fine-tuning while perceptual quality degrades for low-variability speakers.
We also show that the optimal inference temperature of the language model backbone depends on training data variability: conservative sampling benefits low-variability speakers but degrades quality for high-variability ones. Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but also an effective mechanism for speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, while serving the model as a quantized GGUF file keeps latency low.
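
As an illustration of what "conservative sampling" means for the backbone, the sketch below applies standard temperature scaling to a set of next-token logits. It is a generic example, not the authors' decoding code: the logit values, the two temperatures, and the function name `softmax_with_temperature` are assumptions chosen for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over four candidate acoustic tokens (illustrative values).
logits = [2.0, 1.0, 0.5, 0.1]

# A lower temperature sharpens the distribution (more conservative sampling);
# a higher temperature flattens it (more varied sampling).
conservative = softmax_with_temperature(logits, 0.5)
exploratory = softmax_with_temperature(logits, 1.2)
```

With the lower temperature, more probability mass lands on the top-ranked token, so decoding stays close to the most likely acoustic tokens. The abstract reports that this helps low-variability speakers and hurts high-variability ones.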

Submission Number: 117