TTS-Hub: Leveraging Modular LoRAs and Arithmetic Composition for Controllable Text-to-Speech

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Controllable text-to-speech (TTS), modular Low-Rank Adaptation (LoRA), LoRA Composition
Abstract: Controllable text-to-speech (TTS) aims to generate speech from text while allowing control over prosodic and speaker-related attributes such as pitch, age, and accent. Existing controllable TTS methods primarily rely on natural language prompts to guide synthesis or on reference-audio cloning. However, prompt-based approaches often struggle with the cross-modal semantic gap between textual descriptions and the intended speech attributes, leading to imprecise, coarse-grained control. Cloning methods, conversely, depend heavily on reference audio and generalize poorly beyond the characteristics present in those samples, limiting flexibility. To overcome these challenges, this paper proposes TTS-Hub, a novel controllable TTS framework that employs modular Low-Rank Adaptation (LoRA) components and their arithmetic composition to achieve fine-grained, flexible control. Specifically, we construct a comprehensive Data Hub covering 6 high-level attribute categories and 32 fine-grained speech attributes. Leveraging this attribute-specific data, we fine-tune two mainstream TTS frameworks to obtain a corresponding LoRA Hub, in which each modular LoRA specializes in a single speech attribute. At inference time, TTS-Hub selects the required LoRA modules and combines them through simple arithmetic composition into a fused LoRA that simultaneously encodes multiple attribute representations, enabling flexible and extensible multi-attribute control without retraining the backbone. Extensive experiments show that individual LoRAs provide precise single-attribute control, while arithmetic composition yields flexible, interpretable multi-attribute speech and consistently outperforms prompt-based baselines. Code and data are available in the supplementary materials.
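The arithmetic composition described above can be illustrated with a minimal sketch: each attribute-specific LoRA contributes a low-rank weight delta, and the fused LoRA is a weighted sum of those deltas applied to the frozen backbone weight. The function names, the weighted-sum scheme, and the toy shapes below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def lora_delta(A, B, alpha=1.0):
    """Weight update contributed by one LoRA: alpha * (A @ B)."""
    return alpha * (A @ B)

def compose_loras(loras, weights):
    """Fuse attribute LoRAs by arithmetic (weighted-sum) composition.

    loras:   list of (A, B) low-rank factor pairs, one per attribute
    weights: per-attribute scaling coefficients
    """
    assert len(loras) == len(weights)
    return sum(w * lora_delta(A, B) for (A, B), w in zip(loras, weights))

# Toy example: two hypothetical rank-2 LoRAs ("pitch", "accent")
# adapting a 4x4 backbone weight matrix.
rng = np.random.default_rng(0)
pitch_lora = (rng.normal(size=(4, 2)), rng.normal(size=(2, 4)))
accent_lora = (rng.normal(size=(4, 2)), rng.normal(size=(2, 4)))

delta = compose_loras([pitch_lora, accent_lora], weights=[0.7, 0.3])
W_base = np.eye(4)            # stand-in for a frozen backbone weight
W_adapted = W_base + delta    # multi-attribute adaptation, no retraining
```

Because composition is a plain linear combination, adding or removing an attribute only changes the list of (A, B) pairs and their weights; the backbone weights are never touched, which is what makes the scheme extensible.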
Supplementary Material: zip
Primary Area: generative models
Submission Number: 16654