![SpeechHub documentation](./SpeechHub_Main.png)

# TTS Hub Usage Guide

This repository contains usage tutorials and tools for our LoRA-Hub, which is built on two text-to-speech (TTS) models:

- **Parler-TTS**: Lightweight, high-quality text-to-speech model
- **VoiceLDM**: Diffusion-based speech generation system

## 📋 Table of Contents

- [Parler-TTS Usage Guide](#parler-tts-usage-guide)
  - [Model Introduction](#model-introduction)
  - [Basic Usage](#basic-usage)
- [VoiceLDM Usage Guide](#voiceldm-usage-guide)
  - [Model Introduction](#model-introduction-1)
  - [Basic Usage](#basic-usage-1)

---

## Parler-TTS Usage Guide

### Model Introduction

Parler-TTS is a lightweight text-to-speech model that generates high-quality, natural-sounding speech. It is open source; see the [GitHub repository](https://github.com/huggingface/parler-tts) for more information.

**Available Models:**

In our TTS-Hub, we use Parler-TTS Mini as the backbone. More information is available on [Hugging Face](https://huggingface.co/parler-tts/parler-tts-mini-v1).

- **Parler-TTS Mini**: 880M parameter model

### Basic Usage

#### Single LoRA Model Inference

```python
import os
import sys
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
from peft import PeftModel
import soundfile as sf
import torch

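# Device to run inference on; change this if no CUDA GPU is available.
device = "cuda:0"

# Base Parler-TTS checkpoint on the Hugging Face Hub.
base_model_name = "parler-tts/parler-tts-mini-v1"

# Path to the LoRA adapter to load.
peft_model_path = "SET YOUR PEFT MODEL PATH HERE"

# Natural-language description of the desired voice and speaking style.
description = "PLEASE WRITE YOUR DESCRIPTION HERE."

# Text to be spoken.
prompt = "PLEASE WRITE YOUR TRANSCRIPTION HERE."

adapter_name_a = "lora_a"

output_file = f"./{adapter_name_a}_single.wav"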

base_model = ParlerTTSForConditionalGeneration.from_pretrained(base_model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

os.makedirs(os.path.dirname(output_file), exist_ok=True)

tokenized_input = tokenizer(description, return_tensors="pt")
input_ids = tokenized_input.input_ids.to(device)
attention_mask = tokenized_input.attention_mask.to(device)

tokenized_prompt = tokenizer(prompt, return_tensors="pt")
prompt_input_ids = tokenized_prompt.input_ids.to(device)
prompt_attention_mask = tokenized_prompt.attention_mask.to(device)

gen_kwargs = {
    "do_sample": True,
    "temperature": 1.0,
}

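# Attach the LoRA adapter to the base model and make it the active adapter.
peft_model = PeftModel.from_pretrained(base_model, peft_model_path, adapter_name=adapter_name_a)
peft_model.set_adapter(adapter_name_a)

# Fold the LoRA weights into the base weights and drop the PEFT wrapper.
model = peft_model.merge_and_unload()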

print("Starting generation...")
with torch.no_grad():
    generation = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        prompt_input_ids=prompt_input_ids,
        prompt_attention_mask=prompt_attention_mask,
        **gen_kwargs
    )

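print("Generation finished.")

# Move the waveform to the CPU and drop singleton dimensions.
audio_arr = generation.cpu().numpy().squeeze()

# Write the audio to the output path defined above.
sf.write(output_file, audio_arr, model.config.sampling_rate)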
```

#### Multiple LoRA Model Fusion Inference

```python
import os
import sys
# Make the parent directory importable so that `parler_tts` can be found.
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
from peft import PeftModel
import soundfile as sf
import torch


device = "cuda:0"

base_model_name = "parler-tts/parler-tts-mini-v1"

prompt = "PLEASE WRITE YOUR TRANSCRIPTION HERE."

description = "PLEASE WRITE YOUR DESCRIPTION HERE."

lora_path_a = "SET YOUR PEFT MODEL A PATH HERE"
lora_path_b = "SET YOUR PEFT MODEL B PATH HERE"

adapter_name_a = "LORA_A"

adapter_name_b = "LORA_B"

output_file = f"./{adapter_name_a}_{adapter_name_b}_double.wav"

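# Sampling temperature used during generation.
temperature = 1.0

# Fraction of values kept by pruning-based combination types
# (for example "ties" or "dare_linear"); "cat" does not use it.
density = 0.5

# Relative weight of each adapter in the merged adapter.
adapter_weights = [0.5, 0.5]

# "cat" concatenates the LoRA matrices of both adapters along the rank dimension.
combination_type = "cat"

gen_kwargs = {
    "do_sample": True,
    "temperature": temperature,
}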


base_model = ParlerTTSForConditionalGeneration.from_pretrained(base_model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

os.makedirs(os.path.dirname(output_file), exist_ok=True)

tokenized_input = tokenizer(description, return_tensors="pt")
input_ids = tokenized_input.input_ids.to(device)
attention_mask = tokenized_input.attention_mask.to(device)

tokenized_prompt = tokenizer(prompt, return_tensors="pt")
prompt_input_ids = tokenized_prompt.input_ids.to(device)
prompt_attention_mask = tokenized_prompt.attention_mask.to(device)

# Load both LoRA adapters onto the same base model.
fused_model = PeftModel.from_pretrained(base_model, lora_path_a, adapter_name=adapter_name_a)
fused_model.load_adapter(lora_path_b, adapter_name=adapter_name_b)

# Combine the two adapters into a new adapter named "merge".
weighted_adapter_name = "merge"
fused_model.add_weighted_adapter(
    adapters=[adapter_name_a, adapter_name_b],
    weights=adapter_weights,
    density=density,
    adapter_name=weighted_adapter_name,
    combination_type=combination_type
)

# Activate the merged adapter, then fold it into the base weights.
fused_model.set_adapter(weighted_adapter_name)
fused_model = fused_model.merge_and_unload()

print("Starting generation...")
with torch.no_grad():
    fused_generation = fused_model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        prompt_input_ids=prompt_input_ids,
        prompt_attention_mask=prompt_attention_mask,
        **gen_kwargs
    )
print("Generation finished.")
fused_audio_arr = fused_generation.cpu().numpy().squeeze()
sf.write(output_file, fused_audio_arr, base_model.config.sampling_rate)

```
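
To build intuition for what a weighted merge with `combination_type = "cat"` computes, the sketch below reproduces the idea with plain Python lists, so it runs without PEFT or any other dependency. It assumes the usual LoRA parameterization, in which each adapter contributes a low-rank update `B @ A`, and it ignores the per-adapter LoRA scaling factor for brevity. The matrices, shapes, and helper functions are illustrative and are not taken from this repository or from PEFT's source.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def scale(X, w):
    """Multiply every entry of X by the scalar w."""
    return [[w * x for x in row] for row in X]

def add(X, Y):
    """Add two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Two toy adapters for a 2x3 weight matrix: rank 1 and rank 2.
A1, B1 = [[1, 2, 3]], [[1], [2]]                      # A1: (1, 3), B1: (2, 1)
A2, B2 = [[0, 1, 0], [1, 0, 1]], [[1, 0], [0, 2]]     # A2: (2, 3), B2: (2, 2)
weights = [0.5, 0.5]

# Reference: weighted sum of the individual low-rank updates.
delta_sum = add(scale(matmul(B1, A1), weights[0]),
                scale(matmul(B2, A2), weights[1]))

# "cat"-style merge: scale each A by its weight, stack the A matrices row-wise
# and place the B matrices side by side, so the merged rank is 1 + 2 = 3.
A_cat = scale(A1, weights[0]) + scale(A2, weights[1])   # (3, 3)
B_cat = [r1 + r2 for r1, r2 in zip(B1, B2)]             # (2, 3)
delta_cat = matmul(B_cat, A_cat)

print(delta_sum == delta_cat)
```

Under these assumptions the concatenated adapter yields exactly the same update as the weighted sum of the two adapters, at the cost of a merged rank equal to the sum of the input ranks.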


---

## VoiceLDM Usage Guide

### Model Introduction

VoiceLDM is a diffusion-based speech generation system that supports controlling speech generation through text descriptions and audio prompts. See the [GitHub repository](https://github.com/glory20h/VoiceLDM) for more information.

**Model Configurations:**

In our TTS-Hub, we use VoiceLDM-M as the backbone. More information is available on [Hugging Face](https://huggingface.co/cvssp/audioldm-m-full).

- **VoiceLDM-M**: 652M parameter model

### Basic Usage

#### Single LoRA Generation

```bash
python ./voiceldm/single_infer.py \
  --desc_prompt "A person is speaking with the England accent. Clean speech!" \
  --cont_prompt "Good morning! How are you doing today? Would you like a glass of water?" \
  --ckpt_path "PATH_TO_YOUR_VOICELDM_PATH" \
  --lora_path "PATH_TO_YOUR_LORA_PATH" \
  --output_dir "./output" \
  --file_name "england_lora.wav" \
  --trim_silence
```
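
When several utterances need to be generated with the same LoRA, it can be convenient to drive the script from Python. The following is a minimal sketch that only assembles the command line from the flags shown in the example above; the `build_command` helper, its default values, the second job, and the output file names are hypothetical and not part of this repository. The actual call is left commented out so the snippet can be inspected without a checkpoint.

```python
import shlex
import subprocess

def build_command(desc_prompt, cont_prompt, ckpt_path, lora_path,
                  output_dir="./output", file_name="output.wav", trim_silence=True):
    """Assemble the argument list for ./voiceldm/single_infer.py."""
    cmd = [
        "python", "./voiceldm/single_infer.py",
        "--desc_prompt", desc_prompt,
        "--cont_prompt", cont_prompt,
        "--ckpt_path", ckpt_path,
        "--lora_path", lora_path,
        "--output_dir", output_dir,
        "--file_name", file_name,
    ]
    if trim_silence:
        cmd.append("--trim_silence")
    return cmd

# Illustrative jobs: (output file name, text to be spoken).
jobs = [
    ("england_lora_1.wav", "Good morning! How are you doing today?"),
    ("england_lora_2.wav", "Would you like a glass of water?"),
]

for file_name, text in jobs:
    cmd = build_command(
        desc_prompt="A person is speaking with the England accent. Clean speech!",
        cont_prompt=text,
        ckpt_path="PATH_TO_YOUR_VOICELDM_PATH",
        lora_path="PATH_TO_YOUR_LORA_PATH",
        file_name=file_name,
    )
    print(shlex.join(cmd))  # show the shell-quoted command
    # subprocess.run(cmd, check=True)  # uncomment to actually run inference
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain quotes or exclamation marks.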

#### Multiple LoRA Model Fusion Inference

The multi-LoRA merge script can be used in a similar way.

## License

- Parler-TTS: Follows the original project license
- VoiceLDM: Follows the original project license

## Acknowledgments

Thanks to the contributors of the two open-source projects:
- [Parler-TTS](https://github.com/huggingface/parler-tts)
- [VoiceLDM](https://github.com/glory20h/VoiceLDM)