# Overview
We have included the following supplementary materials:

- Demonstration examples for UniTTS, including inference code usage and the complete training code for UniTTS.

- Demonstration examples for DistilCodec, along with inference code usage and the full training code for DistilCodec.

Note: 
- We will open-source all models and training code related to UniTTS and DistilCodec.

- Due to the size limit on supplementary materials, the trained models are not included.


# Supplementary of UniTTS

## Demos

Our model can generate audio that maintains the timbre of the reference audio while producing emotionally expressive output tailored to the context of the target sample. Here are some demos generated by UniTTS. 


| Ref Audio | Inference Text (Chinese / English) | Inference Audio |
|-----------|-----------------------|-----------------|
| <audio controls src="./UniTTS/demos/voice0/system_audio.wav"><a href="./UniTTS/demos/voice0/system_audio.wav">Download</a></audio> | 求求你…不要离开我，我真的好害怕…<br/>Please... don't leave me, I'm so scared... | <audio controls src="./UniTTS/demos/voice0/infer_0.wav"><a href="./UniTTS/demos/voice0/infer_0.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice1/system_audio.wav"><a href="./UniTTS/demos/voice1/system_audio.wav">Download</a></audio> | 天啊！这竟然是真的？我简直不敢相信！<br/>Oh my god! This is actually true? I can't believe it! | <audio controls src="./UniTTS/demos/voice1/infer_1_1.wav"><a href="./UniTTS/demos/voice1/infer_1_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice2/system_audio.wav"><a href="./UniTTS/demos/voice2/system_audio.wav">Download</a></audio> | 立刻停止你的行为！这是最后的警告！<br/>Cease your actions immediately! This is the final warning! | <audio controls src="./UniTTS/demos/voice2/infer_2_1.wav"><a href="./UniTTS/demos/voice2/infer_2_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice3/system_audio.wav"><a href="./UniTTS/demos/voice3/system_audio.wav">Download</a></audio> | 天啊！这绝对是我见过最不可思议的画面！<br/>Good heavens! This is by far the most incredible scene I've ever witnessed! | <audio controls src="./UniTTS/demos/voice3/infer_3_1.wav"><a href="./UniTTS/demos/voice3/infer_3_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice4/system_audio.wav"><a href="./UniTTS/demos/voice4/system_audio.wav">Download</a></audio> | 你怎么能这样对我？我简直无法忍受！<br/>How could you do this to me? I simply can't bear it! | <audio controls src="./UniTTS/demos/voice4/infer_4_1.wav"><a href="./UniTTS/demos/voice4/infer_4_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice5/system_audio.wav"><a href="./UniTTS/demos/voice5/system_audio.wav">Download</a></audio> | 今天的阳光真温暖，公园里的花开得特别灿烂！！<br/>The sunshine is so warm today, and the flowers in the park are blooming brilliantly! | <audio controls src="./UniTTS/demos/voice5/infer_5_1.wav"><a href="./UniTTS/demos/voice5/infer_5_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice6/system_audio.wav"><a href="./UniTTS/demos/voice6/system_audio.wav">Download</a></audio> | 可是，她有一个不太好看的孩子，这个孩子被送到了挖沟工人的老婆家里抚养。而安妮·莉斯贝自己呢，住进了伯爵的公馆。<br/>However, she had an unattractive child who was sent to be raised by the wife of a ditch digger. As for Anne Lisbeth herself, she moved into the count's mansion. | <audio controls src="./UniTTS/demos/voice6/infer_6_1.wav"><a href="./UniTTS/demos/voice6/infer_6_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice7/system_audio.wav"><a href="./UniTTS/demos/voice7/system_audio.wav">Download</a></audio> | 求求你…不要离开我，我真的好害怕…<br/>Please... don't leave me, I'm so scared... | <audio controls src="./UniTTS/demos/voice7/infer_7_1.wav"><a href="./UniTTS/demos/voice7/infer_7_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice8/system_audio.wav"><a href="./UniTTS/demos/voice8/system_audio.wav">Download</a></audio> | 当我看到那双眼睛时，仿佛整个宇宙都安静了下来。<br/>When I saw those eyes, it felt as if the entire universe fell silent. | <audio controls src="./UniTTS/demos/voice8/infer_8_1.wav"><a href="./UniTTS/demos/voice8/infer_8_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice9/system_audio.wav"><a href="./UniTTS/demos/voice9/system_audio.wav">Download</a></audio> | 听到这个消息，我的心一下子沉到了谷底。<br/>Upon hearing this news, my heart sank to the deepest abyss. | <audio controls src="./UniTTS/demos/voice9/infer_9_1.wav"><a href="./UniTTS/demos/voice9/infer_9_1.wav">Download</a></audio> |
| <audio controls src="./UniTTS/demos/voice10/system_audio.wav"><a href="./UniTTS/demos/voice10/system_audio.wav">Download</a></audio> | 当我看到那双眼睛时，仿佛整个宇宙都安静了下来。<br/>When I saw those eyes, it felt as if the entire universe fell silent. | <audio controls src="./UniTTS/demos/voice10/infer_10_1.wav"><a href="./UniTTS/demos/voice10/infer_10_1.wav">Download</a></audio> |




## Install
**Clone and Install**

- Installation environment
``` sh
conda create -n unitts -y python=3.10
conda activate unitts
pip install -r requirements.txt
```
**Training Usage**

We have open-sourced our three-stage training code, covering pre-training, SFT, and LPO. The training code is built on Pai-Megatron-Patch with additional optimizations. For instructions on pre-training and SFT, please refer to this [README](./UniTTS/pai-megatron-patch/examples/qwen2_5/README.md).


**Inference Usage**

Inference can be run directly with the following script:
``` sh
cd cli
sh run_evalation.sh
```
Alternatively, call the Python entry point directly:
``` sh
# Quote the variables so that texts and paths containing spaces stay intact.
python inference.py \
    --model_config "$model_config" \
    --ckpt_config "$ckpt_config" \
    --model_name "$model_path" \
    --output_dir "$output_dir" \
    --temperature "$temperature" \
    --top_p "$top_p" \
    --seed "$seed" \
    --text "$text" \
    --ref_text "$ref_text" \
    --ref_audio_path "$ref_audio_path"
```
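
When inference is driven from Python (for example, to loop over several texts), the same command can be assembled as an argument list. The sketch below is only an illustration and is not part of the released code: `build_inference_command` is a hypothetical helper, the values are placeholders, and it assumes `inference.py` accepts the flags shown above.

```python
import shlex
import sys


def build_inference_command(options, script="inference.py"):
    """Return the argv list for one inference run.

    `options` maps flag names (without leading dashes) to values.
    Hypothetical helper: only the flag names come from the command above.
    """
    command = [sys.executable, script]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    return command


# Placeholder values; the remaining flags are added in the same way.
example_options = {
    "output_dir": "/path/to/output_dir",
    "text": "Please... don't leave me, I'm so scared...",
    "ref_text": "transcript of the reference audio",
    "ref_audio_path": "/path/to/ref_audio.wav",
}

command = build_inference_command(example_options)
# Show the command as it would look when typed in a shell.
print(shlex.join(command[1:]))
```

Passing the list to `subprocess.run(command, check=True)` from inside the `cli` directory avoids shell quoting problems when the text contains spaces or punctuation.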

## References
The UniTTS model underwent a three-phase training paradigm consisting of pretraining, SFT, and LPO. Our training framework was developed through extensive customization of the open-source PAI-Megatron-Patch infrastructure. The training data was preprocessed with open-source speech processing tools, including FunASR and Whisper, which provided audio cleaning steps such as voice activity detection and silence removal to ensure data quality.

[1] [Pai-Megatron-Patch](https://github.com/alibaba/Pai-Megatron-Patch/tree/main)

[2] [FunASR](https://github.com/modelscope/FunASR)

[3] [Whisper](https://github.com/openai/whisper)


## Disclaimer

Our model provides zero-shot voice cloning services only for academic research purposes. We encourage the community to uphold safety and ethical principles in AI research and applications.

Important Notes:

- Compliance with the model's open-source license is mandatory.

- Unauthorized voice replication applications are strictly prohibited.

- Developers bear no responsibility for any misuse of this model.


# Supplementary of DistilCodec
## Demos of DistilCodec
The MOS evaluation dataset comprises original audio samples stored in the [Original Audios](./DistilCodec/data/org_audios/) directory and the corresponding samples reconstructed by DistilCodec in the [Reconstructed Audios](./DistilCodec/data/gen_audios) directory. Selected original and reconstructed pairs are listed below for comparison:

| Category        | Original Audio | Reconstructed Audio   |
|---------------------------|------------------|-------|
| Chinese Audio    |<audio controls src="./DistilCodec/data/org_audios/0b0c96e3-e2ae-45a3-9488-806cd719517b_0175.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/org_audios/0b0c96e3-e2ae-45a3-9488-806cd719517b_0175.wav">audio file</a>.</audio>|<audio controls src="./DistilCodec/data/gen_audios/0b0c96e3-e2ae-45a3-9488-806cd719517b_0175.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/gen_audios/0b0c96e3-e2ae-45a3-9488-806cd719517b_0175.wav">audio file</a>.</audio>|
| Chinese Audio    |<audio controls src="./DistilCodec/data/org_audios/0d28f03f-70c8-4180-ba1c-37b167aa9447_0074.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/org_audios/0d28f03f-70c8-4180-ba1c-37b167aa9447_0074.wav">audio file</a>.</audio>|<audio controls src="./DistilCodec/data/gen_audios/0d28f03f-70c8-4180-ba1c-37b167aa9447_0074.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/gen_audios/0d28f03f-70c8-4180-ba1c-37b167aa9447_0074.wav">audio file</a>.</audio>|
| Chinese Audio    |<audio controls src="./DistilCodec/data/org_audios/0eff38a1-3c9c-4a33-9be9-896614417d3f_0081.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/org_audios/0eff38a1-3c9c-4a33-9be9-896614417d3f_0081.wav">audio file</a>.</audio>|<audio controls src="./DistilCodec/data/gen_audios/0eff38a1-3c9c-4a33-9be9-896614417d3f_0081.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/gen_audios/0eff38a1-3c9c-4a33-9be9-896614417d3f_0081.wav">audio file</a>.</audio>|
| English Audio    |<audio controls src="./DistilCodec/data/org_audios/f0b1da30-ad19-4619-8aee-4b5c6d8c4acf_POD0000003287_S0000341.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/org_audios/f0b1da30-ad19-4619-8aee-4b5c6d8c4acf_POD0000003287_S0000341.wav">audio file</a>.</audio>|<audio controls src="./DistilCodec/data/gen_audios/f0b1da30-ad19-4619-8aee-4b5c6d8c4acf_POD0000003287_S0000341.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/gen_audios/f0b1da30-ad19-4619-8aee-4b5c6d8c4acf_POD0000003287_S0000341.wav">audio file</a>.</audio>|
| English Audio    |<audio controls src="./DistilCodec/data/org_audios/0016.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/org_audios/0016.wav">audio file</a>.</audio>|<audio controls src="./DistilCodec/data/gen_audios/0016.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/gen_audios/0016.wav">audio file</a>.</audio>|
| English Audio    |<audio controls src="./DistilCodec/data/org_audios/2f7f51c9-c514-4a23-8c31-d032c929df46_YOU0000006574_S0000379.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/org_audios/2f7f51c9-c514-4a23-8c31-d032c929df46_YOU0000006574_S0000379.wav">audio file</a>.</audio>|<audio controls src="./DistilCodec/data/gen_audios/2f7f51c9-c514-4a23-8c31-d032c929df46_YOU0000006574_S0000379.wav">Your browser does not support audio playback. Please download the <a href="./DistilCodec/data/gen_audios/2f7f51c9-c514-4a23-8c31-d032c929df46_YOU0000006574_S0000379.wav">audio file</a>.</audio>|

For additional comparative audio examples, please use our MOS evaluation tool:
```bash
python ./DistilCodec/codec_evaluation_gradio.py
```
Once the tool has launched, the interface shows two entries: Model1 is the original audio, and Model2 is the audio reconstructed by DistilCodec.
![DistilCodec MOS Tool](./DistilCodec/data/distilcodec_mos.png)

If you want to perform a benchmark evaluation on LibriSpeech-test, you can follow these steps:
- *Eval Config*: Modify the parameter values in [Eval Config](./DistilCodec/scripts/examples/evaluation/libri_test_clean.json), such as `filelist_path` and `save_dir`.
- *Eval Shell*: Modify the values of parameters in [Eval Shell](./DistilCodec/scripts/examples/evaluation/libri_test_clean_eval.sh).
- *Execute Shell*: Run the eval shell.
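
The config edit in the first step can also be scripted. The following is a minimal sketch, assuming that `filelist_path` and `save_dir` are top-level keys of the JSON file (the actual layout may differ); `update_json_config` is a hypothetical helper, not part of DistilCodec.

```python
import json
import tempfile
from pathlib import Path


def update_json_config(config_path, **overrides):
    """Load a JSON config, overwrite the given top-level keys, and save it back."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.update(overrides)
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8")
    return config


if __name__ == "__main__":
    # Demonstrate on a throwaway file so that the real config is not touched.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = Path(tmp_dir) / "libri_test_clean.json"
        demo_path.write_text(
            json.dumps({"filelist_path": "", "save_dir": ""}), encoding="utf-8"
        )
        updated = update_json_config(
            demo_path,
            filelist_path="/path/to/filelist",
            save_dir="/path/to/save_dir",
        )
        print(updated)
```

Keys that are not passed as overrides are written back unchanged.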

## Installation of DistilCodec
- *Step1*: Create a conda environment for DistilCodec.
```bash
conda create -n distilcodec python=3.10
conda activate distilcodec
```
- *Step2*: Install the requirements.
```bash
pip install -r requirements.txt
```

## Inference of DistilCodec

### Part1: Generate audio tokens from raw audio

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

# Load the codec from its config and checkpoint, then switch to eval mode.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
# If plus_llm_offset is True, the LLM vocabulary size is added to every audio
# token id. DistilCodec's default vocabulary size is taken from Qwen2.5-7B.
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    plus_llm_offset=True
)
print(audio_tokens)
```

### Part2: Reconstruct audio from raw audio
```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

# Load the codec from its config and checkpoint, then switch to eval mode.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
# If plus_llm_offset is True, the LLM vocabulary size is added to every audio
# token id. DistilCodec's default vocabulary size is taken from Qwen2.5-7B.
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    plus_llm_offset=True
)
print(audio_tokens)

# The generated audio is saved to f'{gen_audio_save_path}/{audio_name}.wav'.
gen_audio_save_path = '/path/to/audio_save_path'
audio_name = 'audio_name'
# minus_token_offset must be True whenever plus_llm_offset was True in
# demo_for_generate_audio_codes, so that the offset is removed before decoding.
y_gen = codec.decode_from_codes(
    audio_tokens,
    minus_token_offset=True
)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
```
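
The relationship between `plus_llm_offset` and `minus_token_offset` can be pictured as a plain shift of token ids, presumably so that audio tokens do not collide with the LLM's text token ids. The sketch below illustrates only that arithmetic and is not DistilCodec's implementation: `LLM_VOCAB_SIZE` is a made-up placeholder value, and both helper functions are hypothetical.

```python
# Placeholder offset for illustration only. The real value is whatever
# vocabulary size DistilCodec is configured with.
LLM_VOCAB_SIZE = 150_000


def add_llm_offset(codec_ids, offset=LLM_VOCAB_SIZE):
    """Shift codec token ids past the LLM vocabulary (plus_llm_offset=True)."""
    return [token_id + offset for token_id in codec_ids]


def remove_llm_offset(token_ids, offset=LLM_VOCAB_SIZE):
    """Undo the shift before decoding (minus_token_offset=True)."""
    return [token_id - offset for token_id in token_ids]


codec_ids = [0, 17, 4095]
shifted_ids = add_llm_offset(codec_ids)

# Decoding must see the original codec ids, so the offset has to be removed
# exactly once if and only if it was added.
assert remove_llm_offset(shifted_ids) == codec_ids
print(shifted_ids)
```

Decoding shifted ids without removing the offset would index far outside the codec's codebook, which is why the two flags have to be set consistently.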

## Training of DistilCodec

### Step1: Prepare train dataset
Prepare audio segments like the [Audio Examples for Training](./DistilCodec/data/training_data_demos/). The required audio settings are shown in the table below:

| Duration (s) | Sampling Rate (Hz) | Audio Category |
|-----------------------|---------------------|---------------|
| 2 ~ 10 | 24000 | Universal audio (speech, audiobooks, audio effects, etc.) |
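
Before training, it can help to verify that each clip matches these settings. The following is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed PCM WAV files only; `check_training_clip` is a hypothetical helper and not part of the DistilCodec code.

```python
import wave

TARGET_SAMPLE_RATE = 24000          # Hz, from the table above
MIN_SECONDS, MAX_SECONDS = 2.0, 10.0  # duration range from the table above


def check_training_clip(wav_path):
    """Return (is_valid, sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    is_valid = (
        sample_rate == TARGET_SAMPLE_RATE
        and MIN_SECONDS <= duration <= MAX_SECONDS
    )
    return is_valid, sample_rate, duration
```

For example, `check_training_clip('/path/to/clip.wav')` returns a tuple whose first element is `False` for clips that need resampling or re-segmenting.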

### Step2: Modifying configuration files
- *Train Config*: Modify the parameter values in [Train Config](./DistilCodec/scripts/examples/train/train_config.json), such as `batch_size`, `filelist_path`, and `save_dir`.
- *Model Config*: Modify the parameter values in [Model Config](./DistilCodec/scripts/examples/train/model_config.json).
- *Train Shell*: Modify the parameter values in [Train Shell](./DistilCodec/scripts/examples/evaluation/common_eval.json).

### Step3: Start training process
If you use Slurm, submit the training script as a job:
```bash
sbatch ./path/to/train.sh
```
If you don't use Slurm, run the training script directly:
```bash
sh ./path/to/train.sh
```
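
The choice between the two launch commands can be automated. This is a minimal sketch, assuming that finding `sbatch` on `PATH` is a sufficient signal that Slurm should be used; `build_train_command` is a hypothetical helper.

```python
import shutil


def build_train_command(script_path):
    """Use `sbatch` when it is available on PATH, otherwise run with `sh`."""
    launcher = "sbatch" if shutil.which("sbatch") else "sh"
    return [launcher, script_path]


print(build_train_command("./path/to/train.sh"))
```

The returned list can be passed to `subprocess.run(..., check=True)` to start training.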

## References
The overall training pipeline of DistilCodec draws inspiration from AcademiCodec, while its encoder and decoder design is adapted from fish-speech. The Vector Quantization (VQ) component implements GRFVQ using the vector-quantize-pytorch framework. These three exceptional works have provided invaluable assistance in our implementation of DistilCodec. Below are links to these reference projects:

[1] [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch)

[2] [AcademiCodec](https://github.com/moewiee/hificodec)

[3] [fish-speech](https://github.com/fishaudio/fish-speech)