# HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis

![HALL-E Architecture](./assets/fig_halle_v2.png)

**HALL-E** is an LLM-based Text-to-Speech (TTS) model designed for generating long-form speech, built upon VALL-E ([Wang et al., 2023](https://arxiv.org/abs/2301.02111)).

## Abstract

TTS models that use large language models (LLMs) to translate natural language text into sequences of discrete audio tokens have attracted significant research attention, particularly since the development of Neural Audio Codec (NAC) models based on Residual Vector Quantization (RVQ). However, synthesizing long-form speech remains a substantial challenge: the high frame rate of these codecs lengthens the audio token sequences and complicates generation for autoregressive language models.

To address this challenge, HALL-E introduces two novel post-training approaches:

1. **Multi-Resolution Requantization (MReQ)**: A framework designed to reduce the frame rate of pre-trained NAC models. MReQ incorporates a Multi-Resolution Residual Vector Quantization (MRVQ) module that hierarchically reorganizes discrete audio tokens through teacher-student distillation.
2. **HALL-E**: An LLM-based TTS model specifically developed to predict the hierarchical tokens generated by MReQ. It leverages MRVQ sub-modules and continues training from a pre-trained LLM-based TTS model.

Additionally, to promote TTS research, we have created **MinutesSpeech**, a new benchmark dataset consisting of 40,000 hours of filtered speech data. This dataset facilitates the training and evaluation of speech synthesis systems for utterances ranging from 3 seconds up to 180 seconds.

In our experiments, we demonstrated the effectiveness of our approaches by applying the post-training framework to VALL-E. We successfully reduced the frame rate to as low as 8 Hz, enabling stable minute-long speech synthesis in a single inference step.
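
To make the effect of frame rate concrete, the sketch below counts how many tokens a single token stream contains for a clip of a given duration. The 75 Hz baseline is an assumption (the frame rate of the 24 kHz Encodec model), 8 Hz is the lowest rate reported above, and the function name is illustrative rather than part of this repository.

```python
import math


def tokens_per_stream(duration_sec: float, frame_rate_hz: float) -> int:
    """Number of discrete audio tokens in one stream at the given frame rate."""
    return math.ceil(duration_sec * frame_rate_hz)


# Assumed baseline: 75 Hz (24 kHz Encodec). 8 Hz is the lowest rate reported above.
for duration in (3, 60, 180):
    baseline = tokens_per_stream(duration, 75)
    reduced = tokens_per_stream(duration, 8)
    print(f"{duration:>3}s: {baseline:>6} tokens at 75 Hz vs {reduced:>5} at 8 Hz")
```

Under these assumptions, a 180-second utterance corresponds to 13,500 tokens at 75 Hz but only 1,440 tokens at 8 Hz.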

## Demo

Experience HALL-E in action! Visit our [Demo Page](https://anonymous.4open.science/w/halle_demo/) to see the capabilities of our model.

## Dataset: MinutesSpeech

We provide **MinutesSpeech**, the dataset proposed in our paper, for research purposes. You can download it from the following link:

- [MinutesSpeech Dataset](https://drive.google.com/drive/folders/1ccSlIt2hs8ea5K6FXoygYCKcdnhJafja?usp=drive_link)

**Note**:
- The `MinutesSpeech_train` subset does **not** include audio files; it contains only transcripts.
- If you wish to use the training data, please download the corresponding audio files yourself from [Podcast Index](https://podcastindex.org/).
- The database IDs published on Podcast Index match the filenames of each JSON transcript file we distribute.
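
Because the transcript filenames carry the Podcast Index database IDs, the IDs needed for downloading audio can be collected from the filenames alone. The following is a minimal sketch that assumes the transcripts sit in a single flat directory as `<id>.json`; the directory layout and the function name are assumptions, not something this repository provides.

```python
from pathlib import Path


def transcript_ids(transcript_dir: str) -> list[str]:
    """Return the IDs implied by transcript filenames (assumed layout: <id>.json)."""
    return sorted(p.stem for p in Path(transcript_dir).glob("*.json"))


# Example (hypothetical path):
# ids = transcript_ids("MinutesSpeech_train")
```

Each returned ID can then be looked up on Podcast Index to locate the corresponding audio.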

## Installation

This repository is based on [AudioCraft](https://github.com/facebookresearch/audiocraft) and requires the same environment: Python 3.10 and PyTorch 2.1.0. Follow the steps below to install this repository and its dependencies:

```shell
# It is recommended to use Miniconda
# Install PyTorch
python -m pip install 'torch==2.1.0'

# Install setuptools and wheel
python -m pip install setuptools wheel

# Clone this repository, then run the following from its root directory
python -m pip install -e .

# For ASR evaluation (if needed)
conda install -c conda-forge gcc=12.1.0 -y
python -m pip install Cython
python -m pip install 'huggingface-hub==0.23.2' 'nemo_toolkit[all]==1.23.0'
```

We also recommend having `ffmpeg` installed. You can install it via your system package manager or through Miniconda:

```bash
sudo apt-get install ffmpeg
# Or using Miniconda
conda install "ffmpeg<5" -c conda-forge
```
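
After installation, a quick check like the one below can confirm that the prerequisites named above are present. This is only a sketch: the function names are illustrative, and the check relies solely on the standard library, so it reports a missing PyTorch instead of failing on import.

```python
import shutil
import sys
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> dict[str, bool]:
    """Report whether each prerequisite from this section appears satisfied."""
    torch_version = installed_version("torch")
    return {
        "python_3_10": sys.version_info[:2] == (3, 10),
        # Local builds may carry a suffix such as "+cu121", so match the prefix.
        "torch_2_1_0": torch_version is not None and torch_version.startswith("2.1.0"),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```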

## Models

This repository provides the following models used in the paper:

### NAC Models
- [**Encodec**](./docs/ENCODEC.md) ([Défossez et al., 2022](https://arxiv.org/abs/2210.13438))
- [**SpeechTokenizer**](./docs/SPEECHTOKENIZER.md) ([Zhang et al., 2024](https://arxiv.org/abs/2308.16692))
- [**MReQ-Encodec**](./docs/MREQ_ENCODEC.md) (ours)
- [**MReQ-SpeechTokenizer**](./docs/MREQ_SPEECHTOKENIZER.md) (ours)

### LLM-based TTS Models
- [**VALL-E**](./docs/VALLE.md) ([Wang et al., 2023](https://arxiv.org/abs/2301.02111))
- [**HALL-E**](./docs/HALLE.md) (ours)

For more information on each model, check out their respective documentation (click the model name).

Additionally, the list of available pre-trained weights is as follows:


| Model | Pre-trained Weight |
| --- | --- |
| Encodec                 | [Link](https://drive.google.com/drive/folders/1LxOKlUaATOivg-ddCJluIvssUe4higHw?usp=sharing) |
| SpeechTokenizer         | [Link](https://drive.google.com/drive/folders/1jISsM2VLQqGvYEDkcM_6f6Ib4y0SrDKM?usp=sharing) |
| MReQ-Encodec            | [Link](https://drive.google.com/drive/folders/1HaGTLL6esn6DakCd9uGB2f7CrqmAj060?usp=sharing) |
| MReQ-SpeechTokenizer    | [Link](https://drive.google.com/drive/folders/109Y8iSPLwoPKH5xFWZA7kcrTQVjWGpGi?usp=sharing) |

| Model | NAC model | Training Dataset | Pre-trained Weight |
| --- | --- | --- | --- |
| VALL-E | Encodec | MinutesSpeech-28s | [Link](https://drive.google.com/drive/folders/1o-icFCz-0Bam_YR9LFup6W8PXP0R1xRj?usp=sharing) |
| VALL-E | Encodec | MinutesSpeech-54s | [Link](https://drive.google.com/drive/folders/1uOjRt_lrodxhX9t0syfT_wGN08_LYp64?usp=sharing) |
| VALL-E | Encodec | MinutesSpeech-90s | [Link](https://drive.google.com/drive/folders/1-z4YIXs47uUBfoBy14Ry-kx_0pRu39lZ?usp=sharing) |
| VALL-E | Encodec | MinutesSpeech-180s | [Link](https://drive.google.com/drive/folders/1ldBz83dGcS7-40T6ByhnNCZnxFlGQvCd?usp=sharing) |
| HALL-E | MReQ-Encodec | MinutesSpeech-90s | [Link](https://drive.google.com/drive/folders/1jEZdkCasbAK8IRQKbkSEkIQW_XGXcpGO?usp=sharing) |
| HALL-E | MReQ-Encodec | MinutesSpeech-180s | [Link](https://drive.google.com/drive/folders/1ksLYTvsZrUOM_s-7z7GugpkGvtUKhJSs?usp=sharing) |
| VALL-E | SpeechTokenizer | MinutesSpeech-28s | [Link](https://drive.google.com/drive/folders/1ACbOJyPoGIFxF_iqGF3ZOyRWSiueHNM1?usp=sharing) |
| HALL-E | MReQ-SpeechTokenizer | MinutesSpeech-90s | [Link](https://drive.google.com/drive/folders/10Y4-lxht-Qtzu1chrjDvDisvhvIlieDk?usp=sharing) |


## Preprocessing

To prepare your data for training, follow these steps:

1. **Split the Dataset**: Divide your dataset into training, validation, and testing sets.
2. **Create `data.jsonl`**:
   - For **NAC models** (when transcripts are not used):
     ```bash
     python -m audiocraft.data.audio_dataset \
             /path/to/dataset/split \
             egs/your_dataset_name/split/data.jsonl
     ```
   - For **TTS models** (when transcripts are required):
     ```bash
     python -m audiocraft.data.speech_dataset \
             /path/to/dataset/split \
             egs/your_dataset_name/split/data.jsonl \
             --min_utt_sec 3 --max_utt_sec 180 \
             --g2p_tokenizer bpe \
             --min_text_len 33 --max_text_len 1733 \
             --text_history_length 0 --save_g2p
     ```
3. **Finalize Preprocessing**: Ensure all preprocessing steps are completed.

For a more detailed explanation, please refer to the [Preprocessing Documentation](./docs/PREPROCESSING.md).
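
As a rough illustration of what the length-related flags in the TTS command express, the sketch below reads a JSON Lines manifest and applies the same bounds. It makes several assumptions that the source does not confirm: that the bounds are inclusive, that text length is counted in characters, and that entries carry `duration` and `text` fields. The helper names are hypothetical and this is not the repository's implementation.

```python
import json


def load_manifest(path: str) -> list[dict]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def keep_utterance(
    duration_sec: float,
    text: str,
    min_utt_sec: float = 3.0,
    max_utt_sec: float = 180.0,
    min_text_len: int = 33,
    max_text_len: int = 1733,
) -> bool:
    """Assumed semantics: inclusive bounds on duration (s) and text length (chars)."""
    duration_ok = min_utt_sec <= duration_sec <= max_utt_sec
    text_ok = min_text_len <= len(text) <= max_text_len
    return duration_ok and text_ok


# Example (hypothetical path and field names):
# entries = load_manifest("egs/your_dataset_name/split/data.jsonl")
# kept = [e for e in entries if keep_utterance(e["duration"], e["text"])]
```

Inspecting a generated `data.jsonl` this way can help confirm that the chosen bounds retain the expected number of utterances.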

## Inference

The `scripts` directory contains inference scripts for the models listed above. You can run them with either the pre-trained weights or models you have trained yourself.

**Using Your Own Trained Models**:
- If you utilize your own trained models, you need to convert the saved checkpoints into the inference model format.
- For detailed instructions, refer to the [Inference Documentation](./docs/INFERENCE.md).

## Training

Training your own models relies on AudioCraft's training pipelines. Below is a high-level overview of the process:

### Steps to Train a Model:

1. **Create Team Configuration**: Specify the location for logs and weights during training. An example configuration can be found at `config/teams/example.yaml`.
2. **Create Dataset Configuration**: Define the path to the `data.jsonl` created during preprocessing. An example can be found at `config/dset/audio/speech_podcast_wav.yaml`.
3. **Create Solver Configuration**: Configure the training hyperparameters. An example is provided in `config/solver/compression/encodec_valle_24khz.yaml`.
4. **Run the Training Script**: Use [Dora](https://github.com/facebookresearch/dora) to start the training process. An example script is `run.sh`.

For a detailed guide, visit the [Training Documentation](./docs/TRAINING.md).

## Citation

If you use HALL-E in your research, please cite the following paper:

```bibtex
WIP
```

## License

- **Code**: Released under the [MIT License](./LICENSE).
- **Model Weights**: Released under the [CC-BY-NC 4.0 License](./LICENSE_weights).
- **Dataset**: Released under the [CC BY-NC-ND 4.0 License](./LICENSE_dataset).

The files `audiocraft/modules/tokenizer.py` and `audiocraft/modules/tokenizer.json` in this repository are derived from the code in the [coqui-ai/TTS](https://github.com/coqui-ai/TTS) repository. This code is licensed under the [Mozilla Public License Version 2.0](https://www.mozilla.org/en-US/MPL/2.0/). Please refer to the `LICENSE_MPL-2.0` file for more details.

## Acknowledgements

HALL-E is built upon the [AudioCraft](https://github.com/facebookresearch/audiocraft) and [coqui-ai/TTS](https://github.com/coqui-ai/TTS) repositories. We extend our gratitude to the contributors of AudioCraft and coqui-ai/TTS for their foundational work.

---

*For any questions or support, please open an issue on our [GitHub repository](https://github.com/your-repo/hall-e) or contact us at [your-email@example.com](mailto:your-email@example.com).*

---
