{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2019 NVIDIA Corporation. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "# ============================================================================== "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"img/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
    "\n",
    "# Generate Natural Sounding Speech From Text In Real Time Using Tacotron 2 And WaveGlow v1.6 For PyTorch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model overview\n",
    "\n",
    "This text-to-speech (TTS) system is a combination of two neural network models:\n",
    "  * a modified Tacotron 2 model from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper (https://arxiv.org/abs/1712.05884) and\n",
    "  * a flow-based neural network model from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper (https://arxiv.org/abs/1811.00002).\n",
    "\n",
    "The Tacotron 2 and WaveGlow model form a text-to-speech system that enables user to synthesise a natural sounding speech from raw transcripts without any additional prosody information.\n",
    "\n",
    "Our implementation of Tacotron 2 model differs from the model described in the paper. Our implementation uses Dropout instead of Zoneout to regularize the LSTM layers. Also, the original text-to-speech system proposed in the paper used the WaveNet model to synthesize waveforms. In our implementation, we use the WaveGlow model for this purpose.\n",
    "\n",
    "Both models are based on implementations of NVIDIA GitHub repositories Tacotron 2 and WaveGlow, and are trained on a publicly available LJ Speech dataset (https://keithito.com/LJ-Speech-Dataset/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Requirements\n",
    "\n",
    "Please see 'notebooks/README.md'."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quick Start Guide\n",
    "\n",
    "To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the default parameters of the Tacrotron 2 and WaveGlow model on the LJ Speech dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Download and preprocess the dataset:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the ./scripts/prepare_dataset.sh download script to automatically download and preprocess the training, validation and test datasets. To run this script, issue:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!bash scripts/prepare_dataset.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data is downloaded to the ./LJSpeech-1.1 directory (on the host). The ./LJSpeech-1.1 directory is mounted to the /workspace/tacotron2/LJSpeech-1.1 location in the NGC container. The preprocessed mel-spectrograms are stored in the ./LJSpeech-1.1/mels directory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Build the Tacotron 2 and WaveGlow PyTorch NGC container:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!docker build . --rm -t tacotron2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#check the container image that you just built \n",
    "!docker images"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Start a detached session in the NGC container to run training and inference:\n",
    "\n",
    "After you build the container image, you can start to run your container, using either single GPU or multiple GPUs, by setting the NV_GPU variable at the Docker container launch. For your reference, you can look into docker commands and options at: https://docs.docker.com/engine/reference/commandline/docker/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#for single GPU, specify your GPU to run the container\n",
    "!NV_GPU=2 nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it -d --rm --name \"myTacotron2\" --ipc=host -v $PWD:/workspace/tacotron2/ tacotron2 bash"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#for multiple GPU\n",
    "!nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it -d --rm --name \"myTacotron2\" --ipc=host -v $PWD:/workspace/tacotron2/ tacotron2 bash"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#check the container that you just started\n",
    "!docker ps -a"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To preprocess the datasets for Tacotron 2 training, use the scripts/prepare_mels.sh script:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-docker exec -it myTacotron2 bash scripts/prepare_mels.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The preprocessed mel-spectrograms are stored in the ./LJSpeech-1.1/mels directory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Start training:\n",
    "The Tacotron2 and WaveGlow models are trained separately and independently. Both models obtain mel spectrograms from short time Fourier transform (STFT) during training. These mel spectrograms are used for loss computation in case of Tacotron 2 and as conditioning input to the network in case of WaveGlow.\n",
    "\n",
    "The training loss is averaged over an entire training epoch, whereas the validation loss is averaged over the validation dataset. Performance is reported in total input tokens per second for the Tacotron 2 model, and in total output samples per second for the WaveGlow model. Both measures are recorded as train_iter_items/sec (after each iteration) and train_epoch_items/sec (averaged over epoch) in the output log. The result is averaged over an entire training epoch and summed over all GPUs that were included in the training.\n",
    "\n",
    "By default, the train_tacotron2.sh and train_waveglow.sh scripts will launch mixed precision training with tensor cores. You can change this behaviour by removing the --amp flag from the train.py script.\n",
    "\n",
    "To run Tacotron 2 training:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#For single GPU \n",
    "!nvidia-docker exec -it myTacotron2 python train.py -m Tacotron2 -o output/ -lr 1e-3 --epochs 1500 -bs 80 --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file output/nvlog.json --anneal-steps 500 1000 1500 --anneal-factor 0.1 --amp "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#For multiple GPUs\n",
    "!nvidia-docker exec -it myTacotron2 python -m multiproc train.py -m Tacotron2 -o output/ -lr 1e-3 --epochs 1500 -bs 80 --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled. I a --log-file output/nvlog.json --anneal-steps 500 1000 1500 --anneal-factor 0.1 --amp "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run WaveGlow training:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#For single GPU\n",
    "!nvidia-docker exec -it myTacotron2 python train.py -m WaveGlow -o output/ -lr 1e-4 --epochs 1000 -bs 10 --segment-length  8000 --weight-decay 0 --grad-clip-thresh 65504.0 --epochs-per-checkpoint 50 --cudnn-enabled --cudnn-benchmark --log-file output/nvlog.json --amp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#For multiple GPUs\n",
    "!nvidia-docker exec -it myTacotron2 python -m multiproc train.py -m WaveGlow -o output/ -lr 1e-4 --epochs 1000 -bs 10 --segment-length  8000 --weight-decay 0 --grad-clip-thresh 65504.0 --epochs-per-checkpoint 50 --cudnn-enabled --cudnn-benchmark --log-file output/nvlog.json --amp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Start validation/evaluation.\n",
    "\n",
    "Ensure your loss values are comparable to those listed in the table below:\n",
    "\n",
    "\n",
    "| Loss (Model/Epoch) |       1 |     250 |     500 |     750 |    1000 |\n",
    "| :----------------: | ------: | ------: | ------: | ------: | ------: |\n",
    "| **Tacotron 2 mixed precision** | 13.0732 |   0.5736 |  0.4408 |  0.3923 |  0.3735 |\n",
    "| **Tacotron 2 FP32** |  8.5776 |  0.4807 |  0.3875 |  0.3421 |  0.3308 |\n",
    "| **WaveGlow mixed precision**  | -2.2054 | -5.7602 |  -5.901 | -5.9706 | -6.0258 |\n",
    "| **WaveGlow FP32**  | -3.0327 |  -5.858 | -6.0056 | -6.0613 | -6.1087 |\n",
    "\n",
    "For both models, the loss values are stored in the /output/nvlog.json log file.\n",
    "\n",
    "After you have trained the Tacotron 2 model for 1500 epochs and the WaveGlow model for 1000 epochs, you should get audio results similar to the samples in the /audio folder. \n",
    "\n",
    "For details about generating audio, see the Inference section below.\n",
    "\n",
    "The training scripts automatically run the validation after each training epoch. The results from the validation are printed to the standard output (stdout) and saved to the log files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#download pre-trained Tacotron2 FP32 model\n",
    "!curl -L https://developer.nvidia.com/joc-tacotron2-fp32-pyt-20190306 > JoC_Tacotron2_FP32_PyT_20190306"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "#download pre-trained Tacotron2 FP16 model\n",
    "!curl -L https://developer.nvidia.com/joc-tacotron2-fp16-pyt-20190306 > JoC_Tacotron2_FP16_PyT_20190306"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#download pre-trained WaveGlow FP32 model\n",
    "!curl -L https://developer.nvidia.com/joc-waveglow-fp32-pyt-20190306 > JoC_WaveGlow_FP32_PyT_20190306"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#download pre-trained WaveGlow FP16 model\n",
    "!curl -L https://developer.nvidia.com/joc-waveglow-fp16-pyt-20190306 > JoC_WaveGlow_FP16_PyT_20190306"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Start inference:\n",
    "\n",
    "After you have trained your Tacotron 2 and WaveGlow models, or downloaded the pre-trained checkpoints for the respective models, you can perform inference which takes text as input, and produces an audio file. If you use pre-trained checkpoints instead of training, please create output folder as below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You need to create the input file with some text and put it in the current directory, or just input the text in the below cell: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile text.txt\n",
    "William Shakespeare was an English poet, playwright and actor,widely regarded as the greatest writer in the English language and the world's greatest dramatist. He is often called England's national poet and the \"Bard of Avon\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's verify that the file has been actually written:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls -l text.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can run inference using the inference.py script, the respective checkpoints that are passed as --tacotron2 and --waveglow arguments, Tacotron2_checkpoint and WaveGlow_checkpoint are pre-trained checkpoints for the respective models.\n",
    "\n",
    "You can customize the content of the text file, depending on its length, you may need to increase the --max-decoder-steps option to 2000*. \n",
    "\n",
    "*The Tacotron 2 model was trained on LJSpeech dataset with audio samples no longer than 10 seconds, which corresponds to about 860 mel-spectrograms. Therefore the inference is expected to work well with generating audio samples of similar length. We set the mel-spectrogram length limit to 2000 (about 23 seconds), since in practice it still produces correct voice. If needed, users can split longer phrases into multiple sentences and synthesize them separately.\n",
    "\n",
    "The speech is generated from text file passed with -i argument. \n",
    "\n",
    "The output audio will be stored in the path specified by -o argument.\n",
    "\n",
    "To run inference in mixed precision, use --fp16 flag:   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-docker exec -it myTacotron2 python inference.py --tacotron2 JoC_Tacotron2_FP16_PyT_20190306 --max-decoder-steps 2000 --waveglow JoC_WaveGlow_FP32_PyT_20190306 -o output/ --include-warmup -i text.txt --fp16"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can open the output audio file and start listening:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#output audio file using mixed precision\n",
    "import IPython.display as ipd\n",
    "ipd.Audio('./output/audio_0.wav', rate=22050)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run inference using FP32, simply remove --fp16 flag: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-docker exec -it myTacotron2 python inference.py --tacotron2 JoC_Tacotron2_FP16_PyT_20190306 --max-decoder-steps 2000 --waveglow JoC_WaveGlow_FP32_PyT_20190306 -o output/fp32 --include-warmup -i text.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can compare the quality of the generated audio files, using mixed precision and FP32, and you will find out that using mixed precision, you can generate natual sounding and high quality audio without noise in real time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#output audio file using FP32\n",
    "import IPython.display as ipd\n",
    "ipd.Audio('./output/fp32audio_0.wav', rate=22050)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#stop your container in the end\n",
    "!docker stop myTacotron2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. Next Step:\n",
    "\n",
    "Now you have learnt how to generate high quality audio from text using Tacotron 2 and WaveGlow 1.6, you can experiment with more input texts, or change the hyperparameters of the models, such as epoch number, learning rate, different precisions, etc, to see if they could improve the training and inference results.\n",
    "\n",
    "If you are interested in learning more, please check our Github repositories to find more examples (https://github.com/NVIDIA/DeepLearningExamples). "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
