# EmoSpeech
Official implementation of EmoSpeech 🤪 SSW12


## How to run

### Build env

You can build an environment with `Docker` or `Conda`.

#### To set up environment with Docker

If you don't have Docker installed, please follow the links to find installation instructions for [Ubuntu](https://docs.docker.com/desktop/install/linux-install/), [Mac](https://docs.docker.com/desktop/install/mac-install/) or [Windows](https://docs.docker.com/desktop/install/windows-install/).

      bash run_docker.sh
      
#### To set up environment with Conda

If you don't have Conda installed, please find the installation instructions for your OS [here](https://docs.conda.io/en/latest/miniconda.html).

      conda create -n etts python=3.8
      conda activate etts
      pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116  
      pip3 install -r requirements.txt

If your machine has a different version of CUDA, you can find the matching PyTorch installation command [here](https://pytorch.org/get-started/previous-versions/).
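
After installing, a quick check along the following lines can confirm that PyTorch is importable and show which CUDA version the installed build targets. This is a minimal sketch: the helper name `describe_torch_install` is hypothetical and not part of this repository. It relies only on `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch_install():
    """Return a short, human-readable summary of the local PyTorch install."""
    # Avoid a hard failure if the environment has not been set up yet.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch

    # torch.version.cuda is None for CPU-only builds.
    return (
        f"torch {torch.__version__}, "
        f"built for CUDA {torch.version.cuda}, "
        f"GPU available: {torch.cuda.is_available()}"
    )


if __name__ == "__main__":
    print(describe_torch_install())
```

If the reported CUDA version does not match the one on your machine, reinstalling PyTorch with the appropriate index URL is likely the simplest fix.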


### Download and preprocess data
We used data from the 10 English speakers in the [ESD dataset](https://github.com/HLTSingapore/Emotional-Speech-Data). To download all `.wav` and `.txt` files, along with the `.TextGrid` files created using [MFA](https://github.com/MontrealCorpusTools/mfa-models), run:

      bash download_data.sh
 
Training a model requires precomputed durations, energy, pitch and eGeMAPS features. From the `src` directory, run:

      python3 -m preprocess
      
This is what your data folder should look like:


      .
      └── data
          ├── ssw_esd
          ├── test_ids.txt
          ├── val_ids.txt
          └── preprocessed
              ├── duration
              ├── egemap
              ├── energy
              ├── mel
              ├── phones.json
              ├── pitch
              ├── stats.json
              ├── test.txt
              ├── train.txt
              ├── trimmed_wav
              └── val.txt
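
To check the result of preprocessing against the layout above, a small script like the one below can list any missing entries. It is a sketch under stated assumptions: the function name `missing_preprocessed_entries` is hypothetical and not part of this repository, the `preprocessed` folder is assumed to sit inside `data`, and the script is assumed to run from the directory that contains `data`. It only tests that each entry exists, not what it contains.

```python
from pathlib import Path

# Entry names taken from the folder tree above.
EXPECTED_ENTRIES = [
    "duration",
    "egemap",
    "energy",
    "mel",
    "phones.json",
    "pitch",
    "stats.json",
    "test.txt",
    "train.txt",
    "trimmed_wav",
    "val.txt",
]


def missing_preprocessed_entries(preprocessed_dir):
    """Return the expected entries that are absent from `preprocessed_dir`."""
    root = Path(preprocessed_dir)
    return sorted(name for name in EXPECTED_ENTRIES if not (root / name).exists())


if __name__ == "__main__":
    # Assumption: `preprocessed` lives inside `data` in the current directory.
    missing = missing_preprocessed_entries(Path("data") / "preprocessed")
    print("missing entries:", missing if missing else "none")
```

An empty list means every entry from the tree is present.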
        

                
We are all set for model training 🎉
                
## References
1. [FastSpeech 2 - PyTorch Implementation](https://github.com/ming024/FastSpeech2)
2. [iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform](https://github.com/rishikksh20/iSTFTNet-pytorch)
3. [Publicly Available Emotional Speech Dataset (ESD) for Speech Synthesis and Voice Conversion](https://github.com/HLTSingapore/Emotional-Speech-Data)
4. [NISQA: Speech Quality and Naturalness Assessment](https://github.com/gabrielmittag/NISQA)
