# UniSpeech

This is the official implementation of the paper "[UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597)". The implementation is mainly based on the [fairseq](https://github.com/pytorch/fairseq) codebase. We release the training recipes for the CommonVoice dataset.

## Requirements and Installation

 - PyTorch >= 1.6.0
 - Python >= 3.6
 ``` bash
 cd src
 pip install soundfile
 pip install librosa
 pip install pydub
 pip install --editable ./
 ```
## Data Preparation
Download the pretraining audio data from [here](https://commonvoice.mozilla.org/datasets) (we use the June 2020 release in our paper).
Get the wav list and the transcriptions for each dataset by running:
```
python examples/unispeech/unispeech_manifest.py input_meta_file --dest examples/unispeech/data/LANG 
```

Then convert the CommonVoice audio files to 16 kHz using the command:
```
python examples/unispeech/adjust_sample_rate.py --wav-path /path/to/wav/ --dest-path /path/to/16kwav/ --input examples/unispeech/data/LANG/*.tsv --output examples/unispeech/data/LANG/*_16k.tsv
```
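As an optional sanity check after the conversion, the output directory can be scanned for files that are not at 16 kHz. The following is a minimal sketch, not part of this repository: the helper name `find_wrong_rate` is hypothetical, and it assumes the converted files are plain PCM `.wav` files readable by Python's standard `wave` module.

```python
import wave
from pathlib import Path

TARGET_RATE = 16000  # 16 kHz, the rate targeted by the conversion step


def find_wrong_rate(wav_dir, target_rate=TARGET_RATE):
    """Return (path, rate) pairs for .wav files whose sample rate differs.

    Hypothetical helper for illustration; assumes PCM WAV files that the
    standard-library `wave` module can open.
    """
    mismatches = []
    # Walk the directory recursively so nested language folders are covered.
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as handle:
            rate = handle.getframerate()
        if rate != target_rate:
            mismatches.append((str(path), rate))
    return mismatches
```

Calling `find_wrong_rate("/path/to/16kwav/")` should return an empty list if every file was converted.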
For the finetuning data, our train/val/test splits follow [this](https://dl.fbaipublicfiles.com/cpc_audio/common_voices_splits.tar.gz).
The phoneme transcriptions are generated from the texts with [phonemizer](https://github.com/bootphon/phonemizer). We then create .id files using different vocabularies. All our pre-processed data as well as the dictionaries can be downloaded from [here]. 
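
To illustrate the idea of turning phoneme transcriptions into ids, here is a rough sketch. It is not the script used to produce the released data: the function `encode_phonemes`, the toy vocabulary, and the assumed formats (a JSON object mapping each phoneme to an integer, and one line of space-separated ids per utterance) are assumptions made for illustration only.

```python
import json


def encode_phonemes(lines, vocab):
    """Map whitespace-separated phoneme strings to space-separated ids.

    Hypothetical helper; raises KeyError if a phoneme is missing from vocab.
    """
    encoded = []
    for line in lines:
        ids = [vocab[token] for token in line.split()]
        encoded.append(" ".join(str(i) for i in ids))
    return encoded


# Toy vocabulary for illustration; a real one would be read with json.load().
toy_vocab = json.loads('{"h": 0, "ə": 1, "l": 2, "oʊ": 3}')
print(encode_phonemes(["h ə l oʊ"], toy_vocab))  # prints ['0 1 2 3']
```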

## Pretraining

We give training examples for the large model here.
### Stage 1. Pretraining UniSpeech with labeled data.
The following script can be used to pre-train an English model:
```
bash examples/unispeech/scripts/one2one_large_pretrain_en1350.sh
```
To train a multilingual model:
```
bash examples/unispeech/scripts/multilingual_large_pretrain.sh
```

### Stage 2. Continue pre-training with low-resource unlabeled data. (Optional)
After stage 1, you can continue pre-training the UniSpeech model with only the contrastive loss:
```
bash examples/unispeech/scripts/continue_pretran.sh
```

### Stage 3. Finetuning with low-resource labeled data.
Finally, fine-tune the model with 1 hour of labeled data.
For multilingual models, you can choose to use a separate vocabulary (examples/unispeech/data/en/vocab_sep.json) or a shared vocabulary (examples/unispeech/data/en/vocab_share.json):
```
bash examples/unispeech/scripts/finetune.sh
```


