# DASpeech

This is the PyTorch implementation of the paper `DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation`.

**Abstract**: DASpeech is a non-autoregressive direct S2ST model which realizes both *fast* and *high-quality* S2ST. To better capture the multimodal distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To consider all potential paths in the DAG during training, we calculate the expected hidden states for each target token via dynamic programming, and feed them into the acoustic decoder to predict the target mel-spectrogram. During inference, we select the most probable path and take hidden states on that path as input to the acoustic decoder. DASpeech successfully achieves both high-quality translations and fast decoding speeds for S2ST.

**Audio samples are available at [https://daspeech.github.io/](https://daspeech.github.io/)**.

![daspeech](assets/daspeech.png)



## Requirements & Installation

- python >= 3.8
- pytorch == 1.13.1 (cuda == 11.3)
- torchaudio == 0.13.1
- gcc >= 7.0.0
- Install fairseq via `pip install -e fairseq/`.



## Data Preparation

1. Download the [CVSS Dataset](https://github.com/google-research-datasets/cvss).
2. Extract the mel-filterbank features of the source speech.
3. Perform forced alignment on the target speech using [Montreal Forced Aligner](https://mfa-models.readthedocs.io/en/latest/#). The alignment results provide the target phoneme sequence and the duration of each phoneme.
4. Extract the mel-spectrogram, as well as the pitch and energy information of the target speech.
5. Format the data as a tab-separated manifest with one utterance per row, as follows:


```
id	src_audio	src_n_frames	tgt_text	tgt_audio	tgt_n_frames	duration	pitch	energy
common_voice_fr_17732749	data/cvss-c/x-en/src_fbank80.zip:152564:126208	394	M AE1 D AH0 M spn DH AH0 B EH1 R AH0 N IH0 S	data/cvss-c/fr-en/tts/logmelspec80.zip:9649913382:42048	131	6 10 3 6 7 41 1 4 6 8 6 4 4 8 17	data/cvss-c/fr-en/tts/pitch.zip:56518652:248	data/cvss-c/fr-en/tts/energy.zip:39972962:188
common_voice_fr_17732750	data/cvss-c/x-en/src_fbank80.zip:112447740355:226048	706	Y UW1 N OW1 EH1 Z W EH1 L EH1 Z AY1 D UW1 DH AE1 T M EH1 N IY0 N UW1 M AA1 L AH0 K Y UW2 L Z AE1 V AH0 N F AO1 R CH AH0 N AH0 T L IY0 B IH1 N D IH2 S AH0 P OY1 N T IH0 NG	data/cvss-c/fr-en/tts/logmelspec80.zip:9565416692:131328	410	10 4 6 15 6 7 5 6 7 5 6 14 8 20 12 12 4 15 6 3 8 5 34 13 5 3 3 7 4 2 3 3 3 3 2 6 11 4 6 9 4 3 5 3 4 7 7 5 5 5 3 6 5 8 9 4 5 7 10	data/cvss-c/fr-en/tts/pitch.zip:56028170:600	data/cvss-c/fr-en/tts/energy.zip:39626736:364
common_voice_fr_17732751	data/cvss-c/x-en/src_fbank80.zip:117843863169:129408	404	OW1 B IH0 K AH1 Z N AW1 HH W EH1 N W IY1 T AO1 K AH0 B AW1 T D R IH1 NG K IH0 NG AY1 L IY1 V	data/cvss-c/fr-en/tts/logmelspec80.zip:6059194182:112128	350	31 24 5 8 10 8 6 31 26 5 4 5 4 6 11 10 7 3 5 9 5 6 3 5 4 7 7 11 41 10 22 11	data/cvss-c/fr-en/tts/pitch.zip:35504944:384	data/cvss-c/fr-en/tts/energy.zip:25110282:256
...
```
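As a sanity check on a finished manifest, the following is a minimal sketch and not part of this repository. It assumes two invariants that hold for the sample rows above but are not stated explicitly: the `duration` column has one entry per phoneme in `tgt_text`, and the durations sum to `tgt_n_frames`. It also assumes the audio columns follow a `path.zip:byte_offset:byte_length` layout. The helper names `parse_zip_pointer` and `check_manifest` are hypothetical.

```python
import csv


def parse_zip_pointer(pointer):
    """Split 'path.zip:offset:length' into (path, offset, length).

    Assumption: the two trailing fields are a byte offset and a byte length.
    """
    path, offset, length = pointer.rsplit(":", 2)
    return path, int(offset), int(length)


def check_manifest(lines):
    """Yield (id, problem) for rows that break the assumed invariants."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        phonemes = row["tgt_text"].split()
        durations = [int(d) for d in row["duration"].split()]
        if len(phonemes) != len(durations):
            yield row["id"], "phoneme/duration count mismatch"
        if sum(durations) != int(row["tgt_n_frames"]):
            yield row["id"], "durations do not sum to tgt_n_frames"


# First sample row from the manifest above, rebuilt with explicit tabs.
header = "id src_audio src_n_frames tgt_text tgt_audio tgt_n_frames duration pitch energy".split()
row = [
    "common_voice_fr_17732749",
    "data/cvss-c/x-en/src_fbank80.zip:152564:126208",
    "394",
    "M AE1 D AH0 M spn DH AH0 B EH1 R AH0 N IH0 S",
    "data/cvss-c/fr-en/tts/logmelspec80.zip:9649913382:42048",
    "131",
    "6 10 3 6 7 41 1 4 6 8 6 4 4 8 17",
    "data/cvss-c/fr-en/tts/pitch.zip:56518652:248",
    "data/cvss-c/fr-en/tts/energy.zip:39972962:188",
]
sample = ["\t".join(header), "\t".join(row)]
problems = list(check_manifest(sample))
print(problems)  # an empty list means no problems were found
```

To check a real file, pass an open file handle (for example `open("train.tsv", encoding="utf-8")`, where the file name is a placeholder) to `check_manifest` instead of `sample`.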



## Model Training

We provide the training scripts in `train_scripts/`. You can run:

```shell
sh train_scripts/train_s2t.sh       # s2t pretraining
sh train_scripts/train_tts.sh       # tts pretraining
sh train_scripts/train_s2s.sh       # s2s finetuning
```

All models were trained on 4 RTX 3090 GPUs. If you have a different number of GPUs, adjust `--update-freq` so that the effective batch size (number of GPUs × `--update-freq`) stays the same.
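In fairseq, gradients are accumulated over `--update-freq` batches before each parameter update, so the effective batch size scales with the number of GPUs times `--update-freq`. The sketch below shows one way to pick a value that keeps that product constant. The function name `scaled_update_freq` is hypothetical, and the reference `--update-freq` value is an input you would read from the training script yourself; it is not stated here.

```python
def scaled_update_freq(ref_gpus, ref_update_freq, my_gpus):
    """Return the --update-freq that keeps GPUs x update-freq constant.

    ref_gpus:        GPU count the script was written for (4 in this README).
    ref_update_freq: --update-freq value found in the training script.
    my_gpus:         number of GPUs you actually have.
    """
    total = ref_gpus * ref_update_freq
    if my_gpus <= 0 or total % my_gpus != 0:
        raise ValueError(
            f"cannot split {total} accumulated batches evenly over {my_gpus} GPUs"
        )
    return total // my_gpus


# Example with a made-up reference value of 2 (check the script for the real one).
print(scaled_update_freq(ref_gpus=4, ref_update_freq=2, my_gpus=1))  # 8
```

When the product does not divide evenly, choosing the nearest integer changes the effective batch size slightly, which may call for retuning the learning rate.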



## Evaluation

We provide the evaluation scripts in `test_scripts/`. You can run:

```shell
sh test_scripts/generate.fr-en.lookahead.vctk.sh cvss-c.fr-en.daspeech
```

For *Joint-Viterbi* decoding, select the value of the parameter $\beta$ based on performance on the `dev` set.

