# DurMI: Membership Inference via Duration Loss in Diffusion-Based Text-to-Speech Models

This repository is the official implementation of [DurMI: Membership Inference via Duration Loss in Diffusion-Based Text-to-Speech Models]

In this paper, we introduce DurMI, the first membership inference attack to exploit duration loss in diffusion-based TTS models. Duration loss provides a simple, efficient, and highly discriminative signal for detecting whether a speech sample was seen during training, and DurMI outperforms the baseline attacks (Naive Attack, SecMI, PIA, PIAN) listed in the Results section.

<p align="center">
  <img src="https://github.com/user-attachments/assets/d6b28870-09cf-4921-88a8-d0fef3e5e069" alt="Figure: Overview of DurMI" width="600"/>
  <br/>
  <em>Figure 1: Overview of DurMI. The difference between predicted and ground-truth durations from the aligner is used as a membership signal. DurMI requires only a single forward pass up to the decoder stage (red arrow).</em>
</p>
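
As a rough illustration of the membership signal in Figure 1, the sketch below scores a sample by the mean squared error between predicted and aligner durations and flags it as a member when that error is small. It is a minimal sketch, not the repository's attack code: the function names, the log-domain comparison, the `eps` constant, and the threshold value are all assumptions made for illustration.

```python
import math


def duration_loss(pred_log_durations, aligner_durations, eps=1e-8):
    """Mean squared error between predicted and aligner durations in the log domain.

    Assumption: the model predicts log-durations, one per phoneme.
    """
    if not pred_log_durations or len(pred_log_durations) != len(aligner_durations):
        raise ValueError("expected one predicted and one aligner duration per phoneme")
    squared_errors = [
        (pred - math.log(target + eps)) ** 2
        for pred, target in zip(pred_log_durations, aligner_durations)
    ]
    return sum(squared_errors) / len(squared_errors)


def predict_member(loss, threshold):
    """Flag a sample as a training member when its duration loss is below the threshold."""
    return loss < threshold


# Toy values (hypothetical): durations in frames for four phonemes.
aligner = [3, 5, 2, 7]
seen_pred = [math.log(d) for d in [3.1, 4.9, 2.0, 7.2]]    # tracks the aligner closely
unseen_pred = [math.log(d) for d in [5.0, 2.5, 4.0, 3.0]]  # deviates from the aligner

seen_loss = duration_loss(seen_pred, aligner)
unseen_loss = duration_loss(unseen_pred, aligner)
print(predict_member(seen_loss, threshold=0.05), predict_member(unseen_loss, threshold=0.05))
```

In practice the threshold would be chosen from held-out member and non-member samples rather than fixed by hand.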

## Datasets and Pre-trained Checkpoints

We use three datasets (**LJSpeech**, **LibriTTS**, and **VCTK**) to train the following models: **GradTTS**, **WaveGrad2**, and **VoiceFlow**.

All model–dataset combinations were trained separately. We provide the **pre-trained checkpoints** for each setting. You can download all datasets and checkpoints from here: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.15474571.svg)](https://doi.org/10.5281/zenodo.15474571)
 
For each attack folder (e.g., `attack/GradTTS`, `attack/WaveGrad2`, `attack/VoiceFlow`), you must create a dataset subfolder and download the corresponding dataset into it.

**TextGrid** files for each dataset are also provided and can be downloaded from [here](https://drive.google.com/drive/folders/10eUTzOU06gTRMiQPoyw-Yctflms3ZLTJ?usp=sharing).
These files are required for **WaveGrad2** data preprocessing, which is described in the WaveGrad2 section.


## GradTTS
Instead of using the provided checkpoints, you can train the models from scratch on your own data. To train **GradTTS**, go to the `train/Grad-TTS` directory and follow the setup and training instructions:

### Preprocessing
Firstly, install all Python package requirements:
```train
pip install -r requirements.txt
```

Then, build the `monotonic_align` code (Cython):
```train
cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..
```

Note: the code was tested with Python 3.6.9.


### Training
1. Create filelists for your audio data, following the examples in the `resources/filelists` folder. For single-speaker training, refer to the `ljspeech` filelists; for multi-speaker training, refer to the `libri-tts` filelists.
2. Set the experiment configuration in the `params.py` file.
3. Run the training script:
```train
python train.py  # if single speaker
python train_multi_speaker.py  # if multispeaker
```

During training all logging information and checkpoints are stored in `YOUR_LOG_DIR`, which you can specify in `params.py` before training.
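
For step 1, the sketch below shows one way to build a single-speaker filelist from LJSpeech-style metadata rows. It assumes the filelists use a pipe-separated `wav_path|text` layout and that the metadata rows follow the `id|raw text|normalized text` layout of LJSpeech's `metadata.csv`. Check both assumptions against the files in `resources/filelists` before use. The function name and the example path are hypothetical.

```python
from pathlib import Path


def metadata_to_filelist(metadata_lines, wav_dir):
    """Convert `id|raw text|normalized text` rows into `wav_path|text` rows."""
    rows = []
    for line in metadata_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        # The last field is the normalized transcript when it is present.
        utt_id, text = fields[0], fields[-1]
        wav_path = (Path(wav_dir) / f"{utt_id}.wav").as_posix()
        rows.append(f"{wav_path}|{text}")
    return rows


# Hypothetical example row and directory.
example = ["LJ001-0001|Printing, in the only sense|Printing, in the only sense", ""]
filelist = metadata_to_filelist(example, "data/LJSpeech-1.1/wavs")
print(filelist[0])
```

A multi-speaker filelist would need an extra speaker field per row, so compare with the `libri-tts` examples before adapting this.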

### Attack
If you have the checkpoints, you can perform the attack. The attack code for GradTTS is located in `attack/gradtts`. To run our method (DurMI), simply execute:
```attack
python attack_durmi.py --checkpoint <path_to_checkpoint> --dataset <dataset_name>
```

To run the baseline attacks (Naive Attack, SecMI, PIA), use the following commands:
```attack
cd attack/gradtts
python attack_baseline.py --checkpoint <path_to_checkpoint> --dataset <dataset_name> --attacker_name <attacker_name> --attack_num <attack_num> --interval <interval>
```

The attack code also includes the evaluation process. Once the attack is completed, the evaluation results—including JSON files and graphs—are saved in the same folder.

## WaveGrad2

To train **WaveGrad2**, go to the `train/WaveGrad2` directory and follow the setup and training instructions:

Firstly, install all Python package requirements:
```train
pip install -r requirements.txt
```

Then, run the following command to prepare the data:
```train
python3 prepare_align.py config/LJSpeech/preprocess.yaml
```

### Preprocessing
As described in the paper, the Montreal Forced Aligner (MFA) is used to obtain alignments between the utterances and the phoneme sequences. Alignments for each dataset are provided [here](https://drive.google.com/drive/folders/10eUTzOU06gTRMiQPoyw-Yctflms3ZLTJ?usp=sharing). Unzip the files into `preprocessed_data/<dataset_name>/TextGrid/`.

Then run the preprocessing script:

```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

Alternatively, you can align the corpus yourself. Download the official MFA package and run
```
./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech
```
or
```
./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
```

to align the corpus, then run the preprocessing script:
```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

### Training
After data preprocessing is complete, you can run the training code to start training the model.
```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

### Attack
If you have the checkpoints, you can perform the attack. The attack code for WaveGrad2 is located in `attack/wavegrad2`. To run our method (DurMI), simply execute:
```attack
python3 attack_wg_durloss.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

If you want to run other baseline attacks (Naive Attack, SecMI, PIA), use the following command:
```attack
cd attack/wavegrad2
python3 attack_wg_baseline.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```
Set the `attack_name` variable inside the code to the chosen attack method. Valid values are `naive`, `secmi`, `pia`, and `pian`.

## VoiceFlow

To train **VoiceFlow**, go to the `train/VoiceFlow` directory and follow the setup and training instructions below.

VoiceFlow was tested with Python 3.9 on Linux. You can set up the environment with conda:
```
# Install required packages
conda create -n vflow python==3.9  # or any name you like
conda activate vflow
pip install -r requirements.txt

# Then, set PATH
source path.sh  # change the env name in it if you don't use "vflow"

# Install monotonic_align for MAS
cd model/monotonic_align
python setup.py build_ext --inplace
```

### Preprocessing
VoiceFlow relies on Kaldi-style data organization. All data description files should be placed in subdirectories of `data/`. See `data/ljspeech/example` for a basic example, in which the following plain-text files are required:

1. `wav.scp`: organized as `utt /path/to/wav`.
2. `utts.list`: one utterance per line. It can be obtained with `cut -d ' ' -f 1 wav.scp > utts.list`.
3. `utt2spk`: organized as `utt spk_name`.
4. `text` and `phn_duration`: the phoneme sequences and the corresponding integer durations (in frames). There is also a `data/ljspeech/phones.txt` file that lists all phones together with their indexes in the dictionary.
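
The first three manifest files can be generated from a directory of audio files. The sketch below is one possible way to do it, assuming a flat directory of `.wav` files, utterance IDs taken from the file names, and a single speaker name for all utterances. The function name and its parameters are hypothetical, and `text` and `phn_duration` are not covered because they require alignments.

```python
from pathlib import Path


def write_manifests(wav_dir, out_dir, speaker="ljspeech"):
    """Write wav.scp, utts.list and utt2spk for every .wav file found in wav_dir."""
    wav_paths = sorted(Path(wav_dir).glob("*.wav"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    utts = [p.stem for p in wav_paths]

    # wav.scp: "<utt> <path/to/wav>" per line
    (out / "wav.scp").write_text("".join(f"{p.stem} {p.resolve()}\n" for p in wav_paths))
    # utts.list: one utterance ID per line (same result as `cut -d ' ' -f 1 wav.scp`)
    (out / "utts.list").write_text("".join(f"{u}\n" for u in utts))
    # utt2spk: "<utt> <spk_name>" per line; one shared speaker name is assumed here
    (out / "utt2spk").write_text("".join(f"{u} {speaker}\n" for u in utts))
    return utts
```

For example, `write_manifests("raw_data/LJSpeech/wavs", "data/ljspeech/train")` would write the three files into `data/ljspeech/train`. Both paths are placeholders.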
   
Once these manifest files are in place, extract the mel-spectrograms for training:
```
bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 16
# nj: number of parallel jobs. 
# Have a look into the script if you need to change something
# Bash variables before "parse_options.sh" can be passed by CLI, e.g. "--key value".
```

Note that 16kHz data is used by default. This step creates `feats/fbank` and `feats/normed_fbank`, where Kaldi-style scp and ark files store the mel-spectrogram data. The normed features are used for training.

If you want to train with speaker IDs (as for LJSpeech) instead of pretrained speaker embeddings such as xvectors, run:

```
make_utt2spk_id.py data/ljspeech/train/utt2spk data/ljspeech/val/utt2spk
# You can add more files in CLI. Will write utt2num_frames in the same directory to these files.
```

### Training
```
python train.py -c configs/${your_yaml} -m ${model_name}
# e.g. python train.py -c configs/lj_16k_gt_dur.yaml -m lj_16k_gt_dur
```
This creates `logs/${model_name}` for logging and checkpointing.
Set `use_gt_dur` to false to enable the MAS algorithm. In that case, setting `add_blank` to true is recommended.

### Attack
If you have the checkpoints, you can perform the attack. The attack code for VoiceFlow is located in `attack/voiceflow`. To run our method (DurMI), simply execute:
```attack
python attack_vf_durloss.py -c configs/lj_16k_gt_dur.yaml -m logs/lj_16k_gt_dur --EMA --solver euler -t 100
```
`-c`: Path to the configuration file. Replace it with the config file corresponding to the dataset you want to use.

`-m`: Path to the model directory (where the checkpoint is located).

`--solver`: Specifies the solver method (e.g., euler).

`-t`: Number of diffusion steps. You can change this value as needed.





## Results

DurMI achieves the following performance compared with the baseline attacks. Each cell reports AUC / TPR@1%FPR:

### [GradTTS](https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS)

| Attack name   | LJSpeech (AUC / TPR@1%FPR) | LibriTTS (AUC / TPR@1%FPR) | VCTK (AUC / TPR@1%FPR) |
|---------------|-----------------------------|------------------------------|--------------------------|
| Naive Attack | 86.7 / 55.0            | 94.5 / 58.1               | 73.2 / **29.5**           |
| SecMI        | 94.4 / 70.3            | 90.2 / 55.2               | 72.8 / 8.1            |
| PIA          | 89.0 / 55.0            | 89.3 / 47.0               | 64.4 / 9.7            |
| PIAN         | 69.0 / 37.4            | 81.8 / 37.4               | 66.6 / 6.1            |
| **DurMI (Ours)**  | **99.8 / 98.9**        | **98.9 / 83.5**           | **76.8** / 9.6        |



### [WaveGrad2](https://github.com/keonlee9420/WaveGrad2)

| Attack name       | LJSpeech (AUC / TPR@1%FPR) | LibriTTS (AUC / TPR@1%FPR) | VCTK (AUC / TPR@1%FPR) |
|-------------------|-----------------------------|------------------------------|--------------------------|
| Naive Attack | 50.1 / 1.0                  | 54.3 / 0.6                   | 59.9 / 1.5               |
| SecMI        | 49.4 / 1.0                  | 47.6 / 0.3                   | 55.4 / 1.0               |
| PIA          | 50.8 / 0.4                  | 51.7 / 0.1                   | 52.1 / 0.8               |
| PIAN         | 50.3 / 0.1                  | 50.2 / 0.1                   | 44.7 / 0.1               |
| **DurMI (Ours)**  | **99.9 / 100.0**            | **100.0 / 100.0**            | **97.4 / 50.9**          |
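
The two metrics in the tables above can be computed from per-sample attack scores. The sketch below is a plain-Python illustration, not the repository's evaluation code. It assumes that a higher score means "more likely a member" (for example, the negated duration loss), and the function names and toy scores are hypothetical.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """True-positive rate at a threshold whose false-positive rate stays within max_fpr."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # number of false positives tolerated
    # Predict "member" when score > threshold; at most `allowed` non-members exceed it.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    true_positives = sum(1 for s in member_scores if s > threshold)
    return true_positives / len(member_scores)


# Toy scores: negated losses, so higher means "more likely a member".
member_scores = [-0.1, -0.2, -0.3, -0.9]
nonmember_scores = [-0.5, -0.8, -1.0, -1.2]
print(auc(member_scores, nonmember_scores))         # 0.875
print(tpr_at_fpr(member_scores, nonmember_scores))  # 0.75
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for a sanity check. For large evaluation sets a library routine such as scikit-learn's `roc_curve` is the more practical choice.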



### ROC curves comparing MIA methods on the Grad-TTS model across various datasets
![image](https://github.com/user-attachments/assets/f6702512-c63f-444d-88e5-dd6c83a086d2)



### Member vs. Non-member distribution separability using diffusion loss (Naive, SecMI, PIA, PIAN) vs. duration loss (DurMI) across datasets: LJSpeech (LJ), LibriTTS (Libri), and VCTK.
![image](https://github.com/user-attachments/assets/8ab3e3f3-39f3-44bc-80a7-8bc65497028e)
