# Overview
This repository provides the training and testing code for the paper "Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments".

We provide audio examples of the successful and failure cases shown in the paper's appendix. The audio files for the 8 cases are stored in 8 separate folders.

In each folder, we include the following 5 audio files:
- `*-mix.wav` is the Mixed Audio input to our model. Our model aims to extract the target speaker from this audio mixture.
- `*-{pos/neg}.wav` are the Positive and Negative Enrollments used by our model to learn the target speaker's voice characteristics.
- `*-tgt.wav` is the ground-truth target speaker's voice in the Mixed Audio.
- `*-out.wav` is the target speaker's voice in the Mixed Audio as extracted by our model.
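
As a minimal sketch of how one example folder could be inspected, the helper below groups the `.wav` files by the role encoded in their suffix. The function name `collect_case_files` and the `ROLES` constant are illustrative and not part of this repository; the sketch only assumes the `*-{role}.wav` naming described above.

```python
from pathlib import Path

# Roles follow the file suffixes listed above.
ROLES = ("mix", "pos", "neg", "tgt", "out")


def collect_case_files(folder):
    """Map each role (mix/pos/neg/tgt/out) to the matching .wav path in `folder`."""
    files = {}
    for path in sorted(Path(folder).glob("*.wav")):
        # "case1-mix.wav" -> stem "case1-mix" -> role "mix"
        role = path.stem.rsplit("-", 1)[-1]
        if role in ROLES:
            files[role] = path
    return files
```

Calling `collect_case_files` on one of the example folders returns a dictionary such as `{"mix": Path(...), "pos": Path(...), ...}`, which can then be passed to any audio loader.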

# Prepare datasets

1. Download the `train-360`, `dev-clean`, and `test-clean` splits of the [LibriSpeech dataset](https://www.openslr.org/12) into the `data/LibriSpeech` directory, and provide their names in the `{train/val}_dataset_dir` variables in the yaml file used in training.

2. Download the [WHAM! noise dataset](http://wham.whisper.ai/) into the `data/wham_noise` directory, and provide its path in the `noise_dir` variable in the yaml file used in training.

3. (optional) Download the BRIR datasets for binaural model training. We used the same BRIR datasets as LookOnceToHear, whose authors provide self-contained copies [here](https://drive.google.com/drive/u/1/folders/1-Jx23GXdjPe33EF5jGZpj6zn-kIm5jHR). Download and unzip these datasets in the `data` directory and provide their names in the `brir_dir` variable in the yaml file used in training.
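
Before launching training, it can help to confirm that the datasets landed where the steps above expect them. The following is a minimal sketch, not part of this repository: `missing_dataset_dirs` is a hypothetical helper, and the split folder names are copied from step 1 (the official LibriSpeech archive unpacks to `train-clean-360`, so adjust `splits` to match your local folder names). The optional BRIR datasets are not checked.

```python
from pathlib import Path

# Split names are copied from the instructions above; adjust them if your
# unpacked LibriSpeech folders are named differently.
LIBRISPEECH_SPLITS = ("train-360", "dev-clean", "test-clean")


def missing_dataset_dirs(data_root="data", splits=LIBRISPEECH_SPLITS):
    """Return the expected dataset directories that do not exist yet."""
    root = Path(data_root)
    expected = [root / "LibriSpeech" / split for split in splits]
    expected.append(root / "wham_noise")
    return [path for path in expected if not path.is_dir()]
```

An empty return value means every expected directory was found; otherwise the list names the directories that still need to be downloaded or renamed.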

# (optional) Download the baselines' implementation for fine-tuning or evaluation

1. To perform fine-tuning or evaluation on the TCE baseline method, clone the [TCE repository](https://github.com/chentuochao/Target-Conversation-Extraction) and move the following directories and files into the `src_tce` folder:
- `hl_modules`
- `losses`
- `models`
- `utils.py`
- `metrics/metrics.py`

2. To perform fine-tuning or evaluation on the LookOnceToHear baseline method, clone the [LookOnceToHear repository](https://github.com/vb000/LookOnceToHear/tree/main) and move the `src_lookonce` directory into the root of this repository.

3. To perform evaluation on the SpeakerBeam baseline method, follow the instructions in the [SpeakerBeam repository](https://github.com/BUTSpeechFIT/speakerbeam) to download the SpeakerBeam model trained on the LibriSpeech dataset, and move its `src/models` folder into our `src_speakerbeam` folder.
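
The TCE step in item 1 can be scripted. The sketch below is illustrative only: `copy_tce_sources` is a hypothetical helper, it copies rather than moves so the clone stays intact, and it assumes that relative paths are preserved (so `metrics/metrics.py` ends up at `src_tce/metrics/metrics.py`). Where the listed items live inside the cloned TCE repository is not stated above, so that location is passed in as `tce_src`.

```python
import shutil
from pathlib import Path

# Items listed above for the TCE baseline, relative to wherever they live
# in the cloned TCE repository.
TCE_ITEMS = ("hl_modules", "losses", "models", "utils.py", "metrics/metrics.py")


def copy_tce_sources(tce_src, dest="src_tce", items=TCE_ITEMS):
    """Copy the listed TCE files and directories into `dest`."""
    tce_src, dest = Path(tce_src), Path(dest)
    copied = []
    for item in items:
        source, target = tce_src / item, dest / item
        if not source.exists():
            raise FileNotFoundError(f"{source} not found in the TCE checkout")
        # Create intermediate folders such as src_tce/metrics when needed.
        target.parent.mkdir(parents=True, exist_ok=True)
        if source.is_dir():
            shutil.copytree(source, target, dirs_exist_ok=True)
        else:
            shutil.copy2(source, target)
        copied.append(target)
    return copied
```

The function raises `FileNotFoundError` as soon as one listed item is missing, which makes a wrong `tce_src` path easy to spot.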

# Training 

Our first-stage training performs knowledge distillation from a trained TFGridnet encoder. We used the trained encoder available [here](https://drive.google.com/file/d/1CP0zbZExcqvNLdP9epyhY4fEVp_oQr59/view), stored at `runs/embed/best.ckpt`. Download the checkpoint and move it into the `model` directory.

Both training stages use the same training code but different hyperparameters, which are provided through the yaml file. For example, to train the monaural model:

- first stage training: ```python train.py Hyperparameter_monaural_stage-1.yaml```

- second stage training: ```python train.py Hyperparameter_monaural_stage-2.yaml```

To train the binaural model, use the `Hyperparameter_binaural_stage-{1/2}.yaml` files.
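
The two stages can also be chained from a small driver script. This is a sketch under assumptions rather than a script shipped with the repository: `stage_commands` and `train_both_stages` are hypothetical names, the script is assumed to run from the repository root, and the stage-2 yaml file is assumed to already point at whatever stage-1 outputs it needs.

```python
import subprocess
import sys


def stage_commands(variant="monaural"):
    """Build the stage-1 and stage-2 training commands for one model variant."""
    if variant not in ("monaural", "binaural"):
        raise ValueError(f"unknown variant: {variant}")
    return [
        [sys.executable, "train.py", f"Hyperparameter_{variant}_stage-{stage}.yaml"]
        for stage in (1, 2)
    ]


def train_both_stages(variant="monaural"):
    """Run stage 1, then stage 2; stop immediately if a stage fails."""
    for command in stage_commands(variant):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so stage 2 never starts after a failed stage 1.
        subprocess.run(command, check=True)
```

Calling `train_both_stages("binaural")` runs the same two commands with the binaural yaml files.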

After training, the checkpoints and the losses are saved in the `output` directory.

# Evaluation

To evaluate a model, specify its checkpoint path in the `model` section of the `eval_{monaural/binaural}.py` file, adjust the hyperparameters in the `hyperparams` section, and run `python eval_{monaural/binaural}.py`.
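
To find a checkpoint path to paste into the `model` section, a helper like the one below can list the most recently written checkpoint. This is only a sketch: `newest_checkpoint` is a hypothetical function, the `*.ckpt` pattern is an assumption about how checkpoints in `output` are named, and the newest checkpoint is not necessarily the best-performing one.

```python
from pathlib import Path


def newest_checkpoint(output_dir="output", pattern="*.ckpt"):
    """Return the most recently modified checkpoint under `output_dir`, or None."""
    # rglob searches subdirectories too, since the layout of `output`
    # is not specified here.
    candidates = list(Path(output_dir).rglob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)
```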
