# USR 2.0
**Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition**

![Overview](assets/overview_usr2.png)

## Introduction

This repository contains the implementation of **USR 2.0**, used to reproduce the main experiments in the accompanying paper. It is based on [PyTorch Lightning](https://www.pytorchlightning.ai/), with [Hydra](https://hydra.cc/docs/intro/) for configuration and [Weights & Biases](https://wandb.ai/) for logging.

## Preparation

### Installation

```bash
conda env create -f environment.yml
```

Adjust the environment as needed. To enable `fairseq`-based utilities, follow the instructions in `fairseq_manual/setup_data_utils.py`.

### Data

1. Download the raw video/audio datasets:
   - [LRS3](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html)
   - [LRS2](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html)
   - [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html)
   - [AVSpeech](https://looking-to-listen.github.io/avspeech/)
   - [LibriSpeech](https://www.openslr.org/12)

2. Extract facial landmarks using:
   - [RetinaFace](https://github.com/biubug6/Pytorch_Retinaface)
   - [2D-FAN](https://github.com/1adrianb/face-alignment)

   Alternatively, download precomputed landmarks from [this repo](https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages/blob/master/models/README.md).

3. Crop mouth ROIs:
   ```bash
   python preprocessing/extract_mouths.py \
     --src_dir ${SOURCE_DIR} \
     --tgt_dir ${TARGET_DIR} \
     --landmarks_dir ${LANDMARKS_DIR}
   ```
4. Download preprocessed CSVs with tokenised labels:
   - [LRS3 test set](https://drive.google.com/file/d/1eOZXM5LiJOK92EzXMC-eDyFtUKlNGIUJ/view?usp=sharing)
   - [LRS3 trainval set](https://drive.google.com/file/d/1AvdYktN5OKc8eNcwO-Xn9N9hSlwD0dQK/view?usp=sharing)
   - [LRS3 training set](https://drive.google.com/file/d/11NeU9zqNlFeHYmpr6CxnXANsCZyZdcu1/view?usp=sharing)
   - [LRS3 validation set](https://drive.google.com/file/d/17h7HwysmhrFVFImBWIQZUCMJ8xkrMkgZ/view?usp=sharing)
   - [LRS3+VoxCeleb2 combined](https://drive.google.com/file/d/1cRhgQdNYUniEaH7a-E7YjfdEDJJ3N16f/view?usp=sharing)
   - [LRS2+LRS3+VoxCeleb2+AVSpeech combined](https://drive.google.com/file/d/1DX4Afk_yn5fMgWHPEMilZotRLBfCu1cU/view?usp=sharing)

**Important**: Set the video and audio directory paths for each dataset in `conf/data/default.yaml`.

## Out-of-Distribution Results
### Robustness to Long Utterances

![Length generalisation](assets/length_generalisation.png)

To reproduce the results for length-based generalisation (see Figure 3 in the paper, also shown above), run the following command using the pretrained **Base** model from the low-resource setting:

```bash
python main.py \
  test=True \
  experiment_name=length_generalisation \
  data.frames_per_gpu_val=700 \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path={PATH_TO_CHECKPOINT} \
  data.dataset.test_csv={PATH_TO_CSV} \
  decode.beam_size={BEAM_SIZE} \
  decode.ctc_weight={CTC_WEIGHT} \
  decode.maxlenratio=0.4 \
  num_workers=4
```

Replace:

- `{PATH_TO_CHECKPOINT}` with the path to the pretrained [low-resource Base model](https://drive.google.com/file/d/1S-gTw2K-AaYAZFQFknx50-ymaj4qTfcX/view?usp=sharing).
- `{PATH_TO_CSV}` with the path to one of the length-bucketed test CSVs.
- `{BEAM_SIZE}` and `{CTC_WEIGHT}` with the desired decoding hyperparameters (1 and 0.0, respectively, for attention-based greedy decoding).

#### Test CSVs by Sequence Length

Download the evaluation CSV files for different input lengths from the following links:

- [100–150 frames](https://drive.google.com/file/d/1wEkQQTtHRDidlZlkyMuYZRyad2ETR_n-/view?usp=sharing)
- [150–200 frames](https://drive.google.com/file/d/16Iw3BFvrzJUM6-0JK4jD2bJU9IXJZ6fp/view?usp=sharing)
- [200–250 frames](https://drive.google.com/file/d/173_aTcd7GOHEzJSEA5ZrSZ1NyzRkbWG3/view?usp=sharing)
- [250–300 frames](https://drive.google.com/file/d/1N-cqlLc0CLZRU1VS0Ww1zcXeaNlJp_mJ/view?usp=sharing)
- [300–350 frames](https://drive.google.com/file/d/1GWSUgVaSVLu1x9SCeOOwedLVQwHNyEoh/view?usp=sharing)
- [350–400 frames](https://drive.google.com/file/d/1OHhStEacguOIjRCjWMvYXI17JwtbhEY4/view?usp=sharing)
- [400–450 frames](https://drive.google.com/file/d/179xVv5LdefBjMa3zchKzbGzTkAgrJFPM/view?usp=sharing)
- [450–500 frames](https://drive.google.com/file/d/1Y-2obSGIJxwJ-wrzwBR9IdJ_jAALLIVj/view?usp=sharing)
- [500–550 frames](https://drive.google.com/file/d/114drg2q-Lj2kfM81AF1-SfmCAzQMHeyK/view?usp=sharing)
- [550–600 frames](https://drive.google.com/file/d/1Fo2sUhyChYEcJzmiFheA3bXsZxR958_E/view?usp=sharing)
- [Combined](https://drive.google.com/file/d/14psWYLi9Qmo80pIBIEOH0IaYwlevPqdy/view?usp=sharing)

Each CSV (except the last) corresponds to a specific range of video lengths (in number of frames). After running inference on each subset, you can collect the WERs and plot them against the corresponding sequence length to reproduce USR 2.0's length generalisation curves from the paper.
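
The collection step described above could be sketched as follows. This is a minimal, illustrative sketch and not part of the repository: the helpers `bucket_midpoint`, `curve_points`, and `plot_wer_curve`, as well as the output file name, are hypothetical, and the WER values must come from your own runs. The bucket ranges mirror the CSVs listed above. Plotting assumes `matplotlib` is installed, which the source does not state.

```python
from typing import Dict, List, Tuple

# Frame ranges of the length-bucketed test CSVs listed above:
# (100, 150), (150, 200), ..., (550, 600).
BUCKETS: List[Tuple[int, int]] = [(lo, lo + 50) for lo in range(100, 600, 50)]


def bucket_midpoint(bucket: Tuple[int, int]) -> float:
    """Return the centre of a (min_frames, max_frames) bucket."""
    lo, hi = bucket
    return (lo + hi) / 2


def curve_points(wers: Dict[Tuple[int, int], float]) -> List[Tuple[float, float]]:
    """Turn {bucket: WER} into (midpoint, WER) pairs sorted by length."""
    missing = [b for b in BUCKETS if b not in wers]
    if missing:
        raise ValueError(f"no WER recorded for buckets: {missing}")
    return sorted((bucket_midpoint(b), wers[b]) for b in BUCKETS)


def plot_wer_curve(wers: Dict[Tuple[int, int], float],
                   out_path: str = "length_generalisation.png") -> None:
    """Plot WER against sequence length (hypothetical helper; needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily: optional dependency

    xs, ys = zip(*curve_points(wers))
    plt.plot(xs, ys, marker="o")
    plt.xlabel("Sequence length (frames)")
    plt.ylabel("WER (%)")
    plt.savefig(out_path, bbox_inches="tight")
```

After evaluating each bucket, fill a dictionary keyed by the `(min_frames, max_frames)` tuples with the reported WERs and pass it to `plot_wer_curve`. Using bucket midpoints on the x-axis is a presentation choice made here, not something the source prescribes.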

### Robustness to Noise

| Modality | 10 dB | 5 dB | 0 dB | -5 dB | Avg | 10 dB (>100f) | 5 dB (>100f) | 0 dB (>100f) | -5 dB (>100f) | Avg (>100f) |
|----------|-------|------|------|-------|-----|---------------|--------------|--------------|---------------|-------------|
| **ASR**  | 5.2 | 13.4 | 44.0 | 94.4 | 39.3 | 3.8 | 10.6 | 42.8 | 98.3 | 38.9 |
| **AVSR** | 3.7 | 5.6  | 14.0 | 33.1 | 14.1 | 2.6 | 4.3  | 10.4 | 26.0 | 10.8 |

To reproduce the noise-robustness results in Table 1 of the paper (shown above), run the following command:

```bash
python main.py \
  test=True \
  experiment_name=noise_generalisation \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path={PATH_TO_CHECKPOINT} \
  data.dataset.test_csv={PATH_TO_CSV} \
  data.noise_path={PATH_TO_NOISE} \
  decode.beam_size=30 \
  decode.ctc_weight=0.1 \
  decode.maxlenratio=0.4 \
  decode.snr_target={SNR} \
  num_workers=4
```

Replace:

- `{PATH_TO_CHECKPOINT}` with the full path to the pretrained low-resource Base model.
- `{PATH_TO_NOISE}` with the full path to the babble noise data, which can be downloaded [here](https://drive.google.com/file/d/1d8D6FcsftCotnE14nu9C2q1v6N0KnYCC/view?usp=sharing).
- `{PATH_TO_CSV}` with the path to the LRS3 test set CSV.
- `{SNR}` with one of: `10`, `5`, `0`, or `-5` (in dB).
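
To sweep all four SNR levels, the command above can be generated programmatically. The sketch below is illustrative and not part of the repository: `noise_command`, `run_sweep`, and `mean_wer` are hypothetical helpers, and all paths are supplied by the caller. The Hydra overrides are copied from the command above. The `mean_wer` helper reflects the observation that the Avg columns in the table are consistent, up to rounding, with the arithmetic mean of the four SNR columns.

```python
import subprocess
from typing import List, Sequence

SNRS_DB = [10, 5, 0, -5]  # SNR levels (dB) evaluated in the table above


def noise_command(checkpoint: str, test_csv: str, noise_path: str, snr: int) -> List[str]:
    """Build the argv for one noisy evaluation run (overrides as in the README)."""
    return [
        "python", "main.py",
        "test=True",
        "experiment_name=noise_generalisation",
        "model/backbone=resnet_transformer_base",
        f"model.pretrained_model_path={checkpoint}",
        f"data.dataset.test_csv={test_csv}",
        f"data.noise_path={noise_path}",
        "decode.beam_size=30",
        "decode.ctc_weight=0.1",
        "decode.maxlenratio=0.4",
        f"decode.snr_target={snr}",
        "num_workers=4",
    ]


def run_sweep(checkpoint: str, test_csv: str, noise_path: str) -> None:
    """Run the evaluation once per SNR level, stopping on the first failure."""
    for snr in SNRS_DB:
        subprocess.run(noise_command(checkpoint, test_csv, noise_path, snr), check=True)


def mean_wer(wers: Sequence[float]) -> float:
    """Arithmetic mean of per-SNR WERs (%)."""
    return sum(wers) / len(wers)
```

Calling `run_sweep` from the repository root with real paths would launch the four evaluations in sequence; the WERs read from each run can then be averaged with `mean_wer`.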

### Robustness to OOD Datasets

| Modality | Dataset     | WER (%) |
|----------|-------------|---------|
| ASR      | LibriSpeech | 15.4 |
| VSR      | WildVSR     | 73.7 |
| AVSR     | AVSpeech    | 25.0 |

To reproduce the out-of-distribution (OOD) results in Table 2 of the paper (shown above), run the following commands using the pretrained **Base** model (low-resource setting). All evaluations use greedy decoding (`beam_size=1`, `ctc_weight=0.0`).

#### LibriSpeech (ASR)

```bash
python main_libri.py \
  test=True \
  data.frames_per_gpu_val=700 \
  experiment_name=libri_generalisation \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path={PATH_TO_CHECKPOINT} \
  data.dataset.test_csv={PATH_TO_LIBRI_CSV} \
  decode.beam_size=1 \
  decode.ctc_weight=0.0 \
  decode.maxlenratio=0.4 \
  num_workers=4
```

- [Download LibriSpeech test-clean CSV](https://drive.google.com/file/d/1F0ewDWVjFeZ9ZAs7E251rSJEFsI8_C1o/view?usp=sharing)

#### WildVSR (VSR)

```bash
python main_wild.py \
  test=True \
  data.frames_per_gpu_val=700 \
  experiment_name=wild_generalisation \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path={PATH_TO_CHECKPOINT} \
  data.dataset.test_csv={PATH_TO_WILD_CSV} \
  decode.beam_size=1 \
  decode.ctc_weight=0.0 \
  decode.maxlenratio=0.4 \
  num_workers=4
```

- [Download WildVSR test CSV](https://drive.google.com/file/d/1WHaH9vupxV4YP4rAKCeDiWBc9FIU6P0B/view?usp=sharing)

#### AVSpeech (AVSR)

```bash
python main.py \
  test=True \
  data.frames_per_gpu_val=700 \
  experiment_name=avsp_generalisation \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path={PATH_TO_CHECKPOINT} \
  data.dataset.test_csv={PATH_TO_AVSP_CSV} \
  decode.beam_size=1 \
  decode.ctc_weight=0.0 \
  decode.maxlenratio=0.4 \
  num_workers=4
```

- [Download AVSpeech test CSV](https://drive.google.com/file/d/1CBFCfM3xcZ--TAkyEOYippAaf41v7p5r/view?usp=sharing)

## In-Distribution Results

We provide checkpoints for reproducing the in-distribution results (see Table 3 in the paper, reproduced below).

### Low-resource

| Model     | Pretraining Dataset | V (%) | A (%) | AV (%) | Checkpoint     |
|-----------|---------------------|-------|-------|--------|----------------|
| Base      | LRS3                | 36.2  | 3.0   | 2.9    | [Download](https://drive.google.com/file/d/1S-gTw2K-AaYAZFQFknx50-ymaj4qTfcX/view?usp=sharing)   |
| Base Plus | LRS3+Vox2           | 26.4  | 2.5   | 2.4    | [Download](https://drive.google.com/file/d/15K4I2eYHU0CzjsUSRdWickU9dIX0fFAh/view?usp=sharing)   |
| Large     | LRS3+Vox2           | 23.7  | 2.3   | 2.2    | [Download](https://drive.google.com/file/d/1FIculf1Mfo73Y2f_dSu-HDw1Jb36pW-r/view?usp=sharing)   |

### High-resource

| Model     | Pretraining Dataset     | V (%) | A (%) | AV (%) | Checkpoint     |
|-----------|-------------------------|-------|-------|--------|----------------|
| Base Plus | LRS3+Vox2               | 24.8  | 1.4   | 1.2    | [Download](https://drive.google.com/file/d/18vmJjdem5XPOA8bmizybIW5sLJuHMdRR/view?usp=sharing)   |
| Large     | LRS3+Vox2               | 21.5  | 1.3   | 1.0    | [Download](https://drive.google.com/file/d/1XDNLfZMv_nn8ALiFmDh3asrJKGC2uHs1/view?usp=sharing)   |
| Huge      | LRS2+LRS3+Vox2+AVS      | 17.6  | 0.9   | 0.8    | [Download](https://drive.google.com/file/d/1LzFOTYu45zCLOHGVLQt7pMGjw6jmmo9Y/view?usp=sharing)   |

To evaluate a pretrained model, run the following command, replacing the placeholders with appropriate paths and model configuration:

```bash
python main.py \
  experiment_name=base_low_resource_lrs3 \
  test=True \
  data.dataset.test_csv={PATH_TO_TEST_CSV} \
  model/backbone={MODEL_TYPE} \
  model.pretrained_model_path={PATH_TO_CHECKPOINT}
```

Replace `{MODEL_TYPE}` with one of the following:

- `resnet_transformer_base`
- `resnet_transformer_baseplus`
- `resnet_transformer_large`
- `resnet_transformer_huge`

Ensure that:

- `data.dataset.test_csv` points to the correct test CSV (LRS3 test set)
- `model.pretrained_model_path` is set to the full path of the downloaded `.pth` checkpoint
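
The correspondence between the model names in the tables above and `{MODEL_TYPE}` can be captured in a small lookup. This is an illustrative sketch, not repository code: the `BACKBONES` mapping and the `eval_overrides` helper are hypothetical, and the pairing of table names with backbone configs is inferred from their names rather than stated by the source.

```python
from typing import Dict, List

# Assumed pairing of model names (tables above) with backbone configs,
# inferred from the naming; verify against the repository's configs.
BACKBONES: Dict[str, str] = {
    "Base": "resnet_transformer_base",
    "Base Plus": "resnet_transformer_baseplus",
    "Large": "resnet_transformer_large",
    "Huge": "resnet_transformer_huge",
}


def eval_overrides(model: str, checkpoint: str, test_csv: str,
                   experiment_name: str = "base_low_resource_lrs3") -> List[str]:
    """Return the Hydra overrides for evaluating one pretrained model."""
    if model not in BACKBONES:
        raise KeyError(f"unknown model {model!r}; expected one of {sorted(BACKBONES)}")
    return [
        f"experiment_name={experiment_name}",
        "test=True",
        f"data.dataset.test_csv={test_csv}",
        f"model/backbone={BACKBONES[model]}",
        f"model.pretrained_model_path={checkpoint}",
    ]
```

The returned list can be appended to `["python", "main.py"]` and passed to `subprocess.run`. Failing early on an unknown model name avoids launching a run with a backbone that does not match the checkpoint.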

## Training

To replicate the full results, adapt the following scripts. The self-supervised checkpoints (whose paths need to be specified in `model.pretrained_model_path`) can be found at the [USR repo](https://github.com/ahaliassos/usr). The self-supervised checkpoint for the Huge model can be found [here](https://drive.google.com/file/d/1KARf06-70SpI6kkHfoaaiDksL-1M_ojy/view?usp=sharing). To train from scratch, leave `model.pretrained_model_path` blank.

### Low-resource

| Model     | Dataset      | Script                                   |
|-----------|--------------|------------------------------------------|
| Base      | LRS3         | `scripts/train/base_low_resource_lrs3.sh`      |
| Base Plus | LRS3+Vox2 | `scripts/train/baseplus_low_resource_lrs3vox2.sh`  |
| Large     | LRS3+Vox2 | `scripts/train/large_low_resource_lrs3vox2.sh`     |

### High-resource

| Model     | Dataset      | Script                                   |
|-----------|--------------|------------------------------------------|
| Base Plus | LRS3+Vox2 | `scripts/train/baseplus_high_resource_lrs3vox2.sh` |
| Large     | LRS3+Vox2 | `scripts/train/large_high_resource_lrs3vox2.sh`    |
| Huge      | LRS2+LRS3+Vox2+AVS | `scripts/train/huge_high_resource_lrs2lrs3vox2avsp.sh` |
