# CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition
Code for the paper *CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition*.

This repository provides a minimal, standalone implementation of CTC-DRO. It includes two main components:

- **CTC-DRO loss function:** [`ctc_dro.py`](./ctc_dro.py) contains a standalone implementation of the CTC-DRO loss function.
- **Length-matched batch sampler:** [`duration_batch_sampler.py`](./duration_batch_sampler.py) implements a batch sampler that returns batches of data points from the same group whose durations sum to approximately a specified target length.

These modules are designed for seamless integration with the [ESPNet](https://github.com/espnet/espnet) framework but can also be incorporated into other codebases as long as the input formats are compatible.

---

## Batching

The file [`duration_batch_sampler.py`](./duration_batch_sampler.py) implements the `DurationBatchSampler`, which ensures that:
- Each batch contains only examples from one language or category.
- The total duration of audio in each batch is approximately equal to a specified target (`duration_batch_length`).
- Batches are uniformly shuffled so that examples from different groups are well distributed over training iterations.

**Requirements:**
1. **Shape files:**  
   A set of files containing the duration of each audio file. Each file should be formatted with rows of:  
   `<data_point_id> <duration>`  
   (Durations are read with a comma-separated integer loader, so each duration must be an integer value.)
2. **utt2category file:**  
   A file mapping each data point to its group. Each line should be formatted as:  
   `<data_point_id> <group_id>`

The sampler uses these files to verify that all keys match, sort utterances by duration (largest first), and apply a greedy bin-packing algorithm to generate duration-equalized batches.
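
The batching procedure described above can be sketched as follows. This is a minimal illustration, not the code in `duration_batch_sampler.py`: the function name `make_duration_batches`, its arguments, and the first-fit packing rule are assumptions made for the example.

```python
import random
from collections import defaultdict


def make_duration_batches(durations, utt2category, target, seed=0):
    """Pack utterances into single-group batches of roughly `target` total duration.

    durations:     dict mapping utterance ID -> duration (any consistent unit)
    utt2category:  dict mapping utterance ID -> group ID
    target:        desired total duration per batch
    """
    # The two inputs must describe exactly the same utterances.
    if set(durations) != set(utt2category):
        raise ValueError("keys of durations and utt2category do not match")

    # Split utterances by group so that no batch mixes groups.
    by_group = defaultdict(list)
    for utt, group in utt2category.items():
        by_group[group].append(utt)

    batches = []
    for group, utts in by_group.items():
        # Longest utterances first, then first-fit into the open bins.
        utts.sort(key=lambda u: durations[u], reverse=True)
        bins = []  # each bin is [total_duration, [utt_ids]]
        for utt in utts:
            for b in bins:
                if b[0] + durations[utt] <= target:
                    b[0] += durations[utt]
                    b[1].append(utt)
                    break
            else:
                # No open bin has room: start a new batch.
                bins.append([durations[utt], [utt]])
        batches.extend((group, ids) for _, ids in bins)

    # Shuffle so that groups are interleaved over training iterations.
    random.Random(seed).shuffle(batches)
    return batches
```

In this sketch, an utterance longer than `target` ends up in a batch of its own, so the target is a soft limit rather than a hard one.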

---

## CTC-DRO implementation

The file [`ctc_dro.py`](./ctc_dro.py) contains the implementation of the CTC-DRO loss function.

**Requirements:**
1. **category2numbatches file:**  
   A file with rows formatted as:  
   `<group_id> <num_batches>`  
   indicating the number of batches for each group.
2. **utt2category file:**  
   A file with rows formatted as:  
   `<data_point_id> <group_id>`  
   mapping each data point to its group.

The `init_weights` method of `DROCTCLoss` loads these files from a specified training directory and initializes the internal state required for group-wise loss aggregation.
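
If your pipeline does not already produce a `category2numbatches` file, it can be derived from the batches and the `utt2category` mapping. The sketch below assumes that batches are available as lists of data point IDs and that each batch contains a single group; `read_utt2category` and `write_category2numbatches` are hypothetical helpers, not part of this repository.

```python
from collections import Counter
from pathlib import Path


def read_utt2category(path):
    """Parse lines of '<data_point_id> <group_id>' into a dict."""
    mapping = {}
    for line in Path(path).read_text().splitlines():
        if line.strip():
            utt, group = line.split(maxsplit=1)
            mapping[utt] = group.strip()
    return mapping


def write_category2numbatches(batches, utt2category, out_path):
    """Count batches per group and write '<group_id> <num_batches>' rows."""
    counts = Counter()
    for batch in batches:
        groups = {utt2category[utt] for utt in batch}
        if len(groups) != 1:
            raise ValueError(f"batch mixes groups: {sorted(groups)}")
        counts[groups.pop()] += 1
    rows = [f"{group} {n}" for group, n in sorted(counts.items())]
    Path(out_path).write_text("\n".join(rows) + "\n")
    return dict(counts)
```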

---

## Usage example

To integrate the CTC-DRO loss into your codebase, follow these steps:

1. **Copy the file:**  
   Copy [`ctc_dro.py`](./ctc_dro.py) into your project directory.

2. **Import and initialize:**  
   Import the `DROCTCLoss` class and initialize it with the required hyperparameters:
   
   ```python
   from ctc_dro import DROCTCLoss

   # Initialize the loss function with your desired settings:
   # - blank: the blank token for CTC loss
   # - zero_infinity: whether to zero out infinite losses
   # - dro_group_count: total number of groups
   # - dro_step_size: step size for updating group weights
   # - dro_q_epsilon: small constant to prevent group weights from reaching zero
   # - smoothing: smoothing parameter for the weight update (set >0 to use smoothing)
   loss_fn = DROCTCLoss(blank=0, zero_infinity=True, dro_group_count=6, dro_step_size=0.0001, dro_q_epsilon=1e-10, smoothing=0.1)

   # Initialize weights using your training and validation directories.
   # These directories should contain 'category2numbatches' and 'utt2category' files.
   loss_fn.init_weights(train_file="path/to/train_dir", valid_file="path/to/valid_dir")
   ```

3. **Compute the loss:**

    When processing a batch, compute the loss as follows:

    ```python
    # log_probs: Log probabilities from your model (Tensor)
    # targets: Target transcript tokens (Tensor)
    # input_lengths: Lengths of input audio for each example (Tensor)
    # target_lengths: Lengths of target transcripts for each example (Tensor)
    # utt_id: List of data point IDs for each example (used to map examples to groups)

    loss = loss_fn(log_probs, targets, input_lengths, target_lengths, utt_id)
    ```

    During training, the function returns the CTC-DRO-scaled loss; during validation, it returns the standard CTC loss.
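
To give a rough intuition for the `dro_step_size`, `dro_q_epsilon`, and `smoothing` hyperparameters, the sketch below shows a generic multiplicative (exponentiated-gradient) update of group weights of the kind used in group DRO. It is not taken from `ctc_dro.py`: the function name, the way `smoothing` enters the exponent, and the flooring with `q_epsilon` are all assumptions made for illustration, and the actual implementation may differ.

```python
import math


def update_group_weights(q, group_losses, step_size, smoothing=0.0, q_epsilon=1e-10):
    """One multiplicative update of group weights `q` given per-group losses.

    Illustrative only. With smoothing == 0 this is a plain exponentiated-gradient
    update. With smoothing > 0, each loss is divided by (q_g + smoothing), which
    damps updates for groups that already carry a large weight (assumed form).
    """
    new_q = []
    for q_g, loss_g in zip(q, group_losses):
        scale = loss_g / (q_g + smoothing) if smoothing > 0 else loss_g
        new_q.append(q_g * math.exp(step_size * scale))

    # Renormalize to a probability distribution.
    total = sum(new_q)
    new_q = [w / total for w in new_q]

    # Keep every weight at or above q_epsilon (assumed), then renormalize again.
    new_q = [max(w, q_epsilon) for w in new_q]
    total = sum(new_q)
    return [w / total for w in new_q]
```

Under these assumptions, groups with higher losses receive larger weights after each update, and a larger `step_size` makes the weights react more strongly.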

--- 

# Usage with ESPNet

We ran our experiments by adding the CTC-DRO loss and batch sampler to the [ESPNet](https://github.com/espnet/espnet) library as a new submodule. This section gives general instructions for using our code inside ESPNet; the full implementation is omitted due to space constraints, but some relevant files are provided in `espnet_files`. A full release of the ESPNet implementation will accompany the camera-ready version of the paper.

## Dataset

We use the [ML-SUPERB 2.0 dataset](https://github.com/espnet/espnet/tree/master/egs2/ml_superb/asr1).

After downloading and extracting the dataset, update the dataset path (the `ML-SUPERB` variable) in `db.sh`.

---

## Configuration

Configuration files for model training and inference are located in the `conf/` directory. We expect that configurations for different language subsets and experimental settings will be organized in separate subfolders. For example, the configuration files for Experiment 1 may be stored in `conf/exp_001/`.

Within the `conf/` directory, you will find example configuration files for the three training approaches:
- **CTC baseline:**  
  - `mms_example_baseline.yaml`  
  - `xlsr_example_baseline.yaml`
- **Group DRO:**  
  - `mms_example_group_dro.yaml`  
  - `xlsr_example_group_dro.yaml`
- **CTC-DRO:**  
  - `mms_example_ctc_dro.yaml`  
  - `xlsr_example_ctc_dro.yaml`

Additionally, please copy the `train_asr.yaml` file into each experiment folder as it contains the configuration for data preprocessing.

Below are example configuration snippets:

**CTC-DRO:**
```yaml
ctc_conf:
    accumulation: true
    agg: sum
    ctc_type: droctc
    dro_group_count: 6
    dro_q_epsilon: 1.0e-10
    dro_step_size: 0.0001
    smoothing: 0.1
    normalize_grad: true
```

**Group DRO:**
```yaml
ctc_conf:
    accumulation: false
    agg: mean
    ctc_type: droctc
    dro_group_count: 6
    dro_q_epsilon: 1.0e-10
    dro_step_size: 0.0001
    smoothing: 0.0
    normalize_grad: false
```

Other training hyperparameters (e.g., `accum_grad`, `batch_size`, `encoder_conf`, `optim_conf`, etc.) are defined within these configuration files. For hyperparameter sweeps, adjust the global variables at the top of `sweep_baseline.py`, `sweep_group_dro.py`, and `sweep_ctc_dro.py`, and then run these scripts to automatically generate new configuration files.
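
A sweep over the CTC-DRO settings amounts to enumerating hyperparameter combinations and emitting one `ctc_conf` block per combination. The sketch below is not the repository's `sweep_ctc_dro.py`: the variable names and output file names are hypothetical, and it renders only the `ctc_conf` section shown above rather than a complete training configuration.

```python
from itertools import product

# Hypothetical sweep settings, kept as strings so file names stay predictable.
STEP_SIZES = ["0.001", "0.0001"]
SMOOTHING_VALUES = ["0.1", "0.5", "1.0"]

CTC_CONF_TEMPLATE = """\
ctc_conf:
    accumulation: true
    agg: sum
    ctc_type: droctc
    dro_group_count: {group_count}
    dro_q_epsilon: 1.0e-10
    dro_step_size: {step_size}
    smoothing: {smoothing}
    normalize_grad: true
"""


def render_ctc_dro_configs(group_count=6):
    """Return a dict mapping a config file name to its ctc_conf YAML text."""
    configs = {}
    for step_size, smoothing in product(STEP_SIZES, SMOOTHING_VALUES):
        name = f"ctc_dro_{step_size}_la_{smoothing}.yaml"
        configs[name] = CTC_CONF_TEMPLATE.format(
            group_count=group_count, step_size=step_size, smoothing=smoothing
        )
    return configs
```

Each returned text would still need to be merged with the remaining training options (for example `encoder_conf` and `optim_conf`) before it is usable as a configuration file.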

---

## Running experiments

Experiments are controlled via Makefiles. Before running any experiments, populate `cluster_info.mk` with:
- `DUMP_DIR_BASE`: Location of preprocessed data files (e.g., `scr/dump`)
- `EXP_DIR_BASE`: Directory to save models (e.g., `scr/exp`)
- `ASR_STATS_DIR_BASE`: Directory containing dataset statistics (typically the same as `EXP_DIR_BASE`)

For each experiment, generate the appropriate Makefile:
- **CTC baseline:** Run `create_makefile_baseline.py` to create `exp001_auto_baseline.mk`
- **Group DRO:** Run `create_makefile_group_dro.py` to create `exp001_auto_group_dro.mk`
- **CTC-DRO:** Run `create_makefile_ctc_dro.py` to create `exp001_auto_ctc_dro.mk`

An example Makefile is provided as `exp001_m.mk`. Include the generated Makefile in the main `Makefile` to run experiments.

### Pre-processing

Pre-process the data before training:
```bash
make preprocess
make preprocess-groups
```

### Training

Supported hyperparameter options include:
- Step sizes: `0.001`, `0.0001`
- Smoothing values: `0.1`, `0.5`, `1.0`

To train MMS or XLS-R models with CTC-DRO:
```bash
make train_asr_mms_aleb_dro_<step-size>_la_<smoothing>
make train_asr_xlsr_aleb_dro_<step-size>_la_<smoothing>
```

For example, with a step size of 0.001 and smoothing of 0.1:
```bash
make train_asr_mms_aleb_dro_0.001_la_0.1
make train_asr_xlsr_aleb_dro_0.001_la_0.1
```

To evaluate these models:
```bash
make eval_asr_mms_aleb_dro_0.001_la_0.1
make eval_asr_xlsr_aleb_dro_0.001_la_0.1
```

To train and evaluate models with Group DRO:
```bash
make train_asr_<model>_aleb_dro_<step-size>_base
make eval_asr_<model>_aleb_dro_<step-size>_base
```

To train and evaluate baseline models, specify the chosen learning rate:
```bash
make train_<model>_ctc_aleb_<learning_rate>
make eval_<model>_ctc_aleb_<learning_rate>
```

Evaluation results are saved in the `results/<EXPERIMENT_ID>/` directory, where `EXPERIMENT_ID` is the experiment identifier set in the Makefile.

### Customization

You can customize the languages for training and evaluation by modifying the `SELECTED_LANGUAGES` and `DATASETS` variables in the Makefile.
For example:
```makefile
SELECTED_LANGUAGES=pol,spa,ces,ron,nan,cmn
DATASETS=M-AILABS,voxforge,commonvoice,fleurs,commonvoice,fleurs
```
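
The example suggests that the two lists are paired by position (the first dataset belongs to the first language, and so on), which would explain the repeated dataset names. Assuming that reading is correct, a quick consistency check before launching a run could look like the following; the helper name is hypothetical and not part of this repository.

```python
def pair_languages_with_datasets(selected_languages, datasets):
    """Split two comma-separated Makefile values and pair them by position."""
    languages = selected_languages.split(",")
    sources = datasets.split(",")
    if len(languages) != len(sources):
        raise ValueError(
            f"{len(languages)} languages but {len(sources)} datasets"
        )
    return list(zip(languages, sources))


pairs = pair_languages_with_datasets(
    "pol,spa,ces,ron,nan,cmn",
    "M-AILABS,voxforge,commonvoice,fleurs,commonvoice,fleurs",
)
for language, dataset in pairs:
    print(f"{language}: {dataset}")
```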
Modify the experiment settings in the Makefile:
```makefile
# Experiment identifier
EXPERIMENT_ID=exp_001
# Data duration (10min or 1h)
DATA_SUBSET=1h
```

---


