# N-best Re-ranking for Multilingual LID+ASR
This project provides N-best re-ranking, a simple inference procedure for improving multilingual automatic speech recognition (ASR) "in the wild", where models are expected to first predict language identity (LID) before transcribing. Our method considers the N-best LID predictions for each utterance, runs ASR in each of the N corresponding languages, and then uses external features over the candidate transcriptions to re-rank them.

The workflow is as follows: 1) run LID+ASR inference (MMS and Whisper are supported), 2) compute external re-ranking features, 3) tune feature coefficients on the dev set, and 4) apply them on the test set.

For more information about our method, please refer to the paper: ["Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking"](https://arxiv.org/abs/2409.18428).

## 1) Commands to Run LID+ASR Inference

### Data Prep
Prepare a text file with one wav file path per line:
```
#/path/to/wav/list
/path/to/audio1.wav
/path/to/audio2.wav
/path/to/audio3.wav
```
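
If your audio lives in a directory tree, the list can be generated with a few lines of Python. This is a minimal sketch, not part of this repository: the function name `write_wav_list` and its arguments are hypothetical, and it assumes all files of interest end in `.wav`.

```python
from pathlib import Path


def write_wav_list(audio_dir, list_path):
    """Write one absolute wav path per line and return the number of files.

    Hypothetical helper: recursively collects *.wav files under audio_dir
    and sorts them so the utterance order is stable across runs.
    """
    wavs = sorted(p.resolve() for p in Path(audio_dir).rglob("*.wav"))
    Path(list_path).write_text("".join(f"{p}\n" for p in wavs))
    return len(wavs)
```

Calling `write_wav_list("/path/to/audio", "/path/to/wav/list")` would then produce a file in the format shown above. Keeping the order stable matters because later steps line up scores and hypotheses with this list.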

The following workflow also assumes that LID and ASR references are available (at least for the dev set). We use [3-letter ISO codes](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l4017_langs.html) for both Whisper and MMS.

Next, run either Whisper-based or MMS-based LID+ASR.

### Whisper
Refer to the [Whisper documentation](https://github.com/openai/whisper) for installation instructions.

First run LID:
```
python whisper/infer_lid.py --wavs "path/to/wav/list" --dst "path/to/lid/results" --model large-v2 --n 10
```
Note that the size of the N-best list is set to 10 here.

Then run ASR, using the top-N LID predictions:
```
python whisper/infer_asr.py --wavs "path/to/wav/list" --lids "path/to/lid/results"/nbest_lid --dst "path/to/asr/results" --model large-v2
```

### MMS
Refer to the [Fairseq documentation](https://github.com/facebookresearch/fairseq/tree/main) for installation instructions.

Prepare data and models following the [instructions from the MMS repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms). Note that the MMS backend expects a slightly different wav list format, which can be obtained via:
```
python mms/format_wav_list.py --src "/path/to/wav/list" --dst "/path/to/wav/manifest.tsv"
```
Note that MMS also expects LID references in a file named `"/path/to/wav/manifest.lang"`.

Then run LID:
```
cd "path/to/fairseq/dir"
PYTHONPATH='.'  python3  examples/mms/lid/infer.py "path/to/dict/dir" --path "path/to/model" --task audio_classification  --infer-manifest "path/to/wav/manifest.tsv" --output-path "path/to/lid/results" --top-k 10
```
Note that the size of the N-best list is set to 10 here.

Then run ASR, using the top-N LID predictions. Since MMS uses language-specific parameters, we've parallelized inference across languages:
```
#Split data by language
python mms/split_by_lang.py --wavs_tsv "/path/to/wav/manifest.tsv" --lid_preds "path/to/lid/results"/predictions.txt --dst "path/to/data/split"

#Write language-specific ASR python commands to an executable file
python mms/make_parallel_single_runs.py --dump "path/to/data/split" --model "path/to/model" --dst "path/to/asr/results" --fairseq_dir "path/to/fairseq/dir" > run.sh

#Running each language sequentially (you can also parallelize this)
. ./run.sh

#Merge language-specific results back to original order
python mms/merge_by_run.py --dump "path/to/data/split" --exp "path/to/asr/results"
```
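
To parallelize the per-language runs instead of sourcing `run.sh` sequentially, one option is a small Python driver. The sketch below is not part of this repository and rests on an assumption: that every non-empty, non-comment line of `run.sh` is an independent command. Check the generated file before relying on it. The function name `run_commands` and the `workers` parameter are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(script_path, workers=4):
    """Run each command line of a script concurrently.

    Assumes one independent shell command per line; blank lines and
    lines starting with '#' are skipped. Returns (command, exit code) pairs
    in the same order as the commands appear in the file.
    """
    with open(script_path) as f:
        cmds = [ln.strip() for ln in f
                if ln.strip() and not ln.lstrip().startswith("#")]

    def run(cmd):
        # shell=True so each line is interpreted as it would be in run.sh
        return subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run, cmds))
    return list(zip(cmds, codes))
```

For example, `run_commands("run.sh", workers=2)` would run two languages at a time. Choose `workers` to match the number of available devices, and inspect the returned exit codes for failed languages before merging.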

## 2) Commands to Compute External Re-ranking Features

### MaLA - Large Language Model
```
python mala/infer.py --txt "path/to/asr/results"/nbest_asr_hyp --dst "path/to/lm/results"
```

### NLLB - Written LID Model
Download the model from the [official source](https://github.com/facebookresearch/fairseq/tree/nllb#lid-model).

```
python nllb/infer.py --txt "path/to/asr/results"/nbest_asr_hyp --dst "path/to/wlid/results" --model "path/to/nllb/model"
```

### MMS-Zeroshot - U-roman Acoustic Model
Download the model from the [official source](https://huggingface.co/spaces/mms-meta/mms-zeroshot/tree/main).

First run u-romanization on the N-best ASR hypotheses:
```
python mms-zs/uromanize.py --txt "path/to/asr/results"/nbest_asr_hyp --lid "path/to/lid/results"/nbest_lid --dst "path/to/uasr/results" --model "path/to/mms-zeroshot"
```

Then compute the forced alignment score using the MMS-Zeroshot model:
```
python mms-zs/falign.py --uroman_txt "path/to/uasr/results"/nbest_asr_hyp_uroman --wav "path/to/wav/list" --dst "path/to/uasr/results" --model "path/to/mms-zeroshot"
```

## 3) Commands to Tune Feature Coefficients
```
python rerank/tune_coefficients.py --slid "path/to/lid/results"/slid_score --asr "path/to/asr/results"/asr_score --wlid "path/to/wlid/results"/wlid_score --lm "path/to/lm/results"/lm_score --uasr "path/to/uasr/results"/uasr_score --dst "path/to/rerank/results" --ref_lid "ground-truth/lid" --nbest_lid "path/to/lid/results"/nbest_lid --ref_asr "ground-truth/asr" --nbest_asr "path/to/asr/results"/nbest_asr_hyp
```
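
To give an intuition for what tuning involves, the sketch below searches for coefficients that minimize LID errors on a dev set. It is illustrative only: the search strategy (random search), the objective (LID error count), the data layout, and all function names are assumptions, not a description of how `rerank/tune_coefficients.py` works.

```python
import random


def pick(cands, w):
    """Index of the candidate with the highest weighted feature sum."""
    scores = [sum(w[k] * c["feats"][k] for k in w) for c in cands]
    return scores.index(max(scores))


def lid_errors(dev, w):
    """Number of dev utterances whose top-ranked language is wrong."""
    return sum(u["cands"][pick(u["cands"], w)]["lid"] != u["ref_lid"]
               for u in dev)


def tune(dev, names, trials=200, seed=0):
    """Random search over coefficients in [0, 2], starting from all ones."""
    rng = random.Random(seed)
    best_w = {k: 1.0 for k in names}
    best_err = lid_errors(dev, best_w)
    for _ in range(trials):
        w = {k: rng.uniform(0.0, 2.0) for k in names}
        err = lid_errors(dev, w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```

Here each dev utterance is a dict with a reference language `ref_lid` and a list of candidates `cands`, each carrying its language and per-feature scores (for instance `slid`, `asr`, `wlid`, `lm`, `uasr`). An ASR-based objective such as word error rate could be substituted for `lid_errors` in the same loop.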

## 4) Commands to Apply on Test Set
```
python rerank/rerank.py --slid "path/to/lid/results"/slid_score --asr "path/to/asr/results"/asr_score --wlid "path/to/wlid/results"/wlid_score --lm "path/to/lm/results"/lm_score --uasr "path/to/uasr/results"/uasr_score --dst "path/to/rerank/results" --ref_lid "ground-truth/lid" --nbest_lid "path/to/lid/results"/nbest_lid --ref_asr "ground-truth/asr" --nbest_asr "path/to/asr/results"/nbest_asr_hyp --w "path/to/rerank/results"/best_coefficients
```

The re-ranked LID predictions and ASR hypotheses are written to `"path/to/rerank/results"/reranked_1best_lid` and `"path/to/rerank/results"/reranked_1best_asr_hyp` respectively.
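
Conceptually, re-ranking scores every N-best candidate by combining its features with the tuned coefficients and keeps the highest-scoring one. The following minimal sketch assumes a weighted sum of feature scores. The function name, data layout, and numeric values are made up for illustration and do not reflect the interface of `rerank/rerank.py`.

```python
def rerank(candidates, weights):
    """Return the candidate with the highest weighted feature sum.

    candidates: list of dicts with "lid", "hyp", and "feats" (name -> score).
    weights: dict mapping feature name -> coefficient (illustrative values).
    """
    def score(c):
        return sum(weights[name] * c["feats"][name] for name in weights)

    return max(candidates, key=score)


# Two hypothetical candidates for one utterance, with made-up scores.
nbest = [
    {"lid": "eng", "hyp": "hello world",
     "feats": {"slid": -0.2, "asr": -1.5, "lm": -3.0}},
    {"lid": "deu", "hyp": "hallo welt",
     "feats": {"slid": -1.8, "asr": -1.0, "lm": -2.0}},
]
weights = {"slid": 1.0, "asr": 1.0, "lm": 0.5}
best = rerank(nbest, weights)
print(best["lid"], best["hyp"])  # eng hello world
```

Because the language and the transcription are selected together, a candidate that was not the top LID prediction can still win if its transcription scores well under the other features.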

# Citation
```
@article{yan2024wild,
  title={Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking},
  author={Brian Yan and Vineel Pratap and Shinji Watanabe and Michael Auli},
  journal={arXiv preprint arXiv:2409.18428},
  year={2024}
}
```
