Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models

ACL ARR 2024 June Submission5615 Authors

16 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: This paper introduces the $\textbf{De}$coder-$\textbf{C}$entric $\textbf{R}$egularisation in $\textbf{E}$ncoder-$\textbf{D}$ecoder (DeCRED) architecture for automatic speech recognition, in which auxiliary classifiers are introduced in the layers of the decoder module. Leveraging these classifiers, we propose two decoding strategies that re-estimate the next-token probabilities. Pilot experiments on independent in-domain datasets identify suitable placement and weighting of the auxiliary classifiers, yielding a consistent word-error-rate (WER) reduction of up to 9% relative across different model sizes. Further experiments on a collection of multi-domain English datasets show that DeCRED achieves WERs competitive with Whisper-medium and outperforms OWSM v3, while relying on only a fraction of the training data and model size. Finally, we study the generalisation capabilities of DeCRED on out-of-domain datasets, where we show absolute WER reductions of 2.7 and 2.9 on the AMI and GigaSpeech datasets, respectively.
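The sketch below illustrates the core idea in PyTorch, assuming a standard Transformer decoder: intermediate decoder layers get auxiliary classifier heads, their cross-entropy losses are added to the main loss with a weight, and at decode time the auxiliary and final distributions can be combined to re-estimate next-token probabilities. The layer placement (`aux_layers`), loss weight (`aux_weight`), and the log-linear combination in `reestimate` are illustrative assumptions, not the paper's tuned configuration or its exact two decoding strategies.

```python
import torch.nn as nn
import torch.nn.functional as F


class DeCREDDecoder(nn.Module):
    """Transformer decoder with auxiliary classifiers on intermediate layers.

    Hypothetical sketch: `aux_layers` and `aux_weight` are illustrative
    placeholders, not the values selected in the paper's pilot experiments.
    """

    def __init__(self, num_layers=6, d_model=512, nhead=8, vocab_size=5000,
                 aux_layers=(3,), aux_weight=0.3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        ])
        self.final_head = nn.Linear(d_model, vocab_size)
        # One auxiliary classifier per selected intermediate layer.
        self.aux_heads = nn.ModuleDict({
            str(i): nn.Linear(d_model, vocab_size) for i in aux_layers
        })
        self.aux_weight = aux_weight

    def forward(self, tgt, memory, tgt_mask=None):
        """Returns final-layer logits plus logits from each auxiliary head."""
        aux_logits = []
        x = tgt
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x, memory, tgt_mask=tgt_mask)
            if str(i) in self.aux_heads:
                aux_logits.append(self.aux_heads[str(i)](x))
        return self.final_head(x), aux_logits


def decred_loss(final_logits, aux_logits, targets, aux_weight=0.3, pad_id=0):
    """Cross-entropy on the final head plus weighted CE on each auxiliary head."""
    def ce(logits):
        # F.cross_entropy expects (batch, vocab, time) for sequence targets.
        return F.cross_entropy(logits.transpose(1, 2), targets,
                               ignore_index=pad_id)
    loss = ce(final_logits)
    for logits in aux_logits:
        loss = loss + aux_weight * ce(logits)
    return loss


def reestimate(final_logits, aux_logits, lam=0.5):
    """One possible re-estimation: log-linear interpolation of the final and
    auxiliary next-token distributions (an assumed illustration; the result is
    unnormalised, which suffices for ranking hypotheses in beam search)."""
    logp = F.log_softmax(final_logits, dim=-1)
    for a in aux_logits:
        logp = logp + lam * F.log_softmax(a, dim=-1)
    return logp
```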
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, speech technologies
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5615