Joint Token-level and Phrase-level Contextual Biasing for Automatic Speech Recognition with Large Language Models

ACL ARR 2024 December Submission 1292 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract:

End-to-end Automatic Speech Recognition (ASR) models often struggle to accurately transcribe contextually relevant keywords, such as proper nouns or user-specific entities. Existing approaches leverage large language models (LLMs) to improve keyword recognition through token-level or phrase-level biasing. However, token-level approaches struggle to generate keyword phrases as complete units, while phrase-level approaches may compromise the accuracy of non-keyword transcriptions. To overcome these limitations, we propose a novel joint approach that integrates token-level and phrase-level biasing, leveraging their complementary strengths. Our approach incorporates LLMs through a late-fusion mechanism, combining ASR and LLM outputs at both the token and phrase levels. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Ablation studies also confirm that both the token-level and phrase-level components contribute significantly to the improvement, complementing each other in our joint approach. The code and models will be publicly available at \url{https://github.com/}.
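The abstract does not spell out the fusion equations, so the following is only a minimal Python sketch of the general idea: shallow-fusion-style interpolation of ASR and LLM token log-probabilities, plus a phrase-level bonus for hypotheses that contain a keyword phrase in full. All function names, weights (`lam`, `bonus`), and toy distributions below are hypothetical illustrations, not the authors' implementation.

```python
import math

def fuse_token_scores(asr_logprobs, llm_logprobs, lam=0.4):
    """Token-level late fusion: interpolate per-token log-probabilities
    from the ASR model and the LLM (shallow-fusion style)."""
    fused = {}
    for tok, asr_lp in asr_logprobs.items():
        llm_lp = llm_logprobs.get(tok, math.log(1e-9))  # floor for tokens the LLM never scored
        fused[tok] = (1 - lam) * asr_lp + lam * llm_lp
    return fused

def score_hypothesis(token_score_sum, hypothesis_text, keywords, bonus=2.0):
    """Phrase-level biasing: reward a complete hypothesis for each keyword
    phrase it contains in full, so partial keyword matches earn no credit."""
    return token_score_sum + sum(bonus for kw in keywords if kw in hypothesis_text)

if __name__ == "__main__":
    # Toy next-token log-prob distributions at one decoding step.
    asr = {"John": math.log(0.4), "Jon": math.log(0.5), "one": math.log(0.1)}
    llm = {"John": math.log(0.7), "Jon": math.log(0.2), "one": math.log(0.1)}
    fused = fuse_token_scores(asr, llm)
    print(max(fused, key=fused.get))  # LLM context can flip "Jon" -> "John"

    # Toy phrase-level rescoring over two complete hypotheses.
    hyps = {"call john smith now": -5.2, "call jon smith now": -5.0}
    best = max(hyps, key=lambda h: score_hypothesis(hyps[h], h, ["john smith"]))
    print(best)  # the keyword bonus promotes the hypothesis with the full phrase
```

In this toy setup the token-level fusion nudges individual decoding steps toward LLM-preferred spellings, while the phrase-level bonus acts on whole hypotheses, which is one plausible reading of how the two biasing levels could complement each other.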

Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, speech technologies
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
Submission Number: 1292