Joint Token-level and Phrase-level Contextual Biasing for Automatic Speech Recognition with Large Language Models
Abstract: End-to-end Automatic Speech Recognition (ASR) models often face challenges in accurately transcribing contextually relevant keywords, such as proper nouns or user-specific entities.
Existing approaches leverage large language models (LLMs) to improve keyword recognition through token-level or phrase-level biasing.
However, token-level approaches struggle to ensure holistic generation of keyword phrases, while phrase-level approaches may compromise the accuracy of non-keyword transcriptions.
To overcome these limitations, we propose a novel joint approach that integrates token-level and phrase-level biasing, leveraging their complementary strengths.
Our approach incorporates LLMs using a late-fusion mechanism, combining ASR and LLM outputs at both token and phrase levels.
Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text.
Ablation studies further confirm that the token-level and phrase-level components each contribute significantly to the improvement, complementing one another in our joint approach.
The code and models will be publicly available at https://github.com/.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, speech technologies
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
Submission Number: 1292