Joint Token-level and Phrase-level Contextual Biasing for Automatic Speech Recognition with Large Language Models
End-to-end Automatic Speech Recognition (ASR) models often struggle to accurately transcribe contextually relevant keywords, such as proper nouns or user-specific entities. Existing approaches leverage large language models (LLMs) to improve keyword recognition through token-level or phrase-level biasing. However, token-level approaches struggle to ensure holistic generation of keyword phrases, while phrase-level approaches may compromise the accuracy of non-keyword transcriptions. To overcome these limitations, we propose a novel joint approach that integrates token-level and phrase-level biasing, leveraging their complementary strengths. Our approach incorporates LLMs via a late-fusion mechanism, combining ASR and LLM outputs at both the token and phrase levels. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Ablation studies further confirm that the token-level and phrase-level components each contribute significantly to the improvement, complementing each other in the joint approach. The code and models will be publicly available at \url{https://github.com/}.
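To make the two biasing mechanisms concrete, the following is a minimal sketch, not the paper's actual implementation: token-level late fusion interpolates ASR and LLM log-probabilities with a hypothetical weight `lam`, while phrase-level biasing adds a fixed reward to hypotheses that contain a complete keyword phrase. All function names, the interpolation weight, and the bonus value are illustrative assumptions.

```python
import math

def token_level_fusion(asr_logprobs, llm_logprobs, lam=0.3):
    """Token-level late fusion: interpolate ASR and LLM log-probabilities.

    asr_logprobs / llm_logprobs: dicts mapping candidate tokens to log-probs.
    lam: hypothetical interpolation weight on the LLM score (assumption).
    Tokens the LLM never scores get -inf from the LLM side.
    """
    return {
        tok: (1 - lam) * asr_logprobs[tok] + lam * llm_logprobs.get(tok, -math.inf)
        for tok in asr_logprobs
    }

def phrase_level_bonus(hypothesis, keywords, bonus=2.0):
    """Phrase-level biasing: reward hypotheses containing a full keyword
    phrase (a simplified stand-in for the paper's phrase-level mechanism)."""
    return sum(bonus for kw in keywords if kw in hypothesis)

# Toy example: the LLM pulls the fused score toward the contextually
# plausible token, and the phrase bonus rewards the completed entity.
asr = {"cat": math.log(0.6), "cap": math.log(0.4)}
llm = {"cat": math.log(0.9), "cap": math.log(0.1)}
fused = token_level_fusion(asr, llm, lam=0.5)
```

In a joint scheme along these lines, the fused token scores guide beam search step by step, and the phrase bonus rescores hypotheses once a keyword phrase is fully generated, so neither mechanism alone has to carry the biasing.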