Abstract: Automatic Speech Recognition (ASR) models pre-trained on large-scale speech datasets have achieved significant breakthroughs over traditional methods.
However, mainstream pre-trained ASR models still struggle to distinguish homophones, words with close or identical pronunciations.
Previous studies have introduced visual auxiliary cues to address this challenge, yet even sophisticated use of lip movements falls short of correcting homophone errors.
Meanwhile, the fusion and utilization of scene images remain at an exploratory stage, with performance still inferior to that of pre-trained speech models.
In this paper, we introduce Contextual Image-Enhanced Automatic Speech Recognition (CIEASR), a novel multimodal speech recognition model that incorporates a new cue fusion method, using scene images as soft prompts to correct homophone errors.
To mitigate data scarcity, we refine and expand the VSDial dataset; extensive experiments on it show that scene images contribute to the accurate recognition of entity nouns and personal pronouns.
Our proposed CIEASR achieves state-of-the-art results on VSDial and Flickr8K, significantly reducing the Character Error Rate (CER) on VSDial from 3.61\% to 0.92\%.
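For readers unfamiliar with soft-prompt fusion, the sketch below illustrates the general idea in PyTorch: pooled scene-image features are projected into the ASR encoder's embedding space and prepended as prompt tokens. The `SoftPromptFusion` module, its dimensions, and the pooled-feature interface are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SoftPromptFusion(nn.Module):
    """Hypothetical sketch: project scene-image features into the ASR
    encoder's embedding space and prepend them as soft prompt tokens."""

    def __init__(self, image_dim=768, asr_dim=512, num_prompts=4):
        super().__init__()
        # Map pooled image features to `num_prompts` prompt vectors.
        self.proj = nn.Linear(image_dim, num_prompts * asr_dim)
        self.num_prompts = num_prompts
        self.asr_dim = asr_dim

    def forward(self, image_feats, speech_embeds):
        # image_feats:   (batch, image_dim)   pooled visual features
        # speech_embeds: (batch, time, asr_dim) ASR encoder inputs
        prompts = self.proj(image_feats).view(-1, self.num_prompts, self.asr_dim)
        # Prepend the visual soft prompts to the speech sequence so the
        # encoder can attend to scene context when resolving homophones.
        return torch.cat([prompts, speech_embeds], dim=1)

# Usage with dummy tensors standing in for real encoders:
fusion = SoftPromptFusion()
img = torch.randn(2, 768)          # e.g., pooled image embedding
speech = torch.randn(2, 100, 512)  # e.g., pre-trained ASR frame embeddings
fused = fusion(img, speech)        # shape: (2, 104, 512)
```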
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: We introduce a novel multimodal speech recognition model that involves three modalities: images, speech, and text. Our primary goal is to tackle homophone discrimination, a task that single-modality speech systems struggle with. Traditional image-enhanced speech recognition models rely mainly on strictly aligned lip movements, which limits their applicability in pre-trained paradigms. Although some studies have begun to explore scene images, their performance does not yet match that of pre-trained ASR models. Contextual images can provide semantic information about conversational scenarios; our research investigates whether they can further enhance the performance of pre-trained ASR models and correct homophone confusion errors. We demonstrate their practical value in distinguishing homophones and achieve state-of-the-art results on two tri-modally aligned datasets, VSDial and Flickr8K.
Supplementary Material: zip
Submission Number: 5419