Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation
Abstract: This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing real speech.
This task falls under the umbrella of articulatory-to-acoustic (A2A) conversion and may also be referred to as a silent speech interface.
To overcome the domain discrepancy between silent and standard vocalized articulation, we introduce a novel pseudo target generation strategy.
It integrates the text modality to align with articulatory movements, thereby guiding the generation of pseudo acoustic features for supervised training on speech reconstruction from silent articulation.
Furthermore, we propose to employ a denoising diffusion probabilistic model as the fundamental architecture for the A2A conversion task and train the model using a combined training approach with the generated pseudo acoustic features.
Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in the silent speaking mode compared to all baseline methods.
Specifically, the word error rate of the reconstructed speech decreases by approximately 5% when measured using an automatic speech recognition engine for intelligibility assessment, and the subjective mean opinion score for naturalness improves by 0.14.
Moreover, analytical experiments reveal that the proposed pseudo target generation strategy can generate pseudo acoustic features that synchronize better with articulatory movements than previous strategies.
Samples are available at our project page.
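A rough illustration of the diffusion-based backbone mentioned in the abstract is sketched below: one denoising-diffusion training step in which a conditional denoiser regresses the Gaussian noise added to a (possibly pseudo) mel-spectrogram target, conditioned on articulatory representations. All module names, dimensions, and the noise schedule here are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sketch only: dimensions, schedule, and module design are
# assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramDenoiser(nn.Module):
    """Predicts the noise added to a mel-spectrogram, conditioned on
    per-frame articulatory representations and the diffusion timestep."""
    def __init__(self, mel_dim=80, cond_dim=256, hidden_dim=256, num_steps=1000):
        super().__init__()
        self.step_emb = nn.Embedding(num_steps, hidden_dim)
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, mel_dim),
        )

    def forward(self, noisy_mel, cond, t):
        # Broadcast the timestep embedding over the time axis.
        t_emb = self.step_emb(t)[:, None, :].expand(-1, noisy_mel.size(1), -1)
        return self.net(torch.cat([noisy_mel, cond, t_emb], dim=-1))

def ddpm_step(denoiser, cond, target_mel, betas):
    """One DDPM training step: corrupt the target mel-spectrogram with
    Gaussian noise at a random timestep and regress that noise."""
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, betas.numel(), (target_mel.size(0),), device=target_mel.device)
    a_bar = alphas_cumprod[t][:, None, None]
    noise = torch.randn_like(target_mel)
    noisy_mel = a_bar.sqrt() * target_mel + (1.0 - a_bar).sqrt() * noise
    return F.mse_loss(denoiser(noisy_mel, cond, t), noise)
```

At inference, the same denoiser would be applied iteratively, starting from random noise and conditioned on the articulatory representations, to recover the acoustic features step by step.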
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work involves the following modalities: speech, ultrasound tongue images, lip videos, and text. It contributes to multimedia/multimodal processing by advancing speech reconstruction from ultrasound tongue images and lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing audible speech. Previous research has either concentrated on speech reconstruction in the standard vocalized mode or encountered challenges in the silent mode. To overcome these challenges, we introduce a novel pseudo target generation strategy, named the dubbing strategy. It integrates text information to align with articulatory movements, thereby guiding the generation of pseudo acoustic features for supervised training on speech reconstruction from silent articulation. Furthermore, we propose a diffusion-based articulatory-to-acoustic conversion architecture, comprising an articulation encoder that transforms lip videos and ultrasound tongue images into hidden articulatory representations, and a diffusion-based spectrogram denoiser that synthesizes acoustic features from random noise step by step, conditioned on these hidden representations. The proposed model is trained on the pseudo acoustic features generated by the dubbing strategy using a combined training approach. Our proposed method significantly enhances the naturalness and intelligibility of speech reconstructed in the silent speaking mode, marking a substantial advancement in multimodal speech reconstruction.
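The following sketch shows how the articulation encoder and the combined training with pseudo acoustic targets described above might be wired together. The encoder layout, feature dimensions, and batch handling are assumptions for illustration, and `SpectrogramDenoiser` and `ddpm_step` refer to the sketch after the abstract.

```python
# Illustrative sketch only: the encoder layout and combined-training loop are
# assumptions, not the authors' exact implementation. Reuses ddpm_step and
# SpectrogramDenoiser from the earlier sketch.
import torch
import torch.nn as nn

class ArticulationEncoder(nn.Module):
    """Maps per-frame lip-video and ultrasound-tongue features to hidden
    articulatory representations used to condition the denoiser."""
    def __init__(self, lip_dim=256, tongue_dim=256, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(lip_dim + tongue_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, lip_feats, tongue_feats):
        x = self.proj(torch.cat([lip_feats, tongue_feats], dim=-1))
        x, _ = self.rnn(x)
        return self.out(x)  # (batch, frames, hidden_dim)

def combined_training_step(encoder, denoiser, betas, optimizer,
                           vocalized_batch, silent_batch):
    """One combined step: vocalized utterances are supervised with their
    recorded mel-spectrograms, silent utterances with text-guided pseudo mels.
    Each batch is assumed to be a (lip_feats, tongue_feats, mel) triple."""
    optimizer.zero_grad()
    loss = 0.0
    for lips, tongue, mel in (vocalized_batch, silent_batch):
        cond = encoder(lips, tongue)
        loss = loss + ddpm_step(denoiser, cond, mel, betas)
    loss.backward()
    optimizer.step()
    return loss.item()
```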
Submission Number: 1829