Leveraging Self-Supervised Audio-Visual Pretrained Models to Improve Vocoded Speech Intelligibility in Cochlear Implant Simulation

Richard Lee Lai, Jen-Cheng Hou, I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, Mandar Gogate, Tughrul Arslan, Amir Hussain, Chii-Wann Lin, Yu Tsao

Published: 2026, Last Modified: 28 May 2026. IEEE Trans. Biomed. Eng. 2026. License: CC BY-SA 4.0
Abstract: Objective: Individuals with hearing impairments face challenges in their ability to comprehend speech, particularly in noisy environments. This study explores the effectiveness of audio-visual speech enhancement (AVSE) in improving the intelligibility of vocoded speech in cochlear implant (CI) simulations. Methods: We propose a speech enhancement framework called Self-Supervised Learning-based AVSE (SSL-AVSE), which uses visual cues such as lip and mouth movements along with corresponding speech. Features are extracted using the AV-HuBERT model and refined through a bidirectional LSTM. Experiments were conducted using the Taiwan Mandarin speech with video (TMSV) dataset. Results: Objective evaluations showed improvements in PESQ from 1.43 to 1.67 and in STOI from 0.70 to 0.74. NCM scores increased by up to 87.2% over the noisy baseline. Subjective listening tests further demonstrated maximum gains of 45.2% in speech quality and 51.9% in word intelligibility. Conclusion: SSL-AVSE consistently outperforms audio-only speech enhancement (AOSE) and conventional AVSE baselines. Listening tests with statistically significant results confirm its effectiveness. In addition to its strong performance, SSL-AVSE demonstrates cross-lingual generalization: although it was pretrained on English data, it performs effectively on Mandarin speech. This finding highlights the robustness of the features extracted by a pretrained foundation model and their applicability across languages. Significance: To the best of our knowledge, no prior work has explored the application of AVSE to CI simulations. This study provides the first evidence that incorporating visual information can significantly improve the intelligibility of vocoded speech in CI scenarios.