Quantifying The Effect Of Simulator-Based Data Augmentation For Speech Recognition On Augmented Reality Glasses

Published: 01 Jan 2024, Last Modified: 30 Sept 2024 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Augmented reality (AR) glasses have immense potential for enhancing conversations by leveraging speech recognition to display real-time transcription or translation, for example to assist people with hearing impairments or people conversing in a non-native language. For deployment in real environments, however, such systems need to be able to separate the speech of interest from noise and other speakers. In this paper, we evaluate the effectiveness of leveraging a room simulator to generate large amounts of simulated training data for such front-end sound separation models, complementing the ideal, but costly, collection of real-world data recorded on the device. Using both recorded and simulated impulse responses (IRs), we demonstrate that simulation data is an effective method for training models that ultimately enhance speech recognition performance in real-world settings. Furthermore, we show that performance can be improved further by adding microphone directivity to the room simulation and by fusing synthetic data with a small amount of real IRs. Our results also suggest that existing room simulators would benefit from incorporating the head shadow effect, given its significant impact on multi-microphone recordings on AR glasses.
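The core augmentation idea described above, convolving clean speech with a (recorded or simulated) room impulse response and mixing in noise at a target SNR, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `synth_ir` uses exponentially decaying noise as a crude stand-in for a proper room simulator, and all signal names are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

def synth_ir(rt60=0.4, fs=16000, seed=0):
    """Crude stand-in for a simulated room IR: exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    n = int(rt60 * fs)
    t = np.arange(n) / fs
    decay = np.exp(-6.91 * t / rt60)   # reaches -60 dB at t = rt60
    ir = rng.standard_normal(n) * decay
    return ir / np.max(np.abs(ir))

def augment(dry, ir, noise, snr_db=10.0):
    """Convolve dry speech with the IR, then add noise at the target SNR."""
    wet = fftconvolve(dry, ir)[: len(dry)]       # reverberant speech
    sig_pow = np.mean(wet ** 2)
    noise_pow = np.mean(noise[: len(wet)] ** 2)
    # Scale the noise so that 10*log10(sig_pow / noise_pow_scaled) == snr_db
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return wet + gain * noise[: len(wet)]

fs = 16000
rng = np.random.default_rng(1)
dry = rng.standard_normal(fs)      # 1 s of placeholder "speech"
noise = rng.standard_normal(fs)    # placeholder interference
ir = synth_ir(fs=fs)
mix = augment(dry, ir, noise, snr_db=10.0)
```

In a real pipeline, `dry` would be clean speech, `ir` would come from the room simulator (one IR per microphone, optionally shaped by microphone directivity and head shadow), and the resulting multi-channel mixtures would be used to train the separation front-end.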