Audio-Visual ASR from Multiple Views inside Smart Rooms

Published: 01 Jan 2006, Last Modified: 11 Apr 2025MFI 2006EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face, which is not always the case in realistic human-computer interaction (HCI) scenarios. One such case of interest is HCI inside smart rooms, equipped with pan-tilt-zoom (PTZ) cameras that closely track the subject's head. Since however these cameras are fixed in space, they cannot necessarily obtain frontal views of the speaker. Clearly, AVASR from non-frontal views is required, as well as fusion of multiple camera views, if available. In this paper, we report our very preliminary work on this subject. In particular, we concentrate on two topics: first, the design of an AVASR system that operates on profile face views and its comparison with a traditional frontal-view AVASR system, and second, the fusion of the two systems into a multi-view frontal/profile system. We in particular describe our visual front end approach for the profile view system, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video, recorded inside the IBM smart room as part of the CHIL project. Our experiments demonstrate that AVASR is possible from profile views, however the visual modality benefit is decreased compared to frontal video data
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview