Energy and Computation Efficient Audio-Visual Voice Activity Detection Driven by Event-Cameras

Arman Savran, Raffaele Tavarone, Bertrand Higy, Leonardo Badino, Chiara Bartolozzi

30 Jan 2020, OpenReview Archive Direct Upload
Abstract: We propose a novel method for computationally efficient audio-visual voice activity detection (VAD) in which visual temporal information is provided by an energy-efficient event-camera (EC). Unlike conventional cameras, ECs perform on-chip, low-power, pixel-level change detection, adapting the sampling frequency to the dynamics of the visual scene and removing redundancy, hence enabling energy and computational efficiency. In our VAD pipeline, lip activity is first jointly located and detected by probabilistic estimation after spatio-temporal filtering. Then, over the lip region, a feather-weight speech-related lip-motion detector, tuned for a minimal false negative rate, activates a highly accurate but computationally expensive acoustic deep neural network (DNN)-based VAD. Our experiments show that ECs are accurate at detecting and locating lip activity, and that EC-driven VAD can yield considerable computational savings as well as substantially reduce false positive rates in low acoustic signal-to-noise ratio conditions.
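The cascaded gating at the core of the pipeline can be illustrated with a short sketch. The snippet below is not the authors' implementation: every name, threshold, and the stand-in energy-based acoustic detector is a hypothetical placeholder, used only to show how a cheap event-driven lip-motion check might gate a more expensive acoustic VAD.

```python
import numpy as np

# Hypothetical threshold; the paper tunes its visual stage for a
# near-zero false negative rate, but the exact criterion is not given here.
LIP_MOTION_THRESHOLD = 0.1

def lip_motion_score(event_slice: np.ndarray) -> float:
    """Feather-weight visual check: fraction of event-camera pixels in the
    lip region that fired during the current time window."""
    return float(np.count_nonzero(event_slice)) / event_slice.size

def acoustic_dnn_vad(audio_frame: np.ndarray) -> bool:
    """Stand-in for the expensive acoustic DNN-based VAD.
    A real system would run a trained network here; this placeholder
    uses a simple energy threshold instead."""
    return float(np.mean(audio_frame ** 2)) > 1e-3

def gated_vad(event_slice: np.ndarray, audio_frame: np.ndarray) -> bool:
    """Event-driven gating: only invoke the costly acoustic VAD when the
    cheap lip-motion detector sees speech-related activity."""
    if lip_motion_score(event_slice) < LIP_MOTION_THRESHOLD:
        return False  # no lip motion -> skip the DNN, saving computation
    return acoustic_dnn_vad(audio_frame)

# Example usage with synthetic data for one short window.
events = (np.random.rand(64, 64) > 0.95).astype(np.uint8)  # sparse EC events over the lips
audio = 0.02 * np.random.randn(160)                        # 10 ms of audio at 16 kHz
print(gated_vad(events, audio))
```

Tuning the visual threshold for a minimal false negative rate, as the abstract describes, keeps the gate from suppressing true speech while still skipping the DNN during long stretches without lip motion.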