13.1 A 0.22mm2 161nW Noise-Robust Voice-Activity Detection Using Information-Aware Data Compression and Neuromorphic Spatial-Temporal Feature Extraction

Published: 01 Jan 2025 · Last Modified: 13 May 2025 · ISSCC 2025 · CC BY-SA 4.0
Abstract: Voice-activity detection (VAD), typically consisting of a feature extractor (FE) and an intelligent engine (IE), is crucial for reducing the power consumption of a voice processing system (VPS) (Fig. 13.1.1 top). The always-on VAD normally dominates the power consumption while the VPS remains inactive [1]. Therefore, the VAD must meet stringent power-consumption requirements to extend the battery life of artificial-intelligence-of-things (AIoT) devices. Additionally, excellent inference accuracy and noise robustness are essential for practical use. [1] proposed an analog FE with binary NNs (BNNs), consuming $1\mu\mathrm{W}$ but achieving only 85% accuracy. [2] presented analog convolutional neural networks (CNNs) to reduce the power of the IE (108nW), but it achieved only about 90% accuracy in high signal-to-noise ratio (SNR) (>10dB) scenarios. Recently, bio-inspired spike-based methods [4]-[5] have shown promise for ultra-low-power (ULP) intelligent processing across various scenarios. [5] converted Mel-frequency cepstral coefficient (MFCC) features into spikes and trained two fully-connected spiking NN (SNN) layers, achieving >90% VAD accuracy across 0-5dB SNR. However, generating spikes from static MFCC features incurs excessive power consumption. [4] demonstrated 90% accuracy for ECG classification with 82nW power consumption. However, the level-crossing coding in [4] generates more spikes when processing voice, whose frequency content is much higher than that of ECG, resulting in higher power consumption. To the best of our knowledge, a VAD system that simultaneously achieves ultra-low power consumption, excellent accuracy, and noise robustness has not yet been presented.
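The abstract's argument against reusing the level-crossing coding of [4] for voice is that spike count scales with how fast the signal traverses amplitude levels, so a high-frequency voice signal fires far more spikes than a slow ECG waveform. The sketch below illustrates this with a generic level-crossing encoder applied to two synthetic sine waves (the encoder, signal frequencies, and threshold `delta` are illustrative assumptions, not details from the paper):

```python
import math

def level_crossing_spikes(samples, delta):
    """Generic level-crossing encoder (illustrative): emit a spike each
    time the signal moves by `delta` from the last crossed level.
    Returns a list of (sample_index, +1/-1) spike events."""
    spikes = []
    last_level = samples[0]
    for i, x in enumerate(samples[1:], start=1):
        while x - last_level >= delta:       # upward level crossings
            last_level += delta
            spikes.append((i, +1))
        while last_level - x >= delta:       # downward level crossings
            last_level -= delta
            spikes.append((i, -1))
    return spikes

fs = 8000                                    # assumed sampling rate (Hz)
t = [n / fs for n in range(fs)]              # one second of samples
ecg_like = [math.sin(2 * math.pi * 1.3 * x) for x in t]    # ~1.3Hz, ECG-like
voice_like = [math.sin(2 * math.pi * 300 * x) for x in t]  # ~300Hz, voice-like

n_ecg = len(level_crossing_spikes(ecg_like, 0.1))
n_voice = len(level_crossing_spikes(voice_like, 0.1))
print(n_ecg, n_voice)  # the faster signal crosses levels far more often
```

Since spike count tracks the signal's total variation (roughly 4 x amplitude x frequency per second for a sine), the voice-band tone produces orders of magnitude more spikes than the ECG-like tone at the same threshold, which is why event-driven power grows with input bandwidth.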