Short-Duration Speaker Verification by Joint Filter Superposition-Based Multi-Dimensional Central Difference Feature Extraction and Res2Block-Based Bidirectional Sampling

Published: 2024, Last Modified: 30 Jul 2025, IEEE Trans. Consumer Electron. 2024, CC BY-SA 4.0
Abstract: Short utterances contain little speech, which makes it difficult to learn sufficiently discriminative speaker information. To address this issue at the acoustic end, we propose a Bark-scaled Gaussian and linear filter bank superposition acoustic feature extraction method (BGLCC) and a multi-dimensional central difference dynamic feature extraction method (MDCD). The Bark-scaled Gaussian filter bank focuses on low-frequency information, while the linear filter bank has a higher filter density in the high-frequency region of speech than the Bark-scaled Gaussian filter bank. Thus, after filter superposition, the linear filter bank compensates for the high-frequency information that the Bark-scaled Gaussian filter bank covers only sparsely. In addition, the multi-dimensional central difference method better captures the dynamic features of speakers, which improves short-utterance speaker verification performance. At the network end, to make the embeddings more discriminative, we propose a novel Res2Block-based bidirectional sampling multi-scale feature aggregation method. The Res2Block-based bidirectional sampling architecture enhances the embeddings by aggregating effective local or global features from different layer levels at multiple scales. Extensive experiments are performed on short-duration text-independent speaker verification datasets derived from the VoxCeleb, SITW, and NIST SRE corpora, which contain speech samples of varying lengths and scenarios. The results demonstrate that the proposed method outperforms the existing acoustic feature extraction approach and state-of-the-art deep learning architectures by at least 11% and 18%, respectively, on the test set.
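The two acoustic-end ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the Bark formula, the filter shapes, the filter counts, or how the banks are superposed. The sketch assumes the Traunmüller Bark approximation, triangular filters for the linear bank, superposition by stacking the two banks, and a "multi-dimensional" difference taken along the time and coefficient axes. All function names and parameter defaults are hypothetical.

```python
import numpy as np

def hz_to_bark(f):
    # Traunmueller approximation (an assumption; the abstract names no formula).
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    # Algebraic inverse of hz_to_bark.
    return 1960.0 * (z + 0.53) / (26.28 - z)

def gaussian_bank(centers_hz, freqs_hz):
    # One Gaussian per center; width follows the local spacing between centers.
    sigma = np.gradient(centers_hz) / 2.0
    diff = freqs_hz[None, :] - centers_hz[:, None]
    return np.exp(-0.5 * (diff / sigma[:, None]) ** 2)

def triangular_bank(edges_hz, freqs_hz):
    # Triangular filters between consecutive edges (assumed shape of the linear bank).
    lo, mid, hi = edges_hz[:-2, None], edges_hz[1:-1, None], edges_hz[2:, None]
    f = freqs_hz[None, :]
    rising = (f - lo) / (mid - lo)
    falling = (hi - f) / (hi - mid)
    return np.clip(np.minimum(rising, falling), 0.0, None)

def superposed_bank(n_bark=20, n_linear=20, n_fft=512, sr=16000):
    # Hypothetical sizes. Bark-spaced centers crowd the low frequencies,
    # while linearly spaced filters cover the high band more densely.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    z = np.linspace(hz_to_bark(0.0), hz_to_bark(sr / 2.0), n_bark + 2)[1:-1]
    bark_centers = bark_to_hz(z)
    linear_edges = np.linspace(0.0, sr / 2.0, n_linear + 2)
    # Superposition is read here as stacking both banks (an assumption).
    return np.vstack([gaussian_bank(bark_centers, freqs),
                      triangular_bank(linear_edges, freqs)])

def central_difference(features):
    # features has shape (frames, coefficients). np.gradient uses central
    # differences in the interior and one-sided differences at the edges.
    d_time = np.gradient(features, axis=0)
    d_coef = np.gradient(features, axis=1)
    return d_time, d_coef
```

Given a power spectrogram `power_spec` of shape `(frames, n_fft // 2 + 1)`, static features under these assumptions would be `np.log(power_spec @ superposed_bank().T + 1e-10)`, and `central_difference` applied to them yields the dynamic features along both axes.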