TB-ResNet: Bridging the Gap from TDNN to ResNet in Automatic Speaker Verification with Temporal-Bottleneck Enhancement

Sunmook Choi; Sanghyeok Chung; Seungeun Lee; Soyul Han; Taein Kang; Jaejin Seo; Il-Youp Kwak; Seungsang Oh

Published: 01 Jan 2024, Last Modified: 10 Jan 2025, Venue: ICASSP 2024, License: CC BY-SA 4.0

Abstract: This paper focuses on the transition of automatic speaker verification systems from time delay neural networks (TDNN) to ResNet-based networks. TDNN-based systems use a statistics pooling layer to aggregate temporal information, a layer that is suited to two-dimensional tensors. ResNet-based models produce three-dimensional tensors, yet they continue to incorporate the statistics pooling layer. However, convolution operations in ResNet reduce the spatial dimensions, including the temporal axis, which raises concerns about temporal information loss and about compatibility with statistics pooling. To address this, we introduce Temporal-Bottleneck ResNet (TB-ResNet), a ResNet-based system that exploits the nature of statistics pooling more effectively by capturing and retaining frame-level contexts through a temporal bottleneck configuration in its building blocks. TB-ResNets outperform their original ResNet counterparts on VoxCeleb1, achieving a significant reduction in both the equal error rate and the minimum detection cost function.
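To make the tensor-shape issue in the abstract concrete, the following is a minimal NumPy sketch of statistics pooling, assuming the common formulation in which the per-channel mean and standard deviation over time are concatenated. It is an illustration only, not the authors' implementation: the function names, the tensor shapes (512 channels by 200 frames for the TDNN case, 256 channels by 10 frequency bins by 25 frames for the ResNet case), and the choice to flatten the channel and frequency axes before pooling are all assumptions made for the example.

```python
import numpy as np


def statistics_pooling(frames):
    """Concatenate the per-channel mean and standard deviation over time.

    frames: array of shape (channels, time), as a TDNN would produce.
    Returns a vector of shape (2 * channels,).
    """
    mean = frames.mean(axis=-1)
    std = frames.std(axis=-1)
    return np.concatenate([mean, std])


def resnet_statistics_pooling(feature_map):
    """Apply statistics pooling to a 3-D feature map.

    feature_map: array of shape (channels, frequency, time).
    Assumption: channel and frequency axes are merged so that the
    two-dimensional pooling above can be reused unchanged.
    """
    channels, freq, time = feature_map.shape
    return statistics_pooling(feature_map.reshape(channels * freq, time))


rng = np.random.default_rng(0)

# Hypothetical TDNN output: 512 channels, 200 frames.
tdnn_out = rng.standard_normal((512, 200))

# Hypothetical ResNet output: strided convolutions have shrunk the
# temporal axis from 200 frames to 25.
resnet_out = rng.standard_normal((256, 10, 25))

tdnn_embedding = statistics_pooling(tdnn_out)
resnet_embedding = resnet_statistics_pooling(resnet_out)

print(tdnn_embedding.shape)
print(resnet_embedding.shape)
```

In this sketch the ResNet statistics are estimated from only 25 time steps instead of 200, which illustrates the kind of temporal reduction the abstract raises as a concern for statistics pooling.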
