Abstract: Owing to its superior performance and compact parameter count, CAM++ has become the state-of-the-art model for speaker verification tasks. The model uses 2D convolutional blocks to extract front-end features, which are then fed into a densely connected time-delay neural network (D-TDNN) backbone to extract deep features. However, simply stacking 2D convolutions can generate a substantial amount of redundant features, which hinders efficient feature extraction. Moreover, although CAM++ already has relatively few parameters, there is still room for further optimization. To address these issues, this paper first replaces the dilated convolutions in the back-end network of CAM++ with depthwise separable convolutions, making the model more lightweight. Next, we introduce spatial and channel reconstruction convolution (SCConv) in the ResBlock module of CAM++ to reduce redundant features and streamline feature extraction. Finally, after SCConv, we apply a squeeze-and-excitation (SE) attention mechanism to model the interdependencies between channels and recalibrate each channel, further enhancing the model's representational capacity. We name the resulting model LE-CAM++. The proposed model achieves an EER of 0.686 and a minDCF of 0.084 on the VoxCeleb1-O dataset. Compared with the baseline CAM++, the EER is reduced by 11% and the minDCF by 28%, while the model's parameter count is reduced by 8%.
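Two of the building blocks named in the abstract, depthwise separable convolution and squeeze-and-excitation, are standard operators, and the following minimal PyTorch sketch illustrates how they can be realized for 1-D (time-axis) features of the kind a TDNN backbone processes. The class names, channel sizes, and reduction ratio below are illustrative assumptions, not the paper's actual implementation, and the SCConv module (a separately published operator with spatial and channel reconstruction units) is omitted for brevity.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise separable 1-D convolution: a per-channel (depthwise)
    convolution followed by a 1x1 pointwise convolution. This factorization
    uses far fewer parameters than a standard dense convolution of the
    same kernel size, which is the lightweighting effect the paper exploits."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # "same" padding for odd kernels
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=padding, dilation=dilation,
                                   groups=in_channels)
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

class SEBlock1d(nn.Module):
    """Squeeze-and-excitation: global average pooling over time ("squeeze"),
    a two-layer bottleneck MLP with sigmoid gating ("excitation"), and a
    per-channel rescaling that recalibrates each channel."""
    def __init__(self, channels, reduction=8):  # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):             # x: (batch, channels, time)
        w = x.mean(dim=-1)            # squeeze: (batch, channels)
        w = self.fc(w).unsqueeze(-1)  # excitation: per-channel weights in (0, 1)
        return x * w                  # recalibrate each channel

# Hypothetical usage: a lightweight block in the spirit of the paper's design,
# pairing a depthwise separable convolution with SE channel recalibration.
block = nn.Sequential(DepthwiseSeparableConv1d(512, 512, kernel_size=3, dilation=2),
                      SEBlock1d(512))
out = block(torch.randn(4, 512, 200))  # (batch=4, channels=512, frames=200)
```

For reference, a dense Conv1d(512, 512, 3) has roughly 512 * 512 * 3 weights, whereas the depthwise-plus-pointwise pair above has about 512 * 3 + 512 * 512, a reduction of nearly a factor of three at this kernel size; this is the kind of saving behind the reported 8% drop in total parameters.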