Abstract: With advances in deep learning, speaker verification systems have significantly improved their performance in noisy environments. Researchers typically demonstrate the effectiveness of an improved model by comparing its performance on specific datasets, such as the VoxCeleb benchmark. However, under diverse real-world noise conditions, out-of-domain generalization is also a crucial factor in judging whether a model has actually improved. Research on stable learning indicates that eliminating spurious correlations between training and testing data can enhance a model’s generalization. Building on this idea, we propose a speaker verification system with high generalization, based on the extended U-Net (ExU-Net). It uses the sample reweighting method from stable learning to eliminate sample correlations, and it retains more effective speaker information through subpixel convolutions and coordinate attention mechanisms. We validate the approach through extensive evaluations on VoxCeleb1, VOiCES, and other out-of-domain noise test sets, highlighting its generalization capability and robustness.
External IDs: dblp:conf/icassp/WangF025