Sampling Rate Adaptive Speaker Verification from Raw Waveforms

Published: 01 Jan 2024, Last Modified: 10 Feb 2025, ICPR (28) 2024, CC BY-SA 4.0
Abstract: The performance of a Speaker Verification (SV) system degrades substantially when the audio sampling rate (SR) differs between training, testing, and deployment conditions. This mismatch can be addressed by fine-tuning the model on resampled data, by mixed-bandwidth training, or by bandwidth extension via generative modelling approaches. However, existing SV models are typically designed to operate at a single sampling rate. This work presents a dynamic sampling rate filter-bank (DSR-FB) frontend for end-to-end SV systems. It employs multi-resolution convolutions with dynamic attention to learn at multiple scales. In particular, locally-consistent depthwise deformed convolutions are used to achieve an SR-dependent adaptive receptive field that focuses on regions of interest in a coarse-to-fine manner. We demonstrate the effectiveness of DSR-FB on publicly available datasets, where our best model achieves state-of-the-art performance in both close-talk and far-field settings.