Abstract: Speaker recognition is a key component of emerging Internet of Things (IoT) smart services, such as voice control and personalized applications. Although speaker recognition systems can attain excellent performance on synthetic datasets, operation in the real world can lead to a significant degradation in performance. The key reason is the lack of sufficient labeled data for model adaptation, primarily due to the cost of manual annotation and enrollment. A recent solution to this problem, exemplified by Simultaneous Clustering and Naming (SCAN), is to use cross-modal identifiers, e.g. WiFi sniffing, to gradually associate an identity with a particular vocal feature. In this paper, we demonstrate how to further improve the performance of these cross-modal systems in the wild by iteratively adapting the feature extractor based on the output of the noisy association and clustering step. We show that this feedback loop improves not only overall accuracy but also the labeling coverage of the association results. iSCAN is a further step towards a robust, zero-effort speaker recognition system for the IoT.