Abstract: The objective of this work is speaker recognition under noisy and unconstrained conditions. We
make two key contributions. First, we introduce a very large-scale audio-visual dataset collected
from open-source media using a fully automated pipeline. Most existing datasets for speaker
identification contain samples obtained under quite constrained conditions, and usually require
manual annotations, hence are limited in size. We propose a pipeline based on computer vision
techniques to create the dataset from open-source media. Our pipeline involves obtaining videos
from YouTube, performing active speaker verification using a two-stream synchronization
Convolutional Neural Network (CNN), and confirming the identity of the speaker using CNN-based
facial recognition. We use this pipeline to curate VoxCeleb, which contains over
a million ‘real-world’ utterances from over 6000 speakers. This is several times larger than any
publicly available speaker recognition dataset. Second, we develop and compare different CNN
architectures with various aggregation methods and training loss functions that can effectively
recognise identities from voice under various conditions. The models trained on our dataset
surpass the performance of previous works by a significant margin.