Abstract: We introduce SpidR, a self-supervised speech representation model that efficiently learns strong representations for spoken language modeling. It is trained on unlabelled speech using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher's intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher-quality codebooks. SpidR outperforms previous state-of-the-art methods on downstream language modeling metrics while significantly reducing pretraining time, requiring only a day of pretraining on 16 GPUs instead of a week. We will open-source the training code and model checkpoints upon acceptance.
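To make the objective concrete, here is a minimal sketch of masked prediction of teacher cluster assignments with an EMA teacher, in the spirit of self-distillation methods such as DINO. Every function, tensor shape, and hyperparameter below is illustrative; this is not the authors' SpidR implementation, and the codebook update itself (the online clustering step) is omitted.

```python
import torch
import torch.nn.functional as F

def masked_assignment_loss(student_feats, teacher_feats, codebook, mask,
                           tau_s=0.1, tau_t=0.05):
    """Student predicts the teacher's soft cluster assignments at masked positions.

    student_feats, teacher_feats: (batch, time, dim) hidden states from one
        intermediate layer of the student / teacher.
    codebook: (num_clusters, dim) prototypes maintained by online clustering
        (update step not shown here).
    mask: (batch, time) boolean tensor marking masked input positions.
    """
    # Teacher targets: soft assignments over codebook entries, no gradient.
    with torch.no_grad():
        targets = (teacher_feats @ codebook.t() / tau_t).softmax(dim=-1)
    # Student prediction: cross-entropy against the teacher assignments,
    # computed only where the input was masked.
    log_probs = F.log_softmax(student_feats @ codebook.t() / tau_s, dim=-1)
    loss = -(targets * log_probs).sum(dim=-1)
    return loss[mask].mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # The teacher is not trained directly; its weights track the student
    # as an exponential moving average, as is standard in self-distillation.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)
```

In setups of this kind, the lower teacher temperature (`tau_t < tau_s`) sharpens the targets, and stopping gradients through the teacher prevents representation collapse; the abstract's claim is that supervising several intermediate layers this way additionally stabilizes the online clustering that produces the codebook.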
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Tatiana_Likhomanenko1
Submission Number: 5256