Abstract: We introduce SpidR, a self-supervised speech representation model that efficiently learns strong representations for spoken language modeling. It is trained on unlabelled speech using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict cluster assignments derived from the teacher's intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher-quality codebooks. SpidR outperforms previous state-of-the-art methods on downstream language modeling metrics while significantly reducing pretraining time: a day on 16 GPUs instead of a week. We will open-source the training code and model checkpoints upon acceptance.
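Since the abstract describes the objective only at a high level, here is a minimal sketch of a masked self-distillation loss with teacher-derived cluster assignments, in the spirit of what is described above. Everything in it (function names, shapes, the soft-assignment form of the targets, the temperature, the EMA decay) is an illustrative assumption, not SpidR's actual implementation:

```python
# Illustrative sketch only, not the authors' code: a student layer predicts
# cluster assignments derived from the matching teacher layer, with the loss
# computed on masked positions.
import torch
import torch.nn.functional as F

def teacher_assignments(teacher_feats, codebook, temp=0.1):
    # Soft cluster assignments of teacher features over an online codebook.
    # teacher_feats: (batch, time, dim); codebook: (num_clusters, dim)
    sims = F.normalize(teacher_feats, dim=-1) @ F.normalize(codebook, dim=-1).T
    return F.softmax(sims / temp, dim=-1)  # (batch, time, num_clusters)

def distill_loss(student_logits, teacher_feats, codebook, mask):
    # Cross-entropy between teacher-derived assignments (targets, no gradient)
    # and student predictions; mask is a float {0,1} tensor over time steps.
    with torch.no_grad():
        targets = teacher_assignments(teacher_feats, codebook)
    log_probs = F.log_softmax(student_logits, dim=-1)
    per_pos = -(targets * log_probs).sum(-1)          # (batch, time)
    return (per_pos * mask).sum() / mask.sum().clamp(min=1)

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Self-distillation: teacher weights track the student via an
    # exponential moving average.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)
```

Per the abstract, the actual objective applies this kind of prediction at several matched intermediate layers (presumably summing the per-layer losses), and the codebook itself is maintained by an online clustering procedure, which this sketch leaves out.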
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Added a comparison between supervised evaluation and SLM evaluation of speech encoders at the end of the appendix (Figure 16).
- Added a discussion of disentangled representations of non-linguistic components of speech in the conclusion.
- Updated our LM training pipeline to the latest version of fairseq2 and retrained with PyTorch reproducibility flags enabled. This results only in slight variations in absolute SLM scores compared to the previous version.
Assigned Action Editor: ~Tatiana_Likhomanenko1
Submission Number: 5256