Abstract: Foundation models in speech are often trained on many GPUs, which implicitly results in large effective batch sizes. In this paper we study the effect of batch size on pre-training with the contrastive method wav2vec 2.0, both in terms of statistics that can be monitored during pre-training and in terms of the effect on downstream fine-tuning performance. Using batch sizes ranging from 87.5 seconds to 80 minutes of speech, we show that, for a fixed number of iterations, larger batch sizes result in better pre-trained models, up to an upper limit beyond which further increases are no longer effective. We then show that the quality of the pre-trained model depends mainly on the amount of speech data seen during training, i.e., on the product of batch size and number of iterations.
Our findings can help researchers choose effective operating conditions when studying self-supervised learning in speech, and hint towards benchmarking self-supervision with a fixed amount of seen data. Code and model checkpoints are available at https://github.com/anonymous/available-after-review.
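To make the scaling claim concrete, the sketch below (an illustration, not code from the paper or its repository) computes the total amount of speech seen during pre-training as the product of the effective batch size, expressed in seconds of audio, and the number of iterations. The two batch-size endpoints come from the abstract; the iteration count of 400,000 is a hypothetical placeholder.

```python
# Illustrative sketch: total speech seen during pre-training is the product
# of the effective batch size (in seconds of audio) and the iteration count.

def speech_seen_hours(batch_size_seconds: float, iterations: int) -> float:
    """Total speech observed during pre-training, in hours."""
    return batch_size_seconds * iterations / 3600.0

# Batch-size endpoints taken from the abstract (87.5 s and 80 min);
# 400_000 iterations is a hypothetical value chosen purely for illustration.
for batch_seconds in (87.5, 80 * 60):
    print(f"{batch_seconds:>7.1f} s/batch -> "
          f"{speech_seen_hours(batch_seconds, 400_000):,.0f} hours seen")
```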
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: 1. Removed a duplicate use of "all use" in the first two sentences of the introduction.
2. Fixed a typo in Figure 5 and updated the x-axis labels to not include the 'K' suffix.
3. Included a new Appendix A.2, which shows the data in Figure 1 with the number of hours observed on the x-axis instead of the number of iterations.
4. Changed the abstract and RQs 1 and 2 to be more specific to wav2vec 2.0.
5. Added Section 4.5, where we show experimental results with a larger model and a different pre-training dataset.
6. Modified our conclusion section to be more specific to wav2vec 2.0, and rephrased it to be clearer about the critical batch size not having an effect on performance.
7. Added citations to linear and square-root optimizer scaling laws in 1) the hypothesis for RQ 3, 2) the related work section, and 3) Section 4.1.
Assigned Action Editor: ~Tatiana_Likhomanenko1
Submission Number: 3254