Student Author Indication: No
Keywords: speech recognition, transformer models, federated learning, differential privacy, large models
TL;DR: We compare observations from prior work on federated learning and differential privacy with our findings in the context of ASR
Abstract: While automatic speech recognition (ASR) has witnessed remarkable achievements in recent years, it has not garnered widespread attention within the federated learning (FL) and differential privacy (DP) communities. Meanwhile, ASR is a well-suited benchmark for FL and DP because there is (i) a natural data split across users via speaker information; (ii) heterogeneous data across speakers, close to practical settings; (iii) an interplay between acoustic and language modeling; and (iv) it is a sequence-to-sequence task. Recent production-ready state-of-the-art ASR models include *large* conformer and transformer models, whose optimization is known to pose challenges even in central training. While the main trends and benchmarks in FL and DP focus on *small* models, we show the necessity of disentangling optimization and model size: the behavior of FL and DP for *large* models differs from that for *small* models. We speculate that FL and DP are harder for *small* models because their optimization problems are harder even in central training.
In this paper, we analyze the key FL parameters (optimizers, training from scratch or from a seed model pre-trained centrally, cohort size, data heterogeneity) and propose the *first* benchmark of *FL with DP* in the context of *large* models in ASR. We examine the applicability of prior results and summarize where our observations depart from trends reported in prior work and from those seen when training different ASR models. Through this work, we provide researchers and practitioners in the fields of FL and DP with valuable insights into the fundamental differences that may arise when applying FL and DP research to large-scale ASR training.
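For readers unfamiliar with how FL and DP combine, the sketch below illustrates a generic DP-FedAvg-style server step: each client's model update is clipped to a fixed norm, the clipped updates are averaged, and Gaussian noise calibrated to the clip norm and cohort size is added. This is a minimal illustration of the general technique, not the paper's method; all function and parameter names here are hypothetical.

```python
import numpy as np

def dp_fedavg_round(global_weights, client_updates, clip_norm=1.0,
                    noise_multiplier=1.0, rng=None):
    """One illustrative round of DP federated averaging.

    Clips each client update to clip_norm, averages the clipped
    updates, and adds Gaussian noise whose scale follows the
    standard Gaussian-mechanism calibration (hypothetical example,
    not the paper's exact algorithm).
    """
    rng = rng or np.random.default_rng()

    # Clip each client's update so its L2 norm is at most clip_norm,
    # bounding each client's contribution (the DP sensitivity).
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip_norm / (norm + 1e-12)))

    mean_update = np.mean(clipped, axis=0)

    # Sensitivity of the mean over a cohort is clip_norm / cohort_size,
    # so the noise std scales accordingly with the noise multiplier.
    sigma = noise_multiplier * clip_norm / len(client_updates)
    noisy_update = mean_update + rng.normal(0.0, sigma, size=mean_update.shape)

    return global_weights + noisy_update
```

The clip norm, noise multiplier, and cohort size in such a scheme interact directly with the optimizer and model scale, which is one reason the FL parameters analyzed in the paper matter for large ASR models.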
Submission Number: 45