Abstract: In this work, we study the large-scale pretraining of BERT-Large with differentially private SGD (DP-SGD). We show that, combined with a careful implementation, scaling up the batch size to millions (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance its efficiency by using an increasing batch size schedule. Our implementation builds on the recent work of Subramani et al., who demonstrated that the overhead of a DP-SGD step is minimized with effective use of JAX primitives in conjunction with the XLA compiler. Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for epsilon = 5. To put this number in perspective, non-private BERT models achieve an accuracy of about 70%.
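To illustrate the kind of DP-SGD step the abstract refers to, the following is a minimal sketch of per-example gradient clipping and Gaussian noise addition written with JAX primitives (vmap, grad) under a single jit/XLA compilation, in the spirit of Subramani et al. The loss function, model, clip norm, and noise multiplier below are hypothetical placeholders for illustration only and are not the paper's BERT-Large configuration.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Hypothetical linear-model loss; stands in for the BERT masked-LM loss.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


def clip_grad(grad, max_norm):
    # Clip a single example's gradient tree to L2 norm <= max_norm.
    leaves = jax.tree_util.tree_leaves(grad)
    total_norm = jnp.sqrt(sum(jnp.sum(g ** 2) for g in leaves))
    scale = jnp.minimum(1.0, max_norm / (total_norm + 1e-12))
    return jax.tree_util.tree_map(lambda g: g * scale, grad)


@jax.jit
def dp_sgd_step(params, x_batch, y_batch, key,
                lr=0.1, max_norm=1.0, noise_mult=1.0):
    # Per-example gradients via vmap over the batch dimension.
    per_ex_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(
        params, x_batch, y_batch)
    # Clip each example's gradient, then sum over the batch.
    clipped = jax.vmap(clip_grad, in_axes=(0, None))(per_ex_grads, max_norm)
    summed = jax.tree_util.tree_map(lambda g: jnp.sum(g, axis=0), clipped)
    # Add Gaussian noise calibrated to the clip norm, then average over the batch.
    leaves, treedef = jax.tree_util.tree_flatten(summed)
    keys = jax.random.split(key, len(leaves))
    noisy_leaves = [
        (g + noise_mult * max_norm * jax.random.normal(k, g.shape)) / x_batch.shape[0]
        for g, k in zip(leaves, keys)
    ]
    noisy = jax.tree_util.tree_unflatten(treedef, noisy_leaves)
    # Plain SGD update with the privatized gradient.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, noisy)
```

Because the whole step is jit-compiled, XLA can fuse the per-example clipping and noising with the gradient computation, which is what keeps the per-step overhead of DP-SGD small even at mega-batch sizes.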
Paper Type: long