Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization

TMLR Paper3618 Authors

03 Nov 2024 (modified: 03 Dec 2024) · Under review for TMLR · CC BY 4.0
Abstract: By ensuring differential privacy in the learning algorithm, one can rigorously mitigate the risk of large models memorizing sensitive training data. In this paper, we study two algorithms for this purpose, DP-SGD and DP-NSGD, which first clip or normalize \textit{per-sample} gradients to bound the sensitivity and then add noise to obfuscate the exact information. We analyze the convergence behavior of both algorithms in the non-convex empirical risk minimization setting under two common assumptions, and obtain a rate $\mathcal{O}\left(\sqrt[4]{\frac{d\log(1/\delta)}{N^2\epsilon^2}}\right)$ on the gradient norm for a $d$-dimensional model, $N$ samples, and $(\epsilon,\delta)$-DP, which improves over previous bounds while requiring much weaker assumptions. In particular, we introduce a regularizing factor in DP-NSGD and show that it is crucial in the convergence proof and subtly controls the bias-noise trade-off. Our proof deliberately handles the per-sample gradient clipping and normalization that are specific to the private setting. Empirically, we demonstrate that the two algorithms achieve similar best accuracy, while DP-NSGD is comparatively easier to tune than DP-SGD.
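To make the per-sample operations concrete, below is a minimal NumPy sketch of one update step for each algorithm. The function names, the assumption that per-sample gradients arrive as a `[batch, dim]` array, and the exact noise calibration are illustrative choices under standard DP-SGD conventions, not the paper's pseudocode.

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, c, sigma, lr, rng):
    """DP-SGD: clip each per-sample gradient to norm at most c (bounding the
    sensitivity), sum, add Gaussian noise scaled to c, and average."""
    batch, dim = per_sample_grads.shape
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # guard zero norms
    clipped = per_sample_grads * scale
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, sigma * c, size=dim)
    return params - lr * noisy_sum / batch

def dp_nsgd_step(params, per_sample_grads, r, sigma, lr, rng):
    """DP-NSGD: normalize each per-sample gradient by its norm plus the
    regularizing factor r, so every contribution has norm strictly below 1;
    the noise is then calibrated to that unit sensitivity bound."""
    batch, dim = per_sample_grads.shape
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    normalized = per_sample_grads / (norms + r)
    noisy_sum = normalized.sum(axis=0) + rng.normal(0.0, sigma, size=dim)
    return params - lr * noisy_sum / batch

# Example usage with synthetic gradients:
rng = np.random.default_rng(0)
params = np.zeros(10)
grads = rng.normal(size=(32, 10))
params = dp_sgd_step(params, grads, c=1.0, sigma=1.0, lr=0.1, rng=rng)
params = dp_nsgd_step(params, grads, r=0.01, sigma=1.0, lr=0.1, rng=rng)
```

The regularizer `r` is the knob the abstract refers to: as `r` grows, normalization behaves more like plain averaging (less bias, relatively more noise impact), while a small `r` pushes every contribution toward unit norm (more bias, uniform sensitivity).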
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=wLg9JrwFvL
Changes Since Last Submission: The previous AC made the following major comments:

"Based on my own reading of the paper, I have some additional comments the authors should address:
1. Proposition E.1(iii) claims that all $(\epsilon, \delta)$-DP mechanisms are RDP as well. This is clearly wrong, because there exist $(\epsilon, \delta)$-DP mechanisms that are not DP for any $\delta' < \delta$ for any $\epsilon$, while RDP implies $\delta \to 0$ as $\epsilon \to \infty$.
2. The statements of Proposition E.1(iv) and (v) need to be clarified regarding the ambiguous "for any": does it mean "for all" or "there exists"?
3. Corollary 3.5 suggests that the gradient norm could be made arbitrarily small by decreasing the clipping threshold $c$. Can you please explain how this is not in contradiction with the known lower bounds for DP learning? Can you please also explain why, in your experimental results, larger $c$ leads to better performance?
4. I would appreciate some discussion of the effect of the smoothness parameters $(L_0, L_1)$ on the convergence bounds.
5. Please explain your use of the RDP accountant in the experiments in more detail. The references you cite do not provide theory for RDP accounting for subsampling without replacement with substitute adjacency, which you are using in your algorithms.
6. In the caption of Fig. 3: $\delta$ should probably not be $\exp(-5)$ as currently indicated."

In response to these comments, we have made the following changes:
1. On page 7, we add a paragraph in blue discussing the $(L_0, L_1)$-smoothness condition that we use.
2. On page 8, we add a paragraph in blue discussing why Corollary 3.5 does not imply that $c$ should be chosen as small as possible.
3. On page 30, we have carefully rewritten Proposition E.1, which collects a few basic facts about several different notions of differential privacy.
4. On page 11, right after the title of Section 4, we add a paragraph in blue about the privacy accountant.
5. Multiple typos corrected (including the caption of Figure 3).
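For context on the AC's first comment, the tension rests on the standard RDP-to-DP conversion (Mironov, 2017); the following is a textbook fact stated here for the reader's convenience, not a claim from the paper:

```latex
% Standard conversion from Renyi DP to approximate DP:
\[
  (\alpha,\varepsilon_\alpha)\text{-RDP}
  \;\Longrightarrow\;
  \Bigl(\varepsilon_\alpha + \tfrac{\log(1/\delta)}{\alpha-1},\; \delta\Bigr)\text{-DP}
  \quad\text{for every } \delta \in (0,1).
\]
% Fixing a target \varepsilon and solving gives
% \delta = \exp\bigl(-(\alpha-1)(\varepsilon - \varepsilon_\alpha)\bigr),
% which tends to 0 as \varepsilon \to \infty. Hence any RDP mechanism is
% (\varepsilon,\delta)-DP with \delta \to 0 as \varepsilon grows, so a
% mechanism with a hard floor \delta' below which it is never DP cannot
% satisfy RDP.
```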
Assigned Action Editor: ~Antti_Honkela1
Submission Number: 3618