%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} 
% after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

%%own package

\usepackage{smile}
%%my own definition
\newcommand{\shortsection}[1]{\vspace{1ex}\noindent{\bf #1.}}

\title{Efficient Privacy-Preserving Stochastic Nonconvex Optimization}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
%  Lingxiao Wang$^*$, Bargav Jayaraman$^\dag$, David Evans$^\dag$, Quanquan Gu$^\S$ \\[1.5ex]
% $*$: \emph{Toyota Technological Institute at Chicago} \\
%  {\small \sf lingxw@ttic.edu} \\[1.2ex]
%  $\dag$: \emph{Department of Computer Science, University of Virginia} \\
%  {\small \sf [bj4nq, evans]@virginia.edu} \\[1.2ex]
%   $\S$: \emph{Department of Computer Science, University of California, Los Angeles} \\
%  {\small \sf qgu@cs.ucla.edu} 
\author[1]{{Lingxiao Wang}{}}
\author[2]{Bargav Jayaraman}
\author[2]{David Evans}
\author[3]{Quanquan Gu}
% Add affiliations after the authors
\affil[1]{%
    Toyota Technological Institute at Chicago
}
\affil[2]{%
Department of Computer Science, University of Virginia
}
\affil[3]{%
Department of Computer Science, University of California, Los Angeles
  }
  
  \begin{document}
\maketitle

\begin{abstract}
While many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm reduces gradient complexity while matching the best-known utility guarantee. Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility compared with previous differentially private methods using the same privacy budgets.
\end{abstract}

\section{Introduction}\label{sec:intro}
For many important domains such as health care and medical research, the datasets used to train machine learning models contain sensitive personal information. There is a risk that models trained on this data can reveal private information about individual records in that training data~\citep{fredrikson2014privacy,shokri2017membership,carlini2018secret}. 
This motivates research on privacy-preserving machine learning, much of which has focused on achieving \emph{differential privacy}~\citep{dwork2006calibrating}, a rigorous definition of privacy that provides statistical data privacy for individual records. In the past decade, many differentially private machine learning algorithms for solving the empirical risk minimization (ERM) problem 
have been proposed \citep[e.g.,][]{chaudhuri2011differentially,kifer2012private,bassily2014private,zhang2017efficient,wang2017differentially,jayaraman2018distributed,wang2019differentiallyhd,wang2019knowledge}. Almost all of these are for ERM with convex loss functions, 
but many important machine learning approaches, including deep learning, are formulated as ERM problems with nonconvex loss functions.
Furthermore, these learning problems often involve large training sets, necessitating the use of stochastic optimization algorithms such as stochastic gradient descent (SGD). 


Several recent studies have advanced the application of differential privacy in deep learning~\citep{abadi2016deep,papernot2016semi,brendan2018learning,bu2019deep}. While these studies prove differential privacy is satisfied, they evaluate utility experimentally. Only a few differentially private algorithms for solving nonconvex optimization problems have proven utility bounds \citep{zhang2017efficient,wang2017differentially}. For example, \citet{wang2017differentially} proposed a differentially private gradient descent (DP-GD) algorithm
with both privacy and utility guarantees. However, each iteration of DP-GD requires computing the full gradient, which makes it too expensive for use on large training sets. \citet{zhang2017efficient} proposed a random round private stochastic gradient descent (RRPSGD) algorithm that achieves the same privacy guarantee as DP-GD with reduced runtime complexity but slightly worse utility bounds. 
In this paper, we propose a differentially private Stochastic Recursive Momentum (DP-SRM) algorithm for nonconvex ERM.  
At the core of our algorithm is the stochastic recursive momentum technique \citep{cutkosky2019momentum}, which consistently reduces the accumulated variance of the gradient estimator. Our approach is more scalable than stochastic variance-reduced algorithms \citep{johnson2013accelerating,reddi2016stochastic,allen2016variance,lei2017non,nguyen2017sarah,fang2018spider,zhou2018stochastic} since it eliminates the periodic computation of the checkpoint gradient, which usually requires a very large batch size. A recent work \citep{arora2022faster} developed a differentially private variant of the stochastic variance-reduced algorithm \citep{wang2019spiderboost} called Private SpiderBoost. While Private SpiderBoost can achieve the same utility guarantee as our proposed DP-SRM algorithm, it requires periodic full gradient computation, making it
less scalable and resulting in worse gradient complexity.

\shortsection{Contributions} The main contributions of our paper are summarized as follows:
\begin{itemize}[leftmargin=*]
    \item We develop a new differentially private stochastic optimization algorithm for nonconvex ERM and provide a sharp analysis of the privacy guarantee using R\'enyi Differential Privacy (RDP) \citep{mironov2017renyi}.
    \item Our algorithm improves the previous best-known utility guarantee for nonconvex optimization with lower computational complexity. The utility guarantee of our algorithm is $O\big((d\log(1/\delta))^{1/3}/(n\epsilon)^{2/3}\big)$\footnote{A recent work \citep{arora2022faster} also achieves this utility guarantee with a worse gradient complexity.}, which is better than the previous best-known results of $O\big((d\log(1/\delta))^{1/4}/(n\epsilon)^{1/2}\big)$ established in \citet{wang2017differentially}. The gradient complexity (i.e., the number of stochastic gradients calculated in total) of our algorithm is $O\big((n\epsilon)^{2}/(d\log(1/\delta))\big)$, which outperforms the best previous results \citep{zhang2017efficient,wang2017differentially} when the problem dimension $d$ is large (see Table \ref{table:Comparision} for more details).
    \item We evaluate our proposed methods on two nonconvex ERM techniques: nonconvex logistic regression and convolutional neural networks.  We report on experiments on several benchmark datasets (Section~\ref{sec:exp}), finding that our method not only produces models that are the closest to the non-private models in terms of model accuracy but also reduces the computational cost.
\end{itemize}

% We develop a new differentially private stochastic optimization algorithm for nonconvex ERM and provide a sharp analysis of the privacy guarantee using R\'enyi Differential Privacy (RDP) \citep{mironov2017renyi} (Section~\ref{sec:proof_outline}). Our algorithm improves the best-known utility guarantee for nonconvex optimization, with lower computational complexity. The utility guarantee of our algorithm is $O\big((d\log(1/\delta))^{1/3}/(n\epsilon)^{2/3}\big)$, which is better than the best-known results $O\big((d\log(1/\delta))^{1/4}/(n\epsilon)^{1/2}\big)$ established in \citet{wang2017differentially}. The gradient complexity (i.e., the number of stochastic gradients calculated in total) of our algorithm is $O\big((n\epsilon)^{2}/(d\log(1/\delta))\big)$, which outperforms the best previous results \citep{zhang2017efficient,wang2017differentially} when the problem dimension $d$ is large (see Table \ref{table:Comparision} for more details).
% We evaluate our proposed methods on two nonconvex ERM techniques: nonconvex logistic regression and convolutional neural networks.  We report on experiments on several benchmark datasets (Section~\ref{sec:exp}), finding that our method not only produces models that are the closest to the non-private models in terms of model accuracy but also reduces the computational cost. %\dnote{surprised by this claim - earlier, we were just claiming to match accuracy results, but reduce cost? are we able to focus on improving accuracy, or do experimental results only support cost reduction?}\lnote{experimental results show both improved accuracy (although slightly) and reduced cost, I revised it a little bit.}

\noindent\textbf{Notation.}
We use calligraphic symbols such as $\cB$ to denote index sets. For a set $\cB$, we use $|\cB|$ to denote its cardinality. For a finite-sum function $F=\sum_{i=1}^nf_i/n$, we denote by $F_{\cB}$ the function $\sum_{i\in\cB}f_i/|\cB|$. For a $d$-dimensional vector $\xb\in\RR^d$, we use $\|\xb\|_2$ to denote its $\ell_2$-norm. Given two sequences $\{a_n\}$ and $\{b_n\}$, if there exists a constant $0<C<\infty$ such that $a_n\leq Cb_n$ for all $n$, we write $a_n = O(b_n)$. If there exist constants $0<C_1,C_2<\infty$ such that $C_1b_n\leq a_n\leq C_2b_n$ for all $n$, we write $a_n=\Theta(b_n)$. We use $n$ and $d$ to represent the number of training examples and the problem dimension, respectively. We also use the standard notation for $(\epsilon,\delta)$-DP, where $\epsilon$ is the privacy budget and $\delta$ is the failure probability.

\section{Related Work}\label{sec:related}
Over the past decade, many differentially private machine learning algorithms for convex ERM have been proposed. There are three main approaches to achieving differential privacy in such settings: output perturbation~\citep{wu2017bolt,zhang2017efficient}, objective perturbation~\citep{chaudhuri2011differentially,kifer2012private,iyengartowards}, and gradient perturbation~\citep{bassily2014private,wang2017differentially,jayaraman2018distributed}. However, other than the methods using gradient perturbation, it is very hard to generalize these methods to nonconvex ERM because of the difficulty in computing the sensitivity for nonconvex ERM. 
Thus, most differentially private algorithms for nonconvex ERM, including ours, are based on gradient perturbation. The problem with gradient perturbation approaches is that their iterative nature quickly consumes any reasonable privacy budget. Hence, the main challenge is to develop algorithms for nonconvex ERM that can provide sufficient utility while maintaining privacy with high computational efficiency.

Several recent works~\citep{abadi2016deep,papernot2016semi,xie2018differentially} studied deep learning with differential privacy. \citet{abadi2016deep} proposed a method called the moments accountant to keep track of the privacy cost of the stochastic gradient descent algorithm during the training process, which provides a strong privacy guarantee.
\citet{papernot2016semi} established the Private Aggregation of Teacher Ensembles (PATE) framework to improve the privacy guarantee of deep learning for classification tasks.  \citet{xie2018differentially} and \citet{yoon2018pate} investigated differentially private Generative Adversarial Nets (GANs) with different distance metrics. However, none of these works provide utility guarantees for their algorithms.

% Several recent works~\citep{abadi2016deep,papernot2016semi,papernot2018scalable,brendan2018learning,bu2019deep} studied deep learning  with differential privacy. For example, \citet{abadi2016deep} proposed the moments accountant to keep track of the privacy cost of stochastic gradient descent during the training process, which provides a strong privacy guarantee. However, none of these works provide utility guarantees for their algorithms. 
%\dnote{is it necessary to mention the GAN works here? there are lots of more closely rated deep learning privacy papers to cite instead of these}\lnote{I remove the GAN part and add several new citations for deep learning}
% \citet{abadi2016deep} proposed a method called moments accountant to keep track of the privacy cost of stochastic gradient descent during the training process, which provides a strong privacy guarantee.
% % i.e., a tight bound on the magnitude of the adding noise to ensure differential privacy. 
% \citet{papernot2016semi} established a Private Aggregation of Teacher Ensembles (PATE) framework to improve the privacy guarantee of deep learning for classification tasks.  \citet{xie2018differentially} and \citet{yoon2018pate} investigated the differentially private Generative Adversarial Nets (GAN) with different distance metrics. However, none of these works provide utility guarantees for their algorithms.
% \vspace{-0.1in}

\begin{table*}[tb]

    \caption{Comparison of different $(\epsilon,\delta)$-DP algorithms for nonconvex optimization. We report the utility bound in terms of $\EE\|\nabla F(\btheta^p)\|_2$, where $\btheta^p$ is the output of the differentially private algorithm, $\EE$ is taken over the randomness of the algorithm. We only present results in terms of $n,d,\epsilon,\delta$ and ignore other parameters for simplicity. *Although Private SpiderBoost and DP-SRM have the same utility guarantee, Private SpiderBoost requires periodic full gradient computation, which makes it less scalable and results in worse gradient complexity.}
	\begin{center}
		\begin{tabular}{ccc}
			\toprule
			Algorithm & Utility & Gradient Complexity\\
			\midrule
			RRPSGD \citep{zhang2017efficient} & $O\Big(\frac{(d\log(n/\delta)\log(1/\delta))^{1/4}}{(n\epsilon)^{1/2}}\Big)$ & $O\big(n^2\big)$\\
			\midrule
			DP-GD \citep{wang2017differentially} & $O\Big(\frac{(d\log(1/\delta))^{1/4}}{(n\epsilon)^{1/2}}\Big)$ & $O\Big(\frac{n^2\epsilon}{(d\log(1/\delta))^{1/2}}\Big)$\\
			\midrule
			Private SpiderBoost$^*$ \citep{arora2022faster} & $O\Big(\frac{(d\log(1/\delta))^{1/3}}{(n\epsilon)^{2/3}}\Big)$ & $O\Big(\frac{(n\epsilon)^{2}}{d\log(1/\delta)}+\frac{n^{5/3}\epsilon^{2/3}}{(d\log(1/\delta))^{1/3}}\Big)$\\
			\midrule
			\textbf{DP-SRM} & \multirow{2}{*}{$O\Big(\frac{(d\log(1/\delta))^{1/3}}{(n\epsilon)^{2/3}}\Big)$} & \multirow{2}{*}{$O\Big(\frac{(n\epsilon)^{2}}{d\log(1/\delta)}\Big)$}\\
			(This paper) & & \\
			\bottomrule
		\end{tabular}
		\label{table:Comparision}
	\end{center}
\end{table*}
Table \ref{table:Comparision} summarizes differentially private nonconvex optimization algorithms 
that provide utility guarantees for nonconvex ERM. The Random Round Private Stochastic Gradient Descent (RRPSGD) method developed by \citet{zhang2017efficient} is the first differentially private nonconvex optimization algorithm with a utility guarantee. %\dnote{why isn't this one first in Table 1?}\lnote{fixed} 
This method performs perturbed SGD (adding Gaussian noise to the stochastic gradients) for a random number of iterations \citep{ghadimi2013stochastic}. The gradient complexity of RRPSGD is $O(n^2)$, which makes it impractical for most settings. 
%\dnote{which makes it impractical for non-toy tasks?}\lnote{how about "which is computational inefficient"}. 
\citet{zhang2017efficient} showed that RRPSGD is able to find a stationary point in expectation with a diminishing error $O\big((d\log(n/\delta)\log(1/\delta))^{1/4}/(n\epsilon)^{1/2}\big)$. Their analysis of the privacy guarantee is based on the standard privacy-amplification 
by subsampling result and the strong composition theorem \citep{bassily2014private}. Although such an analysis can be easily adapted to the nonconvex setting with stochastic optimization algorithms, it results in a larger bound on the variance of the added noise than analyses based on relaxed definitions such as the moments accountant~\citep{abadi2016deep} and Gaussian differential privacy \citep{dong2019gaussian}. 
%\dnote{or GDP?}\lnote{add}  



%\dnote{should be second in Table 1?}\lnote{fixed}
\citet{wang2017differentially} proposed the Differentially Private Gradient Descent (DP-GD) algorithm for nonconvex optimization. DP-GD achieves an improved utility guarantee of $O\big((d\log(1/\delta))^{1/4}/(n\epsilon)^{1/2}\big)$ compared to that of RRPSGD, with a reduced gradient complexity of $O\big(n^2\epsilon/(d\log(1/\delta))^{1/2}\big)$. The reason DP-GD can achieve this factor of $O\big((\log(n/\delta))^{1/4}\big)$ improvement is that it uses the full gradient rather than the stochastic gradient. This makes DP-GD computationally very expensive or even intractable for large-scale machine learning problems (i.e., when $n$ is large). Recently, \citet{wang2019differentially} also proposed a differentially private stochastic algorithm for nonconvex optimization. 
%\dnote{should this be in Table 1 also (but with a note about this?}\lnote{I don't think so since their results are asymptotic}  okay
Their goal is to find a local minimum, while we aim to find a stationary point. In addition, their utility guarantee is asymptotic: it provides the desired utility guarantee only if an infinite number of iterations could be run. In contrast, our utility guarantee holds for a finite number of iterations. 

Recently, \citet{arora2022faster} developed a Private SpiderBoost algorithm for nonconvex optimization, which achieves an improved utility guarantee of $O\big((d\log(1/\delta))^{1/3}/(n\epsilon)^{2/3}\big)$ with gradient complexity of $O\big((n\epsilon)^2/(d\log(1/\delta))+n^{5/3}\epsilon^{2/3}/(d\log(1/\delta))^{1/3}\big)$. Although Private SpiderBoost attains the same improved utility guarantee as our method, it requires periodic full gradient computation, making it
less scalable and resulting in worse gradient complexity when $d\geq \sqrt{n}\epsilon^2/\log(1/\delta)$ (up to constants). 
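
To make the comparisons in Table \ref{table:Comparision} concrete, the following informal calculation ignores constants and uses the shorthand $\kappa=d\log(1/\delta)$, which is introduced only for this paragraph. The ratio between the utility bound of DP-SRM and that of DP-GD is
\begin{align*}
\frac{\kappa^{1/3}/(n\epsilon)^{2/3}}{\kappa^{1/4}/(n\epsilon)^{1/2}}=\bigg(\frac{\kappa^{1/2}}{n\epsilon}\bigg)^{1/6},
\end{align*}
so the DP-SRM bound is the smaller one whenever $n\epsilon\geq\kappa^{1/2}$, which is also the regime in which both bounds are at most one. For the gradient complexity, the ratios of the DP-SRM bound to the DP-GD and RRPSGD bounds are
\begin{align*}
\frac{(n\epsilon)^2/\kappa}{n^2\epsilon/\kappa^{1/2}}=\frac{\epsilon}{\kappa^{1/2}},\qquad \frac{(n\epsilon)^2/\kappa}{n^2}=\frac{\epsilon^2}{\kappa},
\end{align*}
both of which are at most one when $\kappa\geq\epsilon^2$. Finally, the second term in the gradient complexity of Private SpiderBoost dominates the first term when
\begin{align*}
\frac{n^{5/3}\epsilon^{2/3}}{\kappa^{1/3}}\geq\frac{(n\epsilon)^2}{\kappa}\quad\Longleftrightarrow\quad \kappa\geq\sqrt{n}\epsilon^2,
\end{align*}
that is, when the dimension $d$ is at least of order $\sqrt{n}\epsilon^2/\log(1/\delta)$.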


\section{Preliminaries}\label{sec:preliminary}
We consider the empirical risk minimization (ERM) problem: given a training set $S=\{(\xb_1,y_1),\ldots,(\xb_n,y_n)\}$ drawn from some unknown but fixed data distribution with $\xb_i \in \RR^D, y_i \in \cY \subseteq \RR$, we aim to find a solution $\hat \btheta \in \RR^d$ that minimizes the following empirical risk
\begin{align}\label{eq:finite_sum}
\min_{\btheta\in\RR^d}F(\btheta):=\frac{1}{n}\sum_{i=1}^nf_i(\btheta),
\end{align}
% \begin{align}\label{eq:finite_sum}
% F(\btheta):=n^{-1}\textstyle{\sum_{i=1}^n}f_i(\btheta),
% \end{align}
where $F(\btheta)$ is the empirical risk function (i.e., training loss), $f_i(\btheta) = \ell(\btheta;\xb_i,y_i)$ is the loss function defined on the $i$-th training example $(\xb_i,y_i)$, and $\btheta \in \RR^d$ is the model parameter we want to learn. 


Here, we provide some definitions and lemmas that will be used in our theoretical analysis. 
% First, a function $f:\RR^d\rightarrow\RR$ is $G$-Lipschitz, if for all $\btheta_1,\btheta_2\in\RR^d$, $	|f(\btheta_1)- f(\btheta_2)|\leq G\|\btheta_1-\btheta_2\|_2$, and it has $L$-Lipschitz gradient, if for all $\btheta_1,\btheta_2\in\RR^d$, $	\|\nabla f(\btheta_1)- \nabla f(\btheta_2)\|_2\leq L\|\btheta_1-\btheta_2\|_2.$

\begin{definition}
    A point $\btheta\in \RR^d$ is a $\zeta$-approximate stationary point of $F$ if $\|\nabla F(\btheta)\|_2\leq \zeta$.
\end{definition}

\begin{definition}
	A function $f:\RR^d\rightarrow\RR$ is $G$-Lipschitz, if for all $\btheta_1,\btheta_2\in\RR^d$, we have 
	% $	|f(\btheta_1)- f(\btheta_2)|\leq G\|\btheta_1-\btheta_2\|_2.$
	\begin{align*}
	    |f(\btheta_1)- f(\btheta_2)|\leq G\|\btheta_1-\btheta_2\|_2.
	\end{align*}
\end{definition}

\begin{definition}
	A function $f:\RR^d\rightarrow\RR$ has $L$-Lipschitz gradient, if for all $\btheta_1,\btheta_2\in\RR^d$, we have
	% $	\|\nabla f(\btheta_1)- \nabla f(\btheta_2)\|_2\leq L\|\btheta_1-\btheta_2\|_2.$
	\begin{align*}
	\|\nabla f(\btheta_1)- \nabla f(\btheta_2)\|_2\leq L\|\btheta_1-\btheta_2\|_2.
	\end{align*}
\end{definition}

% \begin{definition}
% 	A function $F:\RR^d\rightarrow\RR$ with finite sum structure in \eqref{eq:finite_sum} has stochastic gradients
% 	with bounded variance $\gamma^2$, if for all $\btheta\in\RR^d$, we have 
% 	\begin{align*}
% 	\EE_i\|\nabla f_i(\btheta)-\nabla F(\btheta)\|_2\leq \gamma^2,
% 	\end{align*}
% 	where $\EE_i$ is taken over $i$, and $i$ is a random index uniformly chosen from $[n]$.
% \end{definition}

%\subsection{Differential Privacy}

%We also introduce several notions of differential privacy. 
Differential privacy provides a formal notion of privacy, introduced by
\citet{dwork2006calibrating}:

\begin{definition}[$(\epsilon, \delta)$-DP~\citep{dwork2006calibrating}]
	A randomized mechanism $\cM:\cS^n\rightarrow\cR$ satisfies $(\epsilon,\delta)$-differential privacy if for any two adjacent data
	sets $S,S'\in \cS^n$ differing by one element, and any output subset $O\subseteq \cR$, it holds that 
	$\PP[\cM(S)\in O]\leq e^\epsilon\cdot \PP[\cM(S')\in O]+\delta$.
	% \begin{align*}
	% \PP[\cM(S)\in O]\leq e^\epsilon\cdot \PP[\cM(S')\in O]+\delta.
	% \end{align*}
\end{definition}
To achieve $(\epsilon, \delta)$-DP for a given function $q:\cS^n\rightarrow\cR$, we can use the Gaussian mechanism \citep{dwork2014algorithmic} $\cM=q(S)+\ub$, where $\ub$ is a zero-mean Gaussian random vector whose standard deviation is proportional to the $\ell_2$-sensitivity of the function $q$, $\Delta(q)$, which is defined as follows.

\begin{definition}[$\ell_2$-sensitivity~\citep{dwork2014algorithmic}]
	For two adjacent datasets $S,S'\in \cS^n$ differing by one element, the $\ell_2$-sensitivity $\Delta(q)$ of a function $q:\cS^n\rightarrow\cR$ is defined as 
	$\Delta(q)=\sup_{S,S'}\|q(S)-q(S')\|_2$.
% \begin{align*}
%     \Delta(q)=\sup_{S,S'}\|q(S)-q(S')\|_2.
% \end{align*}
\end{definition}
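
As an illustration, consider the mini-batch gradient query $q(S)=\nabla F_{\cB}(\btheta)$ with $|\cB|=b$, under the assumption (made here only for this example) that every $f_i$ is differentiable and $G$-Lipschitz, so that $\|\nabla f_i(\btheta)\|_2\leq G$. If $S$ and $S'$ differ only in the $j$-th example with $j\in\cB$, and $f_j'$ denotes the loss on the replaced example, then
\begin{align*}
\|q(S)-q(S')\|_2=\frac{1}{b}\big\|\nabla f_j(\btheta)-\nabla f_j'(\btheta)\big\|_2\leq\frac{2G}{b}.
\end{align*}
Hence $\Delta(q)\leq 2G/b$, and the full gradient ($b=n$) has sensitivity at most $2G/n$.
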
%\dnote{do we actually use this?}\lnote{yes, we use RDP-based analysis to get the tight privacy and utility bound}\dnote{I mean the $\ell-2$ sensitivity}\lnote{yes, I think most of the algorithms need $\ell_2$ sensitivity to get their results.}\dnote{okay - if we need to save space, we only use it explicitly in the proofs in App B, so could move there}

\noindent\textbf{R\'enyi differential privacy.} Although the notion of $(\epsilon,\delta)$-DP is widely used in output and objective perturbation methods, it suffers from loose composition and privacy-amplification by subsampling results, which makes it unsuitable for stochastic iterative learning algorithms. In this work, we make use of the notion of R\'enyi Differential Privacy (RDP) \citep{mironov2017renyi}, which is particularly useful when
the dataset is accessed by a sequence of randomized mechanisms \citep{wang2018subsampled}. 


\begin{definition}[RDP~\citep{mironov2017renyi}]\label{def:rdp}
	For $\alpha>1,\rho>0$, a randomized mechanism $\cM:\cS^n\rightarrow\cR$ satisfies $(\alpha, \rho)$-R\'enyi differential privacy (RDP) if for all adjacent datasets $S, S^\prime \in \cS^n$ differing by one element, we have  
	$D_{\alpha}\big(\cM(S)\|\cM(S^\prime)\big):=\log \EE\big[\big(\cM(S)/\cM(S^\prime)\big)^\alpha\big]/(\alpha-1)\leq \rho$, where the expectation is taken over $\cM(S^\prime)$.
	% \begin{align*}
	%  D_{\alpha}\big(\cM(S)||\cM(S^\prime)\big):=\log \EE\big[\big(\cM(S)/\cM(S^\prime)\big)^\alpha\big]/(\alpha-1)\leq \rho.
	% \end{align*}
	%where the expectation is taken over $\cM(S^\prime)$.
\end{definition}

By Definition \ref{def:rdp}, RDP measures the ratio of the probability distributions $\cM(S)$ and $\cM(S^\prime)$ by the $\alpha$-order R\'enyi divergence with $\alpha\in (1,\infty)$. As $\alpha\rightarrow\infty$, RDP reduces to $\epsilon$-DP. 

To further improve the privacy guarantee when using the Gaussian mechanism to satisfy RDP, we establish the following privacy-amplification by subsampling result, which is derived from the result of \citet{wang2018subsampled}.


\begin{lemma}
\label{lemma:GaussianM_RDP}
	Given a function $q:\cS^n\rightarrow\cR$, the Gaussian Mechanism $\cM=q(S)+\ub$, where $\ub\sim N(0,\sigma^2\Ib)$, satisfies $(\alpha,\alpha\Delta^2(q)/(2\sigma^2))$-RDP. In addition, if we apply the mechanism $\cM$ to a subset of samples using uniform sampling without replacement with sampling rate $\tau$, $\cM$ satisfies $(\alpha, 3.5\tau^2\Delta^2(q)\alpha/\sigma^2)$-RDP given $\sigma^{\prime2}=\sigma^2/\Delta^2(q)\geq 0.7$ and $\alpha\leq 2\sigma^2\log\big(1/\big(\tau\alpha(1+\sigma^{\prime2})\big)\big)/3+1$.
\end{lemma}
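
To see what the subsampling part of Lemma \ref{lemma:GaussianM_RDP} buys, one can compare its two RDP parameters directly. This is an informal reading of the lemma, valid only for $\alpha$ and $\sigma^2$ satisfying its conditions:
\begin{align*}
\frac{3.5\tau^2\Delta^2(q)\alpha/\sigma^2}{\alpha\Delta^2(q)/(2\sigma^2)}=7\tau^2.
\end{align*}
Thus the subsampled bound is tighter than the bound without subsampling whenever $\tau<1/\sqrt{7}\approx 0.38$. For instance, a sampling rate of $\tau=0.01$ shrinks the RDP parameter by a factor of $7\times 10^{-4}$ at the same noise level.
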
\begin{remark}[\textit{Comparison with moments accountant}]\label{remark:sub_amp}
    Suppose $\Delta(q)=1$. Lemma \ref{lemma:GaussianM_RDP} shows that the subsampled Gaussian mechanism achieves $(\alpha, 3.5\tau^2\alpha/\sigma^2)$-RDP as long as $\sigma^2\geq 0.7$ and $\alpha$ satisfies the stated upper bound. The moments accountant based method \citep{abadi2016deep} achieves the asymptotic privacy guarantee of $\big(\alpha, \tau^2\alpha/((1-\tau)\sigma^2)+O(\tau^3\alpha^3/\sigma^3)\big)$-RDP when $\tau$ goes to zero and $\sigma^2\geq 1$, $\alpha\leq \sigma^2\log (1/(\tau\sigma))$. In contrast to the moments accountant, our result has a closed-form bound on the privacy guarantee and a relaxed requirement on $\sigma^2$. 
\end{remark}
It is worth noting that other works \citep{mironov2019r,zhu2019poission} also study privacy amplification by subsampling. However, they consider the Poisson subsampling approach, which is different from our uniform subsampling method.
% \begin{remark}\label{remark:sub_amp}
%     Suppose $\Delta(q)=1$, Lemma \ref{lemma:GaussianM_RDP} suggests that to achieve $(\alpha, 3.5\tau^2\alpha/\sigma^2)$-RDP of the subsampled Gaussian mechanism, we need the conditions that $\sigma^2\geq 0.7$, $\alpha\leq 2\sigma^2\log(1/\tau\alpha \big(1+\sigma^2)\big)/3+1$. For the moment accountant based method \citep{abadi2016deep}, it can achieve the asymptotic privacy guarantee of $\big(\alpha, \tau^2\alpha/(1-
%     \tau)\sigma^2+O(\tau^3\alpha^3/\sigma^3)\big)$-RDP when $\tau$ goes to zero and $\sigma^2\geq 1$, $\alpha\leq \sigma^2\log (1/\tau\sigma)$. In contrast to moment accountant, our result has a closed-form bound on the privacy guarantee and a relaxed requirement of $\sigma^2$. It is worth noting that there exist some other works \citep{mironov2019r,zhu2019poission} also studying the privacy-amplification by subsampling results. However, they consider the Poisson subsampling approach, which is different from our uniform subsampling method. 
% \end{remark}

Based on Lemma \ref{lemma:GaussianM_RDP}, we can establish a strong privacy guarantee of our method in terms of RDP, and then convert it to $(\epsilon,\delta)$-DP using the following lemma.
\begin{lemma}[\citet{mironov2017renyi}]\label{lemma:RDP_to_DP}
	If a randomized mechanism $\cM: \cS^n\rightarrow\cR$ satisfies $(\alpha,\rho)$-RDP, then $\cM$ satisfies $(\rho+\log(1/\delta)/(\alpha-1),\delta)$-DP for all $\delta\in(0,1)$.
\end{lemma}
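
The following back-of-the-envelope calculation, a sketch under simplifying assumptions rather than the formal proof, indicates how Lemmas \ref{lemma:GaussianM_RDP} and \ref{lemma:RDP_to_DP} combine to give a noise level. Assume that each of $T$ iterations applies a subsampled Gaussian mechanism with sampling rate $\tau=b/n$ and sensitivity $\Delta(q)=2G/b$ (the mini-batch gradient of $G$-Lipschitz losses), and that the conditions of Lemma \ref{lemma:GaussianM_RDP} hold. Since $3.5\tau^2\Delta^2(q)=14G^2/n^2$, each iteration satisfies $\big(\alpha,14G^2\alpha/(n^2\sigma^2)\big)$-RDP. RDP parameters add up under composition \citep{mironov2017renyi}, so the $T$ iterations together satisfy $(\alpha,\rho)$-RDP with
\begin{align*}
\rho=\frac{14TG^2\alpha}{n^2\sigma^2}.
\end{align*}
Choosing $\sigma^2=14TG^2\alpha/(\beta n^2\epsilon)$ for some $\beta\in(0,1)$ gives $\rho=\beta\epsilon$, and taking $\alpha=\log(1/\delta)/\big((1-\beta)\epsilon\big)+1$ in Lemma \ref{lemma:RDP_to_DP} yields
\begin{align*}
\rho+\frac{\log(1/\delta)}{\alpha-1}=\beta\epsilon+(1-\beta)\epsilon=\epsilon,
\end{align*}
that is, $(\epsilon,\delta)$-DP. These expressions have the same form as $\sigma_0^2$ and $\alpha$ in Algorithm \ref{alg:DPSRM}.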

% We also have the following composition rule for RDP.

% \begin{lemma}[\citet{mironov2017renyi}]\label{lemma:com_post}
% 	If $k$ randomized mechanisms $\cM_i:\cS^n\rightarrow\cR$ for $i\in[k]$, satisfy $(\alpha,\rho_i)$-RDP, then their composition $\big(\cM_1(S),\ldots,\cM_k(S)\big)$ satisfies $(\alpha,\sum_{i=1}^k\rho_i)$-RDP. Moreover, the input of the $i$-th mechanism can base on the outputs of previous $(i-1)$ mechanisms.
% \end{lemma}

% Based on Lemmas \ref{lemma:GaussianM_RDP} and \ref{lemma:com_post}, we will use the RDP-based analysis to establish a strong privacy guarantee of our algorithms. Finally, we have the following relationship between RDP and $(\epsilon,\delta)$-DP.
% \begin{lemma}[\citet{mironov2017renyi}]\label{lemma:RDP_to_DP}
% 	If a randomized mechanism $\cM: \cS^n\rightarrow\cR$ satisfies $(\alpha,\rho)$-RDP, then $\cM$ satisfies $(\rho+\log(1/\delta)/(\alpha-1),\delta)$-DP for all $\delta\in(0,1)$.
% \end{lemma}

\section{Algorithm}\label{sec:alg}
Our proposed algorithm for differentially private nonconvex ERM is presented in Algorithm \ref{alg:DPSRM}. 

\begin{algorithm}[!thp]
	\caption{Differentially Private Stochastic Recursive Momentum (DP-SRM)} \label{alg:DPSRM}
	\begin{algorithmic}[1]
		\INPUT $\btheta^0,T,G,L,\gamma,\beta,n_0$, privacy parameters $\epsilon,\delta$, accuracy for the first-order stationary point $\zeta$  
		\STATE Uniformly sample $b_0$ examples without replacement indexed by $\cB_0$
		\STATE Compute $\vb^0=\nabla F_{\cB_0}(\btheta^0)$, where $\nabla F_{\cB_{0}}(\btheta^{0})=\sum_{i\in\cB_{0}}\nabla f_i(\btheta^{0})/b_0$, draw $\ub^{0}\sim N(0,\sigma^2_0\Ib_d)$ with $\sigma^2_0=14TG^2\alpha/(\beta n^2\epsilon)$, $\alpha=\log(1/\delta)/\big((1-\beta)\epsilon\big)+1$
		\STATE Release the differentially private gradient estimator $\vb_p^{0}=\vb^{0}+\ub^{0}$
		\FOR{$t=0,1,2,\ldots, T-1$}
		\STATE $\btheta^{t+1}=\btheta^{t}-\eta_t\vb^t_p$, where $\eta_t=\min\big\{\zeta/(n_0L\|\vb^t_p\|_2),1/(2n_0L)\big\}$ 
		\STATE Uniformly sample $b$ examples without replacement indexed by $\cB_{t+1}$
		\STATE Compute $\vb^{t+1}=\nabla F_{\cB_{t+1}}(\btheta^{t+1})+(1-\gamma)\big(\vb^{t}_p-\nabla F_{\cB_{t+1}}(\btheta^{t})\big)$, draw $\ub^{t+1}\sim N(0,\sigma^2\Ib_d)$ with $\sigma^2=14T\big((1-\gamma)\zeta/n_0+\gamma G\big)^2\alpha/(\beta n^2\epsilon)$, $\alpha=\log(1/\delta)/\big((1-\beta)\epsilon\big)+1$
		\STATE Release the differentially private gradient estimator $\vb_p^{t+1}=\vb^{t+1}+\ub^{t+1}$
		\ENDFOR
		\OUTPUT $\tilde \btheta$ chosen uniformly at random from $\{\btheta^t\}_{t=0}^{T-1}$
	\end{algorithmic}
\end{algorithm}

The main idea is to construct the differentially private gradient estimator $\vb^t_p$ iteratively based on the information obtained from the previous updates. We initialize $\vb^0$ to be the mini-batch stochastic gradient $\nabla F_{\cB_0}(\btheta^0)$ and inject Gaussian noise, $\ub^0$,  with covariance matrix $\sigma^2_0 \Ib_d$ (lines 2, 3), to make it differentially private. Then, we recursively update $\vb^t$ (line 7) as $\vb^t=\nabla F_{\cB_{t}}(\btheta^{t})+(1-\gamma)\big(\vb^{t-1}_p-\nabla F_{\cB_{t}}(\btheta^{t-1})\big)$, where $\nabla F_{\cB_{t}}(\btheta^t)$, $\nabla F_{\cB_{t}}(\btheta^{t-1})$ are mini-batch stochastic gradients and $\vb^{t-1}_p$ is the private gradient estimator released at the last iteration. 
The momentum parameter, $\gamma$, is used to control the decay rate of the prior information, $\vb^{t-1}_p-\nabla F_{\cB_{t}}(\btheta^{t-1})$. This is called stochastic recursive momentum \citep{cutkosky2019momentum}, which can lead to fast convergence.
After updating $\vb^t$, we again inject Gaussian noise $\ub^t$ with covariance matrix $\sigma^2\Ib_d$ (line 8) to provide differential privacy. The variances $\sigma^2_0$ and $\sigma^2$ of the Gaussian random vectors are determined by our RDP-based analysis. We choose an adaptive step size (line 5) to bound the sensitivity of the gradient estimator $\vb^t_p$, which is the key to establishing the tight privacy and utility guarantees (Section \ref{sec:proof_outline}) of our algorithm. 
%and differs from the non-private stochastic recursive momentum algorithm \citep{cutkosky2019momentum,yuan2020stochastic}.
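
For intuition, the two extreme values of the momentum parameter recover familiar estimators. This is a direct substitution into the update in line 7, not an additional result:
\begin{align*}
\gamma=1:&\quad \vb^{t}=\nabla F_{\cB_{t}}(\btheta^{t}),\\
\gamma=0:&\quad \vb^{t}=\vb^{t-1}_p+\nabla F_{\cB_{t}}(\btheta^{t})-\nabla F_{\cB_{t}}(\btheta^{t-1}).
\end{align*}
The first case is a plain mini-batch stochastic gradient, and the second is a purely recursive estimator that only accumulates gradient differences. For a general $\gamma$, the update is the convex combination
\begin{align*}
\vb^{t}&=\gamma\nabla F_{\cB_{t}}(\btheta^{t})\\
&\quad+(1-\gamma)\big(\vb^{t-1}_p+\nabla F_{\cB_{t}}(\btheta^{t})-\nabla F_{\cB_{t}}(\btheta^{t-1})\big),
\end{align*}
which can be verified by expanding the right-hand side.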

% to the current full gradient every $l$ iterations, and inject Gaussian noise $\ub^t$  with covariance matrix $\sigma^2_1 \Ib_d$ into $\vb^t$, i.e., line 3, to make it differentially private. %and release the private counterpart $\vb^t_p$. 
% In the subsequent $(l-1)$ iterations, we recursively update $\vb^t$, i.e., line 7, as $\vb^t=\nabla F_{\cB_t}(\btheta^t)-\nabla F_{\cB_t}(\btheta^{t-1})+\vb^{t-1}_p$,
% % \begin{align*}
% % \vb^t=\nabla F_{\cB_t}(\btheta^t)-\nabla F_{\cB_t}(\btheta^{t-1})+\vb^{t-1}_p,
% % \end{align*}
% where $\nabla F_{\cB_{t}}(\btheta^t)$ and $\nabla F_{\cB_{t}}(\btheta^{t-1})$ are mini-batch stochastic gradients, and $\vb^{t-1}_p$ is the private gradient estimator released at the last iteration. 
% %\dnote{what is it for iterations $1, ..., l$ ?}\lnote{I mean the subsequent $l-1$ iterations, start from $t$ such that $\mod(t,l)=0$}
% After updating $\vb^t$, we again inject Gaussian noise $\ub^t$ with covariance matrix $\sigma_2^2\Ib_d$, i.e., line 9, to make it differentially private. The variance $\sigma^2_1$ and $\sigma^2_2$ of the Gaussian random vectors are determined by our later RDP-based analysis.

\section{Main Theoretical Results}\label{sec:results}
In this section, we establish formal privacy and utility guarantees for Algorithm~\ref{alg:DPSRM}. 

% Our proof involves new
% techniques for the privacy and utility guarantees that are of general use for variance reduction-based algorithms. 

% In particular, to establish the privacy guarantee, we need to characterize the sensitivity of the gradient estimator $\vb^{t}$ for $t>0$, which is determined by the following quantity (the proof is in Appendix \ref{proof})
% \begin{align}\label{eq:sentivity}
%   &(1-\gamma)\|\nabla f_i(\btheta^{t})-\nabla f_{i}(\btheta^{t-1})+\nabla f_{i^\prime}(\btheta^{t})-\nabla f_{i^\prime}(\btheta^{t-1})\|_2\nonumber\\
%   &\quad+\gamma\|\nabla f_i(\btheta^{t})-\nabla f_{i^\prime}(\btheta^{t})\|_2.  
% \end{align}
%     For the second term in \eqref{eq:sentivity}, by $G$-Lipschitz of $f_i$ and $f_{i^\prime}$, we have $\gamma\|\nabla f_i(\btheta^{t})-\nabla f_{i^\prime}(\btheta^{t})\|_2\leq 2\gamma G$. For the first term in \eqref{eq:sentivity}, by $L$-Lipschitz gradient of $f_i$ and $f_{i^\prime}$ and the update rule, we have  $\|\nabla f_i(\btheta^{t})-\nabla f_{i}(\btheta^{t-1})+\nabla f_{i^\prime}(\btheta^{t})-\nabla f_{i^\prime}(\btheta^{t-1})\|_2\leq 2L\|\btheta^{t}-\btheta^{t-1}\|_2=2L\eta_{t-1}\|\vb^{t-1}_p\|_2$. We then choose an adaptive step size $\eta_{t-1}=\min\big\{\zeta/(L\|\vb^{
%   t-1}_p\|_2),1/(2L)\big\}$ such that $2L\eta_{t-1}\|\vb^{t-1}_p\|_2\leq 2\zeta$. As a result, the sensitivity of $\vb^{t}$ can be upper bounded by $2\big((1-\gamma)\zeta+\gamma G\big)$. 
%   Note that our proposed adaptive step size is tailored for our DP-SRM algorithm and is different from the non-private stochastic recursive momentum algorithms~\citep{cutkosky2019momentum}. 
   
%   For the utility guarantee, we provide a new analysis based on our adaptive step size and establish a new bound for the difference between the full gradient and $\vb^t_p$. The utility guarantee of our method depends on the accuracy of the first-order stationary point, $\zeta$, and the error introduced by our privacy mechanism. By solving for the smallest $\zeta$, we obtain a utility guarantee for our method.


%  $\|\nabla f_i(\btheta^{t+1})-\nabla f_{i^\prime}(\btheta^{t+1})-(1-\gamma)\big(\nabla f_i(\btheta^{t})-\nabla f_{i^\prime}(\btheta^{t}\big)\|_2$ and can be further upper bounded by $\|\nabla f_i(\btheta^{t+1})-\nabla f_{i^\prime}(\btheta^{t})\|_2+\gamma\|f_{i^\prime}(\btheta^{t})\|_2$
\begin{theorem}\label{thm:dp}
	Suppose that each component function $f_i$ is $G$-Lipschitz and has $L$-Lipschitz gradient. Given the total number of iterations $T$, the momentum parameter $\gamma$ and the accuracy for the first-order stationary point $\zeta$, for any $\delta>0$ and the privacy budget $\epsilon$, Algorithm \ref{alg:DPSRM} satisfies $(\epsilon,\delta)$-differential privacy with $\sigma_0^2=14TG^2\alpha/(\beta n^2\epsilon)$ and $\sigma^2=14T\big((1-\gamma)\zeta/n_0+\gamma G\big)^2\alpha/(\beta n^2\epsilon)$ if we have  $\alpha-1=\log(1/\delta)/\big((1-\beta)\epsilon\big)\leq 2\sigma^{\prime 2}\log\big(1/\big(\tau\alpha (1+\sigma^{\prime 2})\big)\big)/3$ with $\beta\in(0,1)$ and $\sigma^{\prime2}=\min\{b^2\sigma^2/\big(4((1-\gamma)\zeta/n_0+\gamma G)^2\big), b_0^2\sigma_0^2/(4G^2)\}\geq 0.7$, where $b_0$ and $b$ are batch sizes, and $\tau=\max\{b_0/n,b/n\}$.
\end{theorem}
\begin{remark}
According to Theorem \ref{thm:dp}, there exists a constraint on the parameter $\alpha$, which is due to the privacy-amplification by subsampling result in Lemma \ref{lemma:GaussianM_RDP}, and is similar to the constraint given by the moments accountant \citep{abadi2016deep} and other RDP-based analyses with subsampling approaches \citep{mironov2019r,zhu2019poission}. 
% This constraint can be removed if we use the analytic framework proposed by
% \citet{bassily2014private}. However, such an analysis would introduce an extra $\log(T/\delta)$ factor in the variance of the noise, which will lead to a worse utility guarantee. 
Furthermore, as mentioned in Remark \ref{remark:sub_amp}, our result relaxes the requirement on the variance $\sigma^{\prime2}$ compared with the moments-accountant-based analysis.
\end{remark}

Following previous work \citep{bassily2019private}, we can remove the constraints in Theorem \ref{thm:dp} by using a larger mini-batch size, as stated in the following corollary.
\begin{corollary}\label{coro:dp}
	Suppose we are given the total number of iterations $T$, the momentum parameter $\gamma$, and the accuracy for the first-order stationary point $\zeta$. Under the same conditions as Theorem \ref{thm:dp} on $f_i,\sigma_0^2,\sigma^2$, for any $\delta>0$ and privacy budget $\epsilon$, Algorithm \ref{alg:DPSRM} satisfies $(\epsilon,\delta)$-differential privacy if we choose $b_0^2=b^2= n^2\epsilon/T$ and $\beta=1/2$, and if $T$ is larger than $O\big(\log^4(1/\delta)/\epsilon^3\big)$.
\end{corollary}
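
To see why the batch-size choice in Corollary \ref{coro:dp} helps, one can substitute it into the definition of $\sigma^{\prime 2}$ in Theorem \ref{thm:dp}. The following is a sketch of that substitution and not a replacement for the formal proof. With $b_0^2=b^2=n^2\epsilon/T$,
\begin{align*}
\frac{b^2\sigma^2}{4\big((1-\gamma)\zeta/n_0+\gamma G\big)^2}
&=\frac{n^2\epsilon}{T}\cdot\frac{14T\alpha}{4\beta n^2\epsilon}
=\frac{7\alpha}{2\beta},\\
\frac{b_0^2\sigma_0^2}{4G^2}
&=\frac{n^2\epsilon}{T}\cdot\frac{14T\alpha}{4\beta n^2\epsilon}
=\frac{7\alpha}{2\beta},
\end{align*}
so for $\beta=1/2$ we get $\sigma^{\prime 2}=7\alpha\geq 7$, and the requirement $\sigma^{\prime 2}\geq 0.7$ holds automatically. Moreover, $\tau=b/n=\sqrt{\epsilon/T}$, so the remaining condition of Theorem \ref{thm:dp} reads
\begin{align*}
\alpha-1\leq \frac{14\alpha}{3}\log\bigg(\frac{\sqrt{T/\epsilon}}{\alpha(1+7\alpha)}\bigg).
\end{align*}
Since $\alpha-1<\alpha$, it suffices that the logarithm is at least $3/14$, i.e., $T\geq e^{3/7}\epsilon\,\alpha^2(1+7\alpha)^2$. With $\alpha=1+2\log(1/\delta)/\epsilon$, this is a lower bound on $T$ of the kind required in the corollary.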

Theorem \ref{thm:dp} and Corollary \ref{coro:dp} require that each component function $f_i$ is $G$-Lipschitz and has $L$-Lipschitz gradient. These conditions are used to derive the sensitivity of the underlying query function (i.e., the gradient estimator $\vb^t$ in Algorithm \ref{alg:DPSRM}) and thus determine the variance of the Gaussian noise. The $G$-Lipschitz condition has been widely assumed in the differential privacy literature \citep{abadi2016deep,wang2017differentially,jayaraman2018distributed,bassily2019private}, and the $L$-Lipschitz gradient condition has also been made in several previous works~\citep{zhang2017efficient,feldman2020private}. In practice, we can use the clipping technique \citep{abadi2016deep} to ensure that at each iteration, $\|\nabla f_i(\btheta^t)\|_2\leq C_1$ and $\|\nabla f_i(\btheta^t)-\nabla f_i(\btheta^{t-1})\|_2\leq C_2$, where $C_1,C_2$ are predefined constants. As a result, we can guarantee that the sensitivity of $\vb^t$ is bounded by $2\big((1-\gamma)C_2+\gamma C_1\big)/b$ (see \eqref{eq:sentivity}). Then, we can replace $G$ and $\zeta/n_0$ with $C_1$ and $C_2$, respectively, in Algorithm \ref{alg:DPSRM} to establish the same privacy guarantee.


The following theorem shows the utility guarantee and the gradient complexity, which is the total number of the stochastic gradients we need to estimate during the training process, of Algorithm \ref{alg:DPSRM}.
\begin{theorem}\label{thm:utility}
% 	Suppose that each component function $f_i$ has $L$-Lipschitz continuous gradient. 
	Under the same conditions as Theorem \ref{thm:dp} on $f_i,\sigma^2,\sigma_0^2,\sigma^{\prime 2},\alpha$, suppose we choose the
	number of iterations $T=C_1(n\epsilon LD_F)^{4/3}/\big( G^{8/3}(d\log(1/\delta))^{2/3}\big)$, where $D_F=F(\btheta^0)-F(\btheta^*)$ and $\btheta^*$ is a global minimizer of $F$, the accuracy for the first-order stationary point  $\zeta=C_2\big(GLD_Fd\log(1/\delta)\big)^{1/3}/(n\epsilon)^{2/3}$, batch sizes $b_0=C_3G^3/(\zeta LD_F)$, $b=C_4G/(n_0\zeta)$, $n_0=LD_F/G^2$, the momentum parameter $\gamma^2=C_5\zeta^2/(n_0^2G^2)$, and $ n\epsilon\geq C_{6}\max\{G^{8}\log^{2}(1/\delta)/(LD_Fd)^{4},\sqrt{G^4d\log(1/\delta)}/(LD_F)\}$. Then the output $\tilde \btheta$ of Algorithm \ref{alg:DPSRM} satisfies
 % 	\begin{align*}
	% \EE\|\nabla F(\tilde \btheta)\|_2 \leq C_7\big(GLD_Fd\log(1/\delta)\big)^{1/3}/(n\epsilon)^{\frac{2}{3}},
	% \end{align*}
	\begin{align*}
	\EE\|\nabla F(\tilde \btheta)\|_2 \leq C_7\bigg(\frac{\sqrt{GLD_Fd\log(1/\delta)}}{n\epsilon}\bigg)^{\frac{2}{3}},
	\end{align*}
% 	\begin{align*}
% 	\EE\|\nabla F(\tilde \btheta)\|_2 \leq C_6G^{1/2}\big(LD_Fd\log(1/\delta)\big)^{1/4}/(n\epsilon)^{1/2},
% 	\end{align*}
	where $\{C_i\}_{i=1}^7$ are absolute constants, and the expectation is taken over all the randomness of the algorithm, i.e., the Gaussian noise and the subsampled gradients. Since $T= O\big((n\epsilon LD_F)^{4/3}/\big( G^{8/3}(d\log(1/\delta))^{2/3}\big)\big)$ and $b_0=b=O\big(G^{8/3}(n\epsilon)^{2/3}/\big((LD_F)^{4/3}(d\log(1/\delta))^{1/3}\big)\big)$, the total gradient complexity of Algorithm \ref{alg:DPSRM} is $O\big((n\epsilon)^{2}/(d\log(1/\delta))\big)$.
% 	$\big(2(T-1)b+b_0\big)$, which is at the order of $O\big((n\epsilon)^{3/2}\big)$.
	\end{theorem}
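
As a sanity check on the gradient complexity stated in Theorem \ref{thm:utility}, one can multiply the orders of $T$ and $b$ directly. Each iteration evaluates mini-batch gradients on $b$ examples at two points, so the total cost is of order $Tb$; we treat the factor of two and the initial batch $b_0$ as lower-order bookkeeping in this sketch:
\begin{align*}
Tb&=O\bigg(\frac{(n\epsilon LD_F)^{4/3}}{G^{8/3}\big(d\log(1/\delta)\big)^{2/3}}\cdot\frac{G^{8/3}(n\epsilon)^{2/3}}{(LD_F)^{4/3}\big(d\log(1/\delta)\big)^{1/3}}\bigg)\\
&=O\bigg(\frac{(n\epsilon)^{2}}{d\log(1/\delta)}\bigg).
\end{align*}
The factors $G^{8/3}$ and $(LD_F)^{4/3}$ cancel, which is why the final complexity depends only on $n\epsilon$ and $d\log(1/\delta)$.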
	
\begin{remark}[\textit{Comparison with existing methods}]
According to Theorem \ref{thm:utility}, our method can achieve the following utility guarantee
$O\big(\big(GLD_Fd\log(1/\delta)\big)^{1/3}/(n\epsilon)^{\frac{2}{3}}\big)$.
% \begin{align*}
%   O\bigg(\bigg(\frac{\sqrt{GLD_Fd\log(1/\delta)}}{n\epsilon}\bigg)^{\frac{2}{3}}\bigg).
% \end{align*}
This result improves on the previous best-known result for differentially private nonconvex optimization methods \citep{wang2017differentially}. Moreover, their method is based on gradient descent, which is computationally very expensive in large-scale machine learning problems. In contrast, the gradient complexity of our method is $O\big((n\epsilon)^{2}/\big(d\log(1/\delta)\big)\big)$.
% \begin{align*}
%     O\bigg(\frac{(n\epsilon)^{2}}{d\log(1/\delta)}\bigg).
% \end{align*}
This is smaller than the $O(n^2)$ gradient complexity of \citet{zhang2017efficient} and the $O\big(n^2\epsilon/(d\log(1/\delta))^{1/2}\big)$ gradient complexity of \citet{wang2017differentially} when $d$ is large. Compared with Private SpiderBoost \citep{arora2022faster}, our method has better gradient complexity when $d\geq O\big(\sqrt{n}\epsilon^2/\log(1/\delta)\big)$.
\end{remark}

Theorem \ref{thm:utility} shows that our method only requires the computation of mini-batch gradients with batch size on the order of $O\big((n\epsilon)^{2/3}/(d\log(1/\delta))^{1/3}\big)$ (ignoring the dependence on other parameters). Therefore, our method is more scalable than existing differentially private stochastic variance-reduced algorithms, such as DP-SVRG \citep{wang2017differentially} for convex optimization and Private SpiderBoost \citep{arora2022faster} for nonconvex optimization, which often require the periodic computation of a checkpoint gradient with a very large batch size (the full batch in DP-SVRG and Private SpiderBoost).


% \begin{remark}
% 	$O\big( G^{1/2}(LD_Fd\log(1/\delta))^{1/4}/(n\epsilon)^{1/2}\big)$ utility guarantee can be achieved by our method. This result matches the best known result for differentially private nonconvex optimization method \citep{wang2017differentially}. However, their method is based on gradient descent, which is computationally very expensive in large-scale machine learning problems. Compared with the stochastic gradient method proposed by \citet{zhang2017efficient}, our utility guarantee is better by a factor of $O(\log(n/\delta))$. Furthermore, the gradient complexity of our method, i.e.,  $O\big((n\epsilon)^{3/2}/d^{3/4}+(n\epsilon)^{1/2}/d^{1/4}\big)$, is smaller than $O(n^2)$ gradient complexity provided by \citet{zhang2017efficient} and $O\big(n^2\epsilon/d^{1/2}\big)$ gradient complexity provided by \citet{wang2017differentially}. More importantly, our method only requires the computation of a large batch gradient with batch size $b_0=O\big((n\epsilon)^{1/2}/d^{1/4}\big)$ at the beginning. Therefore, our method is more scalable than existing differentially private stochastic variance reduced algorithms, such as DP-SVRG \citep{wang2017differentially} designed for convex optimization, which often require the periodic computation of the checkpoint gradient with a giant batch size (full batch in DP-SVRG).  
% \end{remark}

\section{Proof Outline of the Main Results}\label{sec:proof_outline}
In this section, we present the proof outline of the main results in Section \ref{sec:results}. 
Our proof involves new
techniques for the privacy and utility guarantees that are of general use for variance reduction-based algorithms. Detailed proofs can be found in Section B of the Appendix. 
% which demonstrates the novel ideas and challenges in our privacy and utility analyses.

\subsection{Privacy Guarantee}
According to Algorithm \ref{alg:DPSRM}, the mechanism at the $t$-th iteration is $\cM_t$, which is a composition of $t+1$ Gaussian mechanisms: $\cG_{0},\ldots,\cG_{t}$, where $\cG_{0}=\nabla F_{\cB_0}(\btheta^0)+\ub^{0}$ and $\cG_t=\nabla F_{\cB_t}(\btheta^t)-(1-\gamma)\nabla F_{\cB_t}(\btheta^{t-1})+\ub^t$. Our goal is to show that $\cM_t$ is differentially private. For the given dataset $S$, we use $S^\prime$ to denote its neighboring dataset with one different example indexed by $i^\prime$.

There are two main challenges in providing a tight privacy analysis. The first one is to deal with the subsampled mechanisms $\{\cG_i\}_{i=0}^{T-1}$. The second one is to control the sensitivity of $\cG_t$ when $t>0$. The first challenge can be addressed by our privacy-amplification by subsampling result (Lemma \ref{lemma:GaussianM_RDP}), which gives us a tight closed-form bound on the privacy guarantee. We can overcome the second challenge by using an adaptive stepsize, enabling us to use a small amount of random noise to achieve differential privacy.

According to Algorithm \ref{alg:DPSRM}, $\cG_t$ is the application of the following Gaussian mechanism $\tilde \cG_t$ to a subset of uniformly sampled examples, indexed by $\cB_t$
\begin{align*}
    \tilde \cG_t=
    \left\{
	\begin{array} {ll}
	\frac{1}{b_0}\sum_{i=1}^n\nabla f_i(\btheta^0)+\ub^0, & t=0\\
		\frac{1}{b}\sum_{i=1}^n\big(\nabla f_i(\btheta^t)-\phi\nabla f_i(\btheta^{t-1})\big)+\ub^t, & t>0,
	\end{array}
	\right.
\end{align*}
where $\phi=1-\gamma$. For $\tilde \qb_0=\sum_{i=1}^n\nabla f_i(\btheta^0)/b_0$ in $\tilde \cG_0$, the sensitivity $\Delta(\tilde \qb_0)$ is determined by
\begin{align*}
   \|\tilde\qb_0(S)-\tilde\qb_0(S^\prime)\|_2\leq\frac{1}{b_0}\|\nabla f_i(\btheta^{0})-\nabla f_{i^\prime}(\btheta^{0})\|_2\leq \frac{2G}{b_0}, 
\end{align*}
where the last inequality is due to the $G$-Lipschitz continuity of each component function. For $\tilde \qb_t=\sum_{i=1}^n\nabla f_i(\btheta^t)/b-(1-\gamma)\sum_{i=1}^n\nabla f_i(\btheta^{t-1})/b$ in $\tilde \cG_t$ when $t>0$, the sensitivity $\Delta(\tilde \qb_t)=\|\tilde \qb_t(S)-\tilde \qb_t(S^\prime)\|_2$ is upper bounded by 
\begin{align}\label{eq:sentivity}
    &\frac{1-\gamma}{b}\|\nabla f_i(\btheta^{t})-\nabla f_{i}(\btheta^{t-1})-\nabla f_{i^\prime}(\btheta^{t})+\nabla f_{i^\prime}(\btheta^{t-1})\|_2\nonumber\\
    &\quad+\frac{\gamma}{b}\|\nabla f_i(\btheta^{t})-\nabla f_{i^\prime}(\btheta^{t})\|_2.
\end{align}
Therefore, we have
\begin{align*}
  \|\tilde\qb_t(S)-\tilde\qb_t(S^\prime)\|_2&\leq \frac{2L(1-\gamma)}{b}\|\btheta^t-\btheta^{t-1}\|_2+\frac{2\gamma G}{b}\\
  &=\frac{2L(1-\gamma)}{b}\eta_{t-1}\|\vb^{t-1}_p\|_2+\frac{2\gamma G}{b}\\
  &\leq \frac{2(1-\gamma)\zeta}{n_0b}+\frac{2\gamma G}{b},
\end{align*}
where the first inequality is due to the $L$-Lipschitz gradient and the $G$-Lipschitz continuity of each component function. The last inequality comes from the adaptive stepsize $\eta_t=\min\big\{\zeta/(n_0L\|\vb^t_p\|_2),1/(2n_0L)\big\}$. Note that the proposed adaptive stepsize $\eta_t$ is the key to controlling the sensitivity of $\tilde \qb_t$. If we choose a fixed stepsize such as $\eta_t=1/(2L)$, the sensitivity of $\tilde \qb_t$ will be on the order of $O(G^2/b)$, which requires much larger random noise to achieve differential privacy and thus deteriorates the utility of our method.
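
The two steps of this sensitivity bound can be spelled out as follows; this is a sketch under the same assumptions as Theorem \ref{thm:dp}, treating $\btheta^t$ and $\btheta^{t-1}$ as fixed. Writing $\phi=1-\gamma$ and splitting $\nabla f_i(\btheta^t)=\phi\nabla f_i(\btheta^t)+\gamma\nabla f_i(\btheta^t)$, the difference between the query on $S$ and on $S^\prime$ is
\begin{align*}
\tilde\qb_t(S)-\tilde\qb_t(S^\prime)&=\frac{\phi}{b}\big(\nabla f_i(\btheta^{t})-\nabla f_{i}(\btheta^{t-1})\big)\\
&\quad-\frac{\phi}{b}\big(\nabla f_{i^\prime}(\btheta^{t})-\nabla f_{i^\prime}(\btheta^{t-1})\big)\\
&\quad+\frac{\gamma}{b}\big(\nabla f_i(\btheta^{t})-\nabla f_{i^\prime}(\btheta^{t})\big).
\end{align*}
Each of the first two terms has norm at most $\phi L\|\btheta^t-\btheta^{t-1}\|_2/b$, and the third has norm at most $2\gamma G/b$. For the displacement, the update rule gives $\|\btheta^t-\btheta^{t-1}\|_2=\eta_{t-1}\|\vb^{t-1}_p\|_2$, and the adaptive stepsize yields
\begin{align*}
\eta_{t-1}\|\vb^{t-1}_p\|_2=\min\Big\{\frac{\zeta}{n_0L},\frac{\|\vb^{t-1}_p\|_2}{2n_0L}\Big\}\leq\frac{\zeta}{n_0L},
\end{align*}
so the displacement never exceeds $\zeta/(n_0L)$, no matter how large the noisy estimator $\vb^{t-1}_p$ is.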

According to Lemma \ref{lemma:GaussianM_RDP}, if the noise $\ub^0$ and $\ub^t$ satisfy $\sigma^2_{0}=14T\alpha G^2/(\beta n^{2}\epsilon)$ and $\sigma^2=14T\alpha\big((1-\gamma)\zeta/n_0+\gamma G\big)^2/(\beta n^2\epsilon)$,
the Gaussian mechanism $\tilde \cG_{t}$ satisfies $\big(\alpha,\beta\epsilon n^2/\big(7b_0^2T\big)\big)$-RDP for $t=0$ (and the same bound with $b_0$ replaced by $b$ for $t>0$), and the privacy-amplification by subsampling result shows that $\cG_t$ satisfies $(\alpha,\beta\epsilon/T)$-RDP. Therefore, by the composition rule of RDP \citep{mironov2017renyi}, after $T^\prime$ iterations, Algorithm \ref{alg:DPSRM} satisfies $(\alpha,\beta T^\prime\epsilon/T)$-RDP. According to Lemma \ref{lemma:RDP_to_DP} and $\alpha=\log(1/\delta)/\big((1-\beta)\epsilon\big)+1$, we have that after $T^\prime$ iterations, Algorithm \ref{alg:DPSRM} satisfies $(T^\prime\epsilon/T,\delta)$-DP.
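
The RDP level quoted for $\tilde\cG_t$ can be checked by direct substitution. The sketch below assumes two standard facts \citep{mironov2017renyi}: a Gaussian mechanism with $\ell_2$-sensitivity $\Delta$ and noise variance $\sigma^2$ satisfies $\big(\alpha,\alpha\Delta^2/(2\sigma^2)\big)$-RDP, and $(\alpha,\rho)$-RDP implies $\big(\rho+\log(1/\delta)/(\alpha-1),\delta\big)$-DP. We take Lemma \ref{lemma:RDP_to_DP} to be of the latter form. For $t=0$, the sensitivity $\Delta=2G/b_0$ gives
\begin{align*}
\frac{\alpha\Delta^2}{2\sigma_0^2}=\frac{4\alpha G^2}{b_0^2}\cdot\frac{\beta n^2\epsilon}{28T\alpha G^2}=\frac{\beta\epsilon n^2}{7b_0^2T},
\end{align*}
and the same computation for $t>0$ with $\Delta=2\big((1-\gamma)\zeta/n_0+\gamma G\big)/b$ gives $\beta\epsilon n^2/(7b^2T)$. After all $T$ iterations, the composed mechanism is $(\alpha,\beta\epsilon)$-RDP, and with $\alpha-1=\log(1/\delta)/\big((1-\beta)\epsilon\big)$ the conversion yields
\begin{align*}
\beta\epsilon+\frac{\log(1/\delta)}{\alpha-1}=\beta\epsilon+(1-\beta)\epsilon=\epsilon,
\end{align*}
which matches the $(\epsilon,\delta)$-DP guarantee of Theorem \ref{thm:dp}.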

\subsection{Utility Guarantee}
According to the definition of $\tilde \btheta$, we have 
\begin{align*}
    \EE\|\nabla F(\tilde \btheta)\|_2&=\frac{1}{T}\sum_{t=0}^{T-1}\EE\|\nabla F(\btheta^t)\|_2\\
    &\leq \frac{1}{T}\sum_{t=0}^{T-1}\EE\big\|\vb^t_p\big\|_2+\frac{1}{T}\sum_{t=0}^{T-1}\EE\big\|\nabla F(\btheta^t)-\vb^t_p\big\|_2,
\end{align*}
 where the expectation is taken over all the randomness of the algorithm. The key challenge in establishing a tight utility guarantee is to derive tight upper bounds for $\sum_{t=0}^{T-1}\EE\big\|\vb^t_p\big\|_2/T$ and $\sum_{t=0}^{T-1}\EE\big\|\nabla F(\btheta^t)-\vb^t_p\big\|_2/T$ when we have adaptive stepsize $\eta_t$ and the random noise $\ub^t$ in $\vb_p^t$.
% Next, we derive new upper bounds for these two terms by taking into account the effect of the adaptive stepsize $\eta_t$ and the random noise $\ub^t$ to achieve differential privacy. 

First of all, by taking into account the adaptive stepsize $\eta_t$, we can upper bound the term $\sum_{t=0}^{T-1}\EE\big\|\vb^t_p\big\|_2/T$ as follows
\begin{align*}
 \frac{4n_0LD_F}{T\zeta}+\frac{1}{T\zeta}\sum_{t=0}^{T-1}\EE\big\|\nabla F(\btheta^t)-\vb^t_p\big\|_2^2+2\zeta,
\end{align*}
where $D_F=F(\btheta^0)- F(\btheta^*)$. Furthermore,  we can obtain the upper bound for $\sum_{t=0}^{T-1}\EE\big\|\vb^t_p-\nabla F(\btheta^t)\big\|_2^2/T$ as follows
\begin{align*}
     \frac{2(1-\gamma)^2\zeta^2}{n_0^2\gamma b}+\frac{2\gamma G^2}{b}+\frac{G^2}{T\gamma b_0}+\frac{Td\sigma^2+d\sigma_0^2}{T\gamma},
\end{align*}
 where the first term is determined by $\eta_t$, and the last term is determined by the random noise $\ub^t$ in $\vb_p^t$. The last term in this bound is dominated by $d\sigma^2/\gamma$, which validates the necessity of using the adaptive stepsize to control the sensitivity of $\vb^t$ and thus enable a small $\sigma^2$.
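
To make the comparison of the two noise contributions explicit, the last term can be split and the noise variances from Theorem \ref{thm:dp} substituted. This is a back-of-the-envelope check, not part of the formal proof:
\begin{align*}
\frac{Td\sigma^2+d\sigma_0^2}{T\gamma}&=\frac{d\sigma^2}{\gamma}+\frac{d\sigma_0^2}{T\gamma},\\
\frac{d\sigma^2/\gamma}{d\sigma_0^2/(T\gamma)}&=\frac{T\big((1-\gamma)\zeta/n_0+\gamma G\big)^2}{G^2}.
\end{align*}
Under the parameter choices of Theorem \ref{thm:utility}, taking the positive root $\gamma=\sqrt{C_5}\,\zeta/(n_0G)$ with the constants $C_1,C_2,C_5$ of that theorem, we have $\zeta/n_0=\gamma G/\sqrt{C_5}$ and
\begin{align*}
T\gamma^2=\frac{C_5T\zeta^2}{n_0^2G^2}=C_5\cdot\frac{C_1C_2^2(LD_F)^2/G^2}{(LD_F)^2/G^2}=C_1C_2^2C_5.
\end{align*}
The ratio above therefore equals $C_1C_2^2C_5\big(1+(1-\gamma)/\sqrt{C_5}\big)^2$, which is bounded below by the constant $C_1C_2^2C_5$. Hence the initial-noise term $d\sigma_0^2/(T\gamma)$ is at most a constant multiple of $d\sigma^2/\gamma$.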

Finally, combining these two new bounds and plugging in the parameter values from Theorem \ref{thm:utility}, we obtain
\begin{align*}
    \EE\|\nabla F(\tilde \btheta)\|_2\leq C_1\zeta+C_2\frac{\sqrt{GLD_Fd\log(1/\delta)}}{n\epsilon\sqrt{\zeta}}.
\end{align*}
Choosing the smallest $\zeta$ for which the second term is at most $C_1\zeta$, we obtain $\zeta=(GLD_Fd\log(1/\delta))^{1/3}/(n\epsilon C_1/C_2)^{2/3}$. Thus we have $\EE\|\nabla F(\tilde \btheta)\|_2\leq C_3\zeta$, where $C_1,C_2,C_3$ are absolute constants.
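
The choice of $\zeta$ can be read off by balancing the two terms of this bound. Writing $A=\sqrt{GLD_Fd\log(1/\delta)}/(n\epsilon)$ as shorthand (a symbol introduced only for this sketch), the second term is at most $C_1\zeta$ exactly when
\begin{align*}
C_2\frac{A}{\sqrt{\zeta}}\leq C_1\zeta
&\iff\zeta^{3/2}\geq\frac{C_2}{C_1}A\\
&\iff \zeta\geq\Big(\frac{C_2}{C_1}A\Big)^{2/3}=\frac{\big(GLD_Fd\log(1/\delta)\big)^{1/3}}{(n\epsilon C_1/C_2)^{2/3}}.
\end{align*}
Taking equality gives the stated value of $\zeta$, and both terms are then equal to $C_1\zeta$, so $\EE\|\nabla F(\tilde\btheta)\|_2\leq 2C_1\zeta$ and one may take $C_3=2C_1$.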

\section{Experiments}\label{sec:exp}
 This section presents experimental results that evaluate our method on different nonconvex ERM problems and datasets. All experiments are implemented in
PyTorch 1.2.0 with Python 3.7.6 on a local machine with Intel Xeon 4214 CPUs and an NVIDIA GeForce RTX 2080Ti GPU (11 GB of GPU memory).  
% For the distributed learning setting, we randomly split the training dataset into $m$ subsets, where $m$ is the number of parties, with the same number of training examples. Details on our MPC implementation are in Appendix~\ref{mpc_implementation}.
% This section presents results from experiments that evaluate our method's performance on different nonconvex ERM problems and different datasets. All experiments are implemented in
% Pytorch 1.2.0 on a local server with the NVIDIA GeForce RTX 2080Ti GPU.  


\subsection{Nonconvex Logistic Regression}
% The first nonconvex ERM problem we consider is the binary logistic regression problem with a nonconvex regularizer \citep{reddi2016fast}
We first consider the binary logistic regression problem with a nonconvex regularizer \citep{reddi2016fast}
% i.e., $\min_{\btheta\in\RR^d}n^{-1}\sum_{i=1}^ny_i\log\phi(\xb_i^\top\btheta)+(1-y_i)\log\big[1-\phi(\xb_i^\top\btheta)\big]+\lambda\sum_{i=1}^d\theta_j^2/(1+\theta_j^2)$,
\begin{align*}
    &\textstyle{\min_{\btheta\in\RR^d}}-\frac{1}{n}\textstyle{\sum_{i=1}^n}\big(y_i\log\phi(\xb_i^\top\btheta)+(1-y_i)\log\big[1-\phi(\xb_i^\top\btheta)\big]\big)\\
    &\qquad\qquad\qquad+\lambda\sum_{j=1}^d\theta_j^2/(1+\theta_j^2),
\end{align*}
% \begin{align*}
%     \min_{\btheta\in\RR^d}&\frac{1}{n}\sum_{i=1}^ny_i\log\phi(\xb_i^\top\btheta)+(1-y_i)\log\big[1-\phi(\xb_i^\top\btheta)\big]+\lambda\sum_{i=1}^d\theta_j^2/(1+\theta_j^2),
% \end{align*}
where $\phi(x)=1/\big(1+\exp(-x)\big)$ is the sigmoid function, $\theta_j$ is the $j$-th coordinate of $\btheta$, and $\lambda>0$ is the regularization parameter. We set $\lambda=0.001$ in this experiment. 
Here, we consider two commonly used binary classification benchmark datasets: the \textit{a9a} dataset, which contains 32561 training examples, 16281 test examples, and 123 features, and the \textit{ijcnn1} dataset, which contains 49990 training examples, 91701 test examples, and 22 features. We report the results for the \textit{a9a} dataset in the main paper and defer the results for the \textit{ijcnn1} dataset to Appendix A.
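
The regularizer is what makes this objective nonconvex, which can be verified coordinate-wise. For $r(\theta)=\theta^2/(1+\theta^2)$, a notation introduced here only for illustration,
\begin{align*}
r^\prime(\theta)=\frac{2\theta}{(1+\theta^2)^2},\qquad r^{\prime\prime}(\theta)=\frac{2-6\theta^2}{(1+\theta^2)^3},
\end{align*}
so $r^{\prime\prime}(\theta)<0$ whenever $\theta^2>1/3$, and $r$ is not convex. At the same time, $|r^\prime(\theta)|\leq 3\sqrt{3}/8$ (attained at $\theta^2=1/3$) and $|r^{\prime\prime}(\theta)|\leq 2$ (attained at $\theta=0$), so the regularizer is itself Lipschitz with a Lipschitz gradient, in line with the assumptions of Theorem \ref{thm:dp}.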
%Section~\ref{sec:add_experiments} in the supplemental material.
% Here, we report results for the \textit{a9a} dataset, a commonly-used binary classification benchmark with 32561 training examples, 16281 test examples, and 123 features, and for the \textit{ijcnn1} dataset with 49990 training examples, 91701 test examples, 22 features.

\shortsection{Baseline methods} We compare our method (DP-SRM) with random round private stochastic gradient descent (RRPSGD) proposed by \citet{zhang2017efficient}, differentially private gradient descent (DP-GD) proposed by \citet{wang2017differentially}, and differentially private adaptive gradient descent (DP-AGD) proposed by \citet{lee2018concentrated}. We do not compare our method with Private SpiderBoost \citep{arora2022faster} since it is unclear how to practically determine the privacy guarantee-related parameters of their algorithm.
%%!\vspace{-2pt}
% \noindent\textbf{Baseline methods.} We compare our method (DP-SRM) with random round private stochastic gradient descent (RRPSGD) \citep{zhang2017efficient} , differentially private gradient descent (DP-GD) \citep{wang2017differentially}, and differentially private adaptive gradient descent (DP-AGD) \citep{lee2018concentrated}.

%%!\vspace{-2pt}
\shortsection{Gradient clipping and privacy tracking}
We use the gradient clipping technique of \citet{abadi2016deep} to ensure that at the $t$-th iteration of Algorithm \ref{alg:DPSRM}, $\|\nabla f_i(\btheta^t)\|_2$ and $\|\nabla f_i(\btheta^t)-\nabla f_i(\btheta^{t-1})\|_2$ are upper bounded by some predefined values $C_1$ and $C_2$, respectively. This ensures that the sensitivity of the gradient estimator $\vb^t$ is upper bounded by $2\big((1-\gamma)C_2+\gamma C_1\big)/b$ (see equation \eqref{eq:sentivity}) and gives us the desired privacy protection. At each iteration, we add Gaussian noise with variance $\sigma^2$, keep track of the RDP according to Lemma \ref{lemma:GaussianM_RDP}, and convert it to $(\epsilon,\delta)$-DP according to Lemma \ref{lemma:RDP_to_DP}.
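
For concreteness, the clipping step can be written as a rescaling. The operator name $\mathrm{clip}$ and the generic vector $\mathbf{g}$ are our notation for the standard norm clipping of \citet{abadi2016deep}, with the convention that a zero vector is left unchanged:
\begin{align*}
\mathrm{clip}(\mathbf{g},C)&=\mathbf{g}\cdot\min\Big\{1,\frac{C}{\|\mathbf{g}\|_2}\Big\},\\
\|\mathrm{clip}(\mathbf{g},C)\|_2&=\min\{\|\mathbf{g}\|_2,C\}\leq C.
\end{align*}
Applying it with threshold $C_1$ to each per-example gradient and with threshold $C_2$ to each per-example gradient difference bounds the two norms in \eqref{eq:sentivity} by $2C_1$ and $2C_2$, respectively, which gives a sensitivity of at most $2\big((1-\gamma)C_2+\gamma C_1\big)/b$ without relying on the constants $G$ and $L$.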

%%!\vspace{-2pt}
\shortsection{Parameters} For all the algorithms, the step size is tuned around the theoretical values to give the fastest convergence using grid search. For our method, we tune the batch size $b$ by searching the grid $\{50,100,200\}$. We set $C_1=1,C_2=0.01$ and $\gamma=C_2$.  We choose $\epsilon\in\{0.2,0.5\}$ and $\delta=10^{-5}$.

%%!\vspace{-2pt}
\shortsection{Results}
Due to the randomized nature of all the algorithms, the experimental results are obtained by averaging over 30 runs. Figure~\ref{figure:a9a} shows the objective function value and the gradient norm of different algorithms for privacy budgets $\epsilon\in\{0.2,0.5\}$ on the \textit{a9a} dataset. We also report the 95\% confidence interval of these results. We can see from the plots that our DP-SRM algorithm outperforms the other three baseline algorithms in terms of the objective loss, gradient norm, and convergence rate by a large margin. Table~\ref{table:a9a} summarizes the test error of different algorithms as well as the CPU time (in seconds) of the training process. The results also corroborate the advantages of our method in terms of accuracy and efficiency.


\begin{table*}[!t]
	\caption{Comparison of different algorithms on \textit{a9a} dataset when $\epsilon\in\{0.2,0.5\}$ and $\delta=10^{-5}$. We use the STORM algorithm \citep{cutkosky2019momentum} as the non-private baseline.}
	\label{table:a9a}
	\centering
	\begin{tabular}{l|c|c|c|c|c|c}
		\toprule 
		Privacy&  Non-private& \multirow{2}{*}{Method}&\multirow{2}{*}{Test Error} &Data& \multirow{2}{*}{CPU time (s)}  & \multirow{2}{*}{Gradient Norm}\\Budget& Baseline && & Passes &&\\
		%\cline{2-11}
		\midrule
		\multirow{4}{*}{$\epsilon=0.2$}        & \multirow{4}{*}{0.3346} & DP-GD &    0.4155 (0.0107) &20  &  1.245   &    0.0953 (0.0212)  \\&&DP-AGD&0.3713 (0.0043) &360&96.21&0.0437 (0.0020)\\&&RRPSGD&0.4019 (0.0033) &8&39.61&0.2175 (0.0116)\\&(0.007)&\textbf{DP-SRM}&\textbf{0.3579 (0.0009)}&\textbf{4}&\textbf{0.6007}&\textbf{0.0528 (0.0042)}\\
		\midrule
		\multirow{4}{*}{$\epsilon=0.5$ }     & \multirow{4}{*}{0.3346}  &DP-GD & 0.3859 (0.0057)  &20 &   1.261  &     0.0866 (0.0129) \\&&DP-AGD&0.3627 (0.0038) &365&95.45 &0.0402 (0.0022)\\&&RRPSGD&0.3861 (0.0028)&10&52.32 &0.1454 (0.0126)\\&(0.007)&\textbf{DP-SRM}&\textbf{0.3506 (0.0011)}&\textbf{5}&\textbf{0.7383 }&\textbf{0.0502 (0.0061)}\\
		\bottomrule
	\end{tabular}
\end{table*}

% \begin{table*}[!th]
%  \small
% 	\caption{Comparison of different algorithms on \textit{ijcnn1} dataset under different privacy budgets $\epsilon\in\{0.2,0.5\}$ and $\delta=10^{-5}$. Note that the non-private baseline denotes the test error of the non-private STORM algorithm \citep{cutkosky2019momentum}. }
% 	\label{table:ijcnn1}
% 	\centering
% 	\begin{tabular}{l|c|c|c|c|c|c}
% 		\toprule 
% 		Privacy&  Non-private& \multirow{2}{*}{Method}&\multirow{2}{*}{Test Error} &Data& \multirow{2}{*}{CPU time}  & \multirow{2}{*}{Gradient Norm}\\Budget& Baseline && & Passes &&\\
% 		%\cline{2-11}
% 		\midrule
% 		\multirow{4}{*}{$\epsilon=0.2$}        & \multirow{4}{*}{0.2096} & DP-GD &    0.3160 (0.0120) &20  &  0.5180   &    0.0184 (0.0024)  \\&&DP-AGD&0.2645 (0.0044) &346&90.05&0.0133 (0.0018)\\&&RRPSGD&0.3110 (0.0106) &8&47.64&0.0175 (0.0023)\\&(0.002)&\textbf{DP-SRM}&\textbf{0.2503 (0.0090)}&\textbf{4}&\textbf{0.4748 }&\textbf{0.0117 (0.0008)}\\
% 		\midrule
% 		\multirow{4}{*}{$\epsilon=0.5$ }     & \multirow{4}{*}{0.2096}  &DP-GD & 0.2717 (0.0081)  &20 &  0.4990  &     0.0171 (0.0024) \\&&DP-AGD&0.2416 (0.0029) &365&94.28 &0.0397 (0.0025)\\&&RRPSGD&0.3033 (0.0110)&10&59.06 &0.0160 (0.0018)\\&(0.002)&\textbf{DP-SRM}&\textbf{0.2341 (0.0042)}&\textbf{5}&\textbf{0.4368 }&\textbf{0.0082 (0.0005)}\\
% % 		\midrule
% 		\hline
% 	\end{tabular}
% \end{table*}







\begin{figure*}[t!]%
	\centering
	%\vspace{.3in}
	\subfigure[$\epsilon=0.2$]{
		\label{fig1:subfig:1.a} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/a9a_02_err_epoch_CI.pdf}}
	\subfigure[$\epsilon=0.5$]{
		\label{fig1:subfig:1.b} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/a9a_05_err_epoch_CI.pdf}}
			\subfigure[$\epsilon=0.2$]{
		\label{fig1:subfig:1.c} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/a9a_02_norm_epoch_CI.pdf}}
	\subfigure[$\epsilon=0.5$]{
		\label{fig1:subfig:1.d} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/a9a_05_norm_epoch_CI.pdf}}

% 	\vspace{-7pt}
	\caption{Results for nonconvex logistic regression on the \textit{a9a} dataset. (a), (b) show the objective loss versus the number of epochs; (c), (d) show the gradient norm versus the number of epochs.} \label{figure:a9a}%% label for entire figure
\end{figure*}

% \begin{figure*}[!th]%
% 	\centering
% 	%\vspace{.3in}
% 	\subfigure[$\epsilon=0.2$]{
% 		\label{fig3:subfig:1.a} %% label for first subfigure
% 		\includegraphics[width=0.23\textwidth]{arxiv_v3/figure/ijcnn1_02_err_epoch_CI.pdf}}
% 	\subfigure[$\epsilon=0.5$]{
% 		\label{fig3:subfig:1.b} %% label for first subfigure
% 		\includegraphics[width=0.23\textwidth]{arxiv_v3/figure/ijcnn1_05_err_epoch_CI.pdf}}
% 			\subfigure[$\epsilon=0.2$]{
% 		\label{fig3:subfig:1.c} %% label for first subfigure
% 		\includegraphics[width=0.23\textwidth]{arxiv_v3/figure/ijcnn1_02_norm_epoch_CI.pdf}}
% 	\subfigure[$\epsilon=0.5$]{
% 		\label{fig3:subfig:1.d} %% label for first subfigure
% 		\includegraphics[width=0.23\textwidth]{arxiv_v3/figure/ijcnn1_05_norm_epoch_CI.pdf}}

% 	%\vspace{.3in}
% 	\caption{Results for nonconvex logistic regression on \textit{ijcnn1} dataset. (a), (b) show the objective loss versus the number of epochs. (c), (d) illustrate the gradient norm versus the number of epochs. } \label{figure:ijcnn1}%% label for entire figure
% \end{figure*}


\begin{figure*}[!t]%
	\centering
	%\vspace{.3in}
	\subfigure[$\epsilon=3.0$]{
		\label{fig2:subfig:1.a} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/MNIST_err_3_steps_CI.pdf}}
	\subfigure[$\epsilon=3.0$]{
		\label{fig2:subfig:1.b} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/MNIST_err_3_time_CI.pdf}}
	\subfigure[$\epsilon=1.2$]{
		\label{fig2:subfig:1.c} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/MNIST_err_1_steps_CI.pdf}}
			\subfigure[$\epsilon=1.2$]{
		\label{fig2:subfig:1.d} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/MNIST_err_1_time_CI.pdf}}
	%\vspace{-7pt}
	\caption{Results on the MNIST dataset. (a), (b) depict the test error versus the number of iterations and the training time, respectively, under the privacy budget $\epsilon=3.0$. (c), (d) show the same quantities under the privacy budget $\epsilon=1.2$.} \label{fig:mnist}
\end{figure*}


 \begin{figure*}[!thb]%
	\centering
	%\vspace{.3in}
	\subfigure[$\epsilon=2.0$]{
		\label{fig6:subfig:1.a} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/CIFAR_res_err_2_steps_CI.pdf}}
	\subfigure[$\epsilon=2.0$]{
		\label{fig6:subfig:1.b} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/CIFAR_res_err_2_time_CI.pdf}}
	\subfigure[$\epsilon=4.0$]{
		\label{fig6:subfig:1.c} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/CIFAR_res_err_4_steps_CI.pdf}}
			\subfigure[$\epsilon=4.0$]{
		\label{fig6:subfig:1.d} %% label for first subfigure
		\includegraphics[width=0.23\textwidth]{figure/CIFAR_res_err_4_time_CI.pdf}}
% 			\subfigure[$\epsilon=8.0$]{
% 		\label{fig6:subfig:1.e} %% label for first subfigure
% 		\includegraphics[width=0.23\textwidth]{icml_2021/figure_CI/CIFAR_err_8_steps_CI.pdf}}
% 	\subfigure[$\epsilon=8.0$]{
% 		\label{fig6:subfig:1.f} %% label for first subfigure
% 		\includegraphics[width=0.23\textwidth]{icml_2021/figure_CI/CIFAR_err_8_time_CI.pdf}}
% 	\subfigure[$\epsilon=10.0$]{
% 		\label{fig6:subfig:1.g} %% label for first subfigure
% 		\includegraphics[width=0.23\textwidth]{icml_2021/figure_CI/CIFAR_err_10_steps_CI.pdf}}
% 			\subfigure[$\epsilon=10.0$]{
% 		\label{fig6:subfig:1.h} %% label for first subfigure
% 		\includegraphics[width=0.23\textwidth]{icml_2021/figure_CI/CIFAR_err_10_time_CI.pdf}}
	%\vspace{.3in}
	\caption{Results for CNN6 on the CIFAR-10 dataset. (a), (b) depict the test error under the privacy budget $\epsilon=2.0$. (c), (d) illustrate the test error under the privacy budget $\epsilon=4.0$.} \label{fig:cifar10}
\end{figure*}

\subsection{Convolutional Neural Networks}
We compare our algorithm with the differentially private stochastic gradient descent (DP-SGD) algorithm proposed by \citet{abadi2016deep} for training convolutional neural networks for image classification on the MNIST \citep{lecun1998gradient} and CIFAR-10 \citep{krizhevsky2009learning} datasets.

\shortsection{Architecture for MNIST} 
For the MNIST dataset, we consider a four-layer CNN\footnote{\url{https://github.com/facebookresearch/pytorch-dp}.}, which achieves 99\% classification accuracy on the test set after training with SGD.
% and for CIFAR 10 dataset, we consider the benchmark architecture in the pytorch tutorial \footnote{\url{https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html}.}.


\shortsection{Parameters for MNIST} 
 We choose privacy budgets $\epsilon\in\{1.2,3.0,7.0\}$ and set $\delta=10^{-5}$. To ensure the privacy guarantee (see \eqref{eq:sentivity}), we set the clipping parameter $C_1=1.5$ for the term $\|\nabla f_i(\theta^t)\|_2$. For the term $\|\nabla f_i(\theta^t)-\nabla f_i(\theta^{t-1})\|_2$, we choose the clipping parameter $C_2$ from the grid $\{0.01,0.1,0.3,0.5,0.7,0.9,0.99\}$.
 For both DP-SGD and DP-SRM, we tune the batch size $b$ over the grid $\{256, 512, 1024\}$ and the step size over the grid $\{0.01,0.05,0.1,0.25,0.5\}$. For DP-SRM, we additionally tune the batch size $b_0$ over $\{b,2b,4b\}$ and set the momentum parameter $\gamma=C_2$.
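
As a minimal sketch of how the two thresholds act, assume the standard norm clipping of \citet{abadi2016deep}. The operator name $\mathrm{clip}_C$ is introduced here only for illustration and may differ from the exact form used in the experiments. For a vector $v$ and a threshold $C>0$,
\begin{align*}
\mathrm{clip}_C(v) = \frac{v}{\max\{1, \|v\|_2/C\}},
\end{align*}
so that $\|\mathrm{clip}_C(v)\|_2\le C$ always holds. For instance, with $C_1=1.5$, a per-example gradient with $\|\nabla f_i(\theta^t)\|_2=3$ is scaled by $1/2$, while one with norm $1$ is left unchanged. The same operator with threshold $C_2$ would apply to the difference $\nabla f_i(\theta^t)-\nabla f_i(\theta^{t-1})$.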
 
\shortsection{Results for MNIST}
 Figure~\ref{fig:mnist} illustrates the average test error and the corresponding 95\% confidence interval of different methods versus the number of iterations and the training time (in seconds) under the privacy budgets $\epsilon=1.2$ and $\epsilon=3.0$ over 30 trials. We observe similar results under the privacy budget $\epsilon=7.0$, and thus defer them to Section A of the Appendix. The CNN trained by non-private SGD achieves $1\%$ test error after 20 epochs. Figure~\ref{fig2:subfig:1.a} and Figure~\ref{fig2:subfig:1.c} show that our proposed method achieves $3.62\%$ and $4.49\%$ test errors when $\epsilon=3.0$ and $\epsilon=1.2$, respectively, which are better than the $3.81\%$ and $5.33\%$ test errors of DP-SGD. In addition, our method converges faster than DP-SGD. Figure~\ref{fig2:subfig:1.a} and Figure~\ref{fig2:subfig:1.b} demonstrate that, compared with DP-SGD, our method takes only $0.3\times$ the iterations and $0.4\times$ the training time to achieve comparable performance under the privacy budget $\epsilon=3.0$.
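
To make these gaps concrete, the following back-of-the-envelope arithmetic uses only the test errors reported above; the relative reductions are derived quantities, not additional measurements:
\begin{align*}
\epsilon=3.0:\quad & \frac{3.81-3.62}{3.81} = \frac{0.19}{3.81} \approx 5.0\%,\\
\epsilon=1.2:\quad & \frac{5.33-4.49}{5.33} = \frac{0.84}{5.33} \approx 15.8\%.
\end{align*}
In these two settings, the relative reduction in test error over DP-SGD is therefore larger under the tighter privacy budget.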

\shortsection{Architecture for CIFAR-10} 
We consider two convolutional neural networks for CIFAR-10. The first one is a five-layer CNN with two convolutional layers and three fully connected layers, which we call CNN5\footnote{\url{https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html}.}. We train CNN5 from scratch using our DP-SRM method and the DP-SGD method \citep{abadi2016deep}, and compare their performance in terms of model accuracy, number of iterations, and training time. For the second one, we consider an architecture similar to that of \citet{abadi2016deep}, which has three convolutional layers with 32, 64, and 128 filters, respectively, followed by three fully connected layers; we denote it by CNN6. For CNN6, we follow the same experimental setting as in \citet{abadi2016deep}: we use the CIFAR-100 dataset as a public dataset, and first train a network with the same architecture on this dataset to obtain a pretrained model. Then, we initialize the convolutional layers of CNN6 using the convolutional layers of the pretrained model, and train only the fully connected layers of CNN6 on the CIFAR-10 dataset using different private methods.

\shortsection{Parameters for CNN6} We choose three different privacy budgets $\epsilon\in\{2.0,4.0,8.0\}$ and set $\delta=10^{-5}$. We set the clipping parameter $C_1=2$ for the term $\|\nabla f_i(\theta^t)\|_2$. For the term $\|\nabla f_i(\theta^t)-\nabla f_i(\theta^{t-1})\|_2$, we choose the clipping parameter $C_2$ from the grid $\{0.01,0.05,0.1,0.3,0.5,0.7,0.9,0.95,0.99\}$.
For DP-SGD, we tune the step size over the grid $\{0.01,0.02,0.05,0.1,0.15,0.2\}$ and the batch size over $\{64, 128, 256\}$. For DP-SRM, we tune the batch size $b$ over the grid $\{64, 128, 256\}$, the step size over $\{0.01,0.02,0.05,0.1,0.15,0.2\}$, and $b_0$ over $\{b,2b,4b\}$. In addition, we set the momentum parameter $\gamma=C_2$.
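
For a rough sense of the tuning cost, assume an exhaustive search over the Cartesian product of the grids above (an assumption made only for this illustration; a coordinate-wise search would be cheaper). The number of configurations per privacy budget is then
\begin{align*}
\text{DP-SGD}:\quad & 6\times 3 = 18,\\
\text{DP-SRM}:\quad & 3\times 6\times 3\times 9 = 486,
\end{align*}
where the factors for DP-SRM correspond to $b$, the step size, $b_0$, and $C_2$, respectively. Since $\gamma=C_2$, the momentum parameter adds no further search dimension.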
 

\shortsection{Results for CNN6} Figure~\ref{fig:cifar10} presents the average test error and the corresponding 95\% confidence interval of different methods versus the number of iterations and the training time (in seconds) over 30 trials. CNN6 trained by non-private SGD achieves $18.5\%$ test error after 150 epochs. The results show that our proposed method achieves $33.2\%$ and $31.0\%$ test errors given $\epsilon=2.0$ and $\epsilon=4.0$, respectively, which are comparable to the $33.2\%$ and $31.2\%$ test errors of DP-SGD under the same privacy budgets. However, the plots show that our method significantly reduces the number of iterations and the training time. For example, when $\epsilon=4.0$, DP-SGD takes $1.3\times 10^{4}$ iterations and 1115 seconds to achieve $31.2\%$ test error. In sharp contrast, our method takes only $6.8\times 10^{3}$ iterations and 643 seconds to achieve $31.0\%$ test error. We observe similar results for CNN5, which are presented in Section A of the Appendix.
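
The savings in the $\epsilon=4.0$ example can be quantified directly from the numbers above; the ratios below are derived arithmetic rather than additional measurements, and they inherit the rounding of the reported iteration counts:
\begin{align*}
\text{iterations}:\quad & \frac{6.8\times 10^{3}}{1.3\times 10^{4}} \approx 0.52,\\
\text{training time}:\quad & \frac{643}{1115} \approx 0.58.
\end{align*}
In other words, DP-SRM needs roughly half the iterations and a little under $60\%$ of the training time of DP-SGD in this setting. The implied per-iteration cost is approximately $643/6800\approx 0.095$ seconds for DP-SRM versus $1115/13000\approx 0.086$ seconds for DP-SGD, which suggests that the speedup comes from needing fewer iterations rather than from cheaper ones.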

\section{Conclusions}\label{sec:8}
We propose an efficient differentially private algorithm for nonconvex ERM, and prove both privacy and utility guarantees for it. Both our theoretical analysis and our experiments demonstrate the advantage of our algorithm over the state-of-the-art. An interesting direction for future work is to study the performance of our method on very large, even industrial-scale, neural networks. Another is to establish optimization lower bounds for differentially private nonconvex stochastic optimization.

\section*{Acknowledgement}
We thank the anonymous reviewers for their helpful comments. This work was partially supported by grants from the National Science
Foundation (\#1717950, \#1804603 and \#1915813). The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.


% References
\bibliography{wang_243}

% \newpage
% \bibliography{wang_243}
% \newpage
% \appendix
% \onecolumn
% \input{supp.tex}

\end{document}
