\documentclass[accepted]{uai2022} %for full version
% \documentclass{uai2022}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
%\usepackage{microtype}      % microtypography
\usepackage{bm}
% \usepackage{fullpage}
\usepackage{amsthm,amsmath,amsfonts,amssymb}
\usepackage{mathtools}
\usepackage{algorithmic}
\usepackage{algorithm}
\usepackage{xcolor}
\usepackage{graphicx}
\usepackage{subcaption}
% \usepackage{subfigure}
\usepackage{multirow}
\usepackage{multicol}
\usepackage{hyperref}
\usepackage{mathrsfs, fancyhdr}

% \usepackage[shortlabels]{enumitem}
\usepackage{makecell}
\usepackage{rotating}
%\ \usepackage[switch]{lineno}
 
%\theoremstyle{theorem}
% \theoremstyle{definition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{proposition}{Proposition}
\newtheorem{corollary}{Corollary}
\newtheorem{definition}{Definition}
\newtheorem{exercise}{Exercise}
\newtheorem{claim}{Claim}
\newtheorem{assumption}{Assumption}
\newtheorem{example}{Example}
\newtheorem{property}{Property}
\newtheorem{remark}{Remark}
\def\S{S}

\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    % \renewcommand{\bibsection}{\subsubsection*{References}}
% \renewcommand{\bibname}{References}
% \renewcommand{\bibsection}{\subsubsection*{\bibname}}


\usepackage{xr}

% In your preamble

\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}

% In your preamble

\myexternaldocument{yang_122-supp}


\usepackage{yang-macros}
\newcommand{\yiming}[1]{{\color{blue}{\noindent{Yiming: \bfseries [}{ \sffamily #1}{\rm\bfseries ]~}}}}
\newcommand{\yunwen}[1]{{\color{blue}{\noindent{Yunwen: \bfseries [}{ \sffamily #1}{\rm\bfseries ]~}}}}
\newcommand{\shu}[1]{{\color{red}{\noindent{shu: \bfseries [}{ \sffamily #1}{\rm\bfseries ]~}}}}
\newcommand{\zhenhuan}{\textcolor{orange}}
\newcommand{\kush}[1]{{\color{orange}{\noindent{Kush: \bfseries [}{ \sffamily #1}{\rm\bfseries ]~}}}}

% \linenumbers

\title{Differentially Private SGDA for Minimax Problems}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Zhenhuan Yang}
\author[2]{Shu Hu}
\author[3]{Yunwen Lei}
\author[4]{Kush R Varshney}
\author[2]{Siwei Lyu}
\author[1]{\href{mailto:<yying@albany.edu>?Subject=Your UAI 2022 paper}{Yiming Ying}{}}
% Add affiliations after the authors
\affil[1]{%
    University at Albany\\
    Albany, New York, USA
}
\affil[2]{%
    University at Buffalo\\
    Buffalo, New York, USA
}
\affil[3]{%
    University of Birmingham\\
    Birmingham, UK
}
\affil[4]{%
    IBM Research\\
    Yorktown Heights, New York, USA
  }

% \aistatsauthor{ Zhenhuan Yang$^a$ \And Shu Hu$^b$ \And Yunwen Lei$^c$ \And  Kush Varshney$^d$  \And Siwei Lyu$^b$ \And Yiming Ying$^a$  }

% \aistatsaddress{$^a$University at Albany \And $^b$University at Buffalo   \And $^c$University of Birmingham \And  $^d$IBM Research} ]
\begin{document}
\maketitle

\begin{abstract}
Stochastic gradient descent ascent (SGDA) and its variants have been the workhorse for solving minimax problems. However, in contrast to the well-studied stochastic gradient descent (SGD) with differential privacy (DP) constraints, there is little work on understanding the generalization (utility) of SGDA with DP constraints. In this paper, we use the algorithmic stability approach to establish the generalization (utility) of DP-SGDA in different settings. In particular, for the convex-concave setting, we prove that DP-SGDA can achieve an optimal utility rate in terms of the weak primal-dual population risk in both smooth and non-smooth cases. To the best of our knowledge, this is the first known result for DP-SGDA in the non-smooth case. We further provide its utility analysis in the nonconvex-strongly-concave setting, which is the first known result in terms of the primal population risk. The convergence and generalization results for this nonconvex setting are new even in the non-private setting. Finally, numerical experiments are conducted to demonstrate the effectiveness of DP-SGDA for both convex and nonconvex cases.
\end{abstract}

% \vspace{-6pt}
\section{Introduction}\label{sec:introduction}
% \vspace{-6pt}

In recent years, there has been growing interest in studying minimax problems, which involve both minimization over the primal variable $\wbf$ and maximization over the dual variable $\vbf$. Notable examples include generative adversarial networks (GANs) \citep{goodfellow2014generative,arjovsky2017wasserstein}, AUC maximization \citep{gao2013one,ying2016stochastic,Natole2018,liu2019stochastic,zhao2011online}, robust learning \citep{audibert2011robust,xu2009robustness}, adversarial training \citep{sinha2017certifying}, algorithmic fairness \citep{mohri2019agnostic,li2019fair,wang2020robust,martinez2020minimax,diana2021minimax}, and Markov decision processes (MDPs) \citep{puterman2014markov,wang2017primal}. Details of these motivating examples are given in Appendix \ref{sec:motivating-example}.

The minimax  problem can be formulated as 
\begin{equation}\label{eq:SSP}
\min_{\wbf \in \Wcal} \max_{\vbf \in \Vcal}\Big\{ F(\wbf, \vbf) := \Ebb_{\zbf\sim \Dcal}[f(\wbf,\vbf; \zbf)]\Big\}, 
\end{equation}
where $\Wcal\subseteq \mathbb{R}^{d_1}$ and $\Vcal\subseteq \mathbb{R}^{d_2}$ are two nonempty closed and convex domains and $\zbf$ is a random variable from some distribution $\Dcal$ taking values in $\mathcal{Z}$.  
Since the distribution $\Dcal$ is usually unknown and one has access only to an i.i.d. training dataset $\S = \{\zbf_1, \cdots, \zbf_n\}$,  one resorts to solving its  empirical minimax problem 
\begin{align*}
\min_{\wbf \in \Wcal} \max_{\vbf \in \Vcal} \Big\{F_\S(\wbf,\vbf) : = \frac{1}{n}\sum_{i=1}^{n} f(\wbf, \vbf; \zbf_i)\Big\}.
\end{align*}
One popular optimization algorithm for solving this problem is SGDA. Specifically, at iteration $t$, upon receiving a random data point or mini-batch  from $\S$, it performs gradient descent over $\wbf$ with the stepsize $\eta_{\wbf,t}$ and gradient ascent over   $\vbf$ with the stepsize $\eta_{\vbf,t}$. %SGDA can be regarded a relaxed version of stochastic gradient descent with max-oracle (SGDmax) which uses SGD for minimizing the primal objective function i.e., $\min_{\wbf\in \Wcal}\{R_S(\wbf) := \max_{\vbf\in \Vcal} F_S(\wbf,\vbf)\}$ on $\wbf.$ However, this will require an inner loop for finding  $\arg\max_{\vbf\in \Vcal} F_S(\wbf,\vbf)$ at each iteration. 
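
For concreteness, one SGDA step can be sketched as follows, under the simplifying assumption that a single index $i_t \in [n] = \{1,\ldots,n\}$ is sampled uniformly at iteration $t$, and with $\Pi_{\Wcal}$ and $\Pi_{\Vcal}$ denoting the Euclidean projections onto $\Wcal$ and $\Vcal$:
\begin{align*}
\wbf_{t+1} &= \Pi_{\Wcal}\big(\wbf_t - \eta_{\wbf,t}\nabla_\wbf f(\wbf_t, \vbf_t; \zbf_{i_t})\big),\\
\vbf_{t+1} &= \Pi_{\Vcal}\big(\vbf_t + \eta_{\vbf,t}\nabla_\vbf f(\wbf_t, \vbf_t; \zbf_{i_t})\big).
\end{align*}
Both updates are evaluated at the same iterate $(\wbf_t,\vbf_t)$, i.e., the descent and ascent steps are simultaneous rather than alternating.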

As SGDA is conceptually simple and easy  to implement, it is widely  deployed in solving minimax problems, e.g., GANs \citep{goodfellow2014generative}, adversarial learning \citep{sinha2017certifying}, and AUC maximization \citep{ying2016stochastic}. Its local convergence analysis for nonconvex-(strongly)-concave problems was established  in \citet{lin2020gradient}. Other variants of SGDA  were proposed and studied in \citet{luo2020stochastic,nouiehed2019solving,rafique2021weakly,yan2020optimal}. 
 
 
On another front, collected data often contain sensitive information such as individual records from hospitals, online behavior from social media, and genomic data from cancer diagnosis. Differential privacy \citep{dwork2014algorithmic} has emerged as a well-accepted mathematical definition of privacy which ensures that an attacker gets roughly the same information from the dataset regardless of whether an individual is present or not. Its related technologies have been adopted by Google \citep{google-DP}, Apple \citep{miscrosoft-DP}, and the US Census Bureau \citep{us-census-bureau-DP}. As SGD and SGDA have become the workhorses behind the remarkable progress of machine learning and AI, it is of pivotal importance to develop their counterparts with DP constraints.

Many studies analyze the privacy and utility of DP-SGD for the ERM problem, which only involves minimization over $\wbf$ \citep{bassily2019private,bassily2020stability,feldman2020private,song2013stochastic,WLYZ,wang2020differentially,wang2019subsampled,wu2017bolt,zhou2020private}. In contrast, there is little work on analysing the utility of minimax optimization algorithms with DP constraints, except for the recent work of \citet{boob2021optimal}. However, \citet{boob2021optimal} focus on the noisy stochastic extragradient method in the convex-concave and smooth setting.  %The utility (generalization) results there hold true only in terms of weak primal-dual population risk. \footnote{} 

Studying the computational and statistical behavior of DP-SGDA is fundamental to understanding stochastic optimization algorithms for minimax problems under differential privacy constraints. In this paper, we propose novel convergence and stability analyses to establish the utility of DP-SGDA in both empirical saddle-point and population forms, such as the weak primal-dual population risk and the primal population risk. Table \ref{tab:summary} collects the notation and results for the performance measures used in this paper. Our contributions can be summarized as follows.

\begin{table*}[ht]
    \centering
    %\tlstyle
    \setlength{\tabcolsep}{3pt}
    \begin{tabular}{|c|c|c|c|c|c|}
    \hline
     Algorithm  &  Assumption & Measure & Rate & Complexity & Simplicity \\\hline
         NSEG & C-C, Lip, S & $\triangle^w(A_\wbf(S), A_\vbf(S))$ & $\Ocal\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big)$ & $\Ocal(n^2)$ & Single-loop\\\hline
         NISPP & C-C, Lip, S & $\triangle^w(A_\wbf(S), A_\vbf(S))$ & $\Ocal\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big)$ & $\Ocal(n^{3/2}\log(n))$ & Double-loop\\ \hline
         \multirow{3}{*}{\makecell{DP-SGDA\\(Ours)}} & C-C, Lip, S & $\triangle^w(A_\wbf(S), A_\vbf(S))$ & $\Ocal\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big)$ & $\Ocal(n^{3/2})$ & \multirow{3}{*}{Single-loop}\\\cline{2-5}
         & C-C, Lip & $\triangle^w(A_\wbf(S), A_\vbf(S))$ & $\Ocal\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big)$ & $\Ocal(n^{5/2})$ & \\\cline{2-5}
          & PL-SC, Lip, S & $R(A_\wbf(S)) - \min_{\wbf}R(\wbf)$  & $\Ocal\Big(\frac{1}{n^{1/3}} + \frac{\sqrt{d\log(1/\delta)}}{n^{5/6}\epsilon}\Big)$ & $\Ocal(n^{3/2})$ & \\\hline
    \end{tabular}
    \caption{\it Summary of Results. DP-SGDA is Algorithm \ref{alg:dp-sgda} in this paper. NSEG and NISPP are Algorithms 1 and 2 in \citet{boob2021optimal}, respectively. Here C-C means convexity and concavity, PL-SC means the PL condition and strong concavity, Lip means Lipschitz continuity, and S means smoothness. $\triangle^w(A_\wbf(S), A_\vbf(S))$ is the weak PD population risk and $R(A_\wbf(S)) - \min_{\wbf}R(\wbf)$ is the excess primal population risk. \label{tab:summary}}
    % \vspace{-12pt}
\end{table*}

\noindent$\bullet$ We analyze the privacy and utility of DP-SGDA under the convex-concave setting in terms of the weak primal-dual population risk, i.e., $\max_{\vbf\in\Vcal}\Ebb\big[F(A_\wbf(\S),\vbf)\big]\!-\!\min_{\wbf\in\Wcal}\Ebb\big[F(\wbf,A_\vbf(\S))\big]$, \! where $(A_\wbf(\S),\! A_\vbf(\S))$ is the output of DP-SGDA. Specifically, we show that it can guarantee $(\epsilon,\delta)$-DP and achieve the optimal rate $\Ocal\Big(\frac{1}{\sqrt{n}} +  \frac{\sqrt{d\log (1/\delta)}}{n\epsilon}\Big)$ in both the smooth and nonsmooth cases, where $d = \max\{d_1, d_2\}.$ To the best of our knowledge, this is the first known result for DP-SGDA in the nonsmooth case.

\noindent $\bullet$ We further study the utility of DP-SGDA in the nonconvex-strongly-concave case in terms of the primal population risk, i.e., $R(A_\wbf(\S)) =  \max_{\vbf\in\Vcal}F(A_\wbf(\S),\vbf).$ In particular, under the \PL (PL) condition of $F_\S$, we prove that the excess primal population risk, i.e.,  $R(A_\wbf(\S)) - \min_{\wbf\in \Wcal} R(\wbf)$, enjoys the rate $\Ocal\Big(\frac{1}{n^{1/3}} + \frac{\sqrt{d\log(1/\delta)}}{n^{5/6}\epsilon}\Big)$ while guaranteeing $(\epsilon, \delta)$-DP. The key techniques involve the convergence analysis of $R_S(A_\wbf(\S)) - \min_{\wbf} R_S(\wbf)$ and the stability analysis for $A_\wbf(S)$, both of which are of interest in their own right. As far as we are aware, these are the first known results for DP-SGDA in the nonconvex setting. 
 
\noindent $\bullet$ We perform numerical experiments on three benchmark datasets which validate the effectiveness of DP-SGDA for both convex and non-convex cases.
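
To illustrate the two rates above (a back-of-the-envelope reading that ignores constants), the privacy term is dominated by the statistical term exactly when
\begin{align*}
\frac{\sqrt{d\log(1/\delta)}}{n\epsilon} \leq \frac{1}{\sqrt{n}} \iff d\log(1/\delta) \leq n\epsilon^2.
\end{align*}
Multiplying both terms of the nonconvex-strongly-concave rate by $n^{5/6}$ shows that the same condition is equivalent to $\frac{\sqrt{d\log(1/\delta)}}{n^{5/6}\epsilon} \leq \frac{1}{n^{1/3}}$. In this regime the two rates read $\Ocal(1/\sqrt{n})$ and $\Ocal(1/n^{1/3})$, respectively.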

% \vspace{-6pt}
\subsection{Motivating Examples}
% \vspace{-6pt}
We give two examples of minimax problems under the DP constraint. See Appendix \ref{sec:motivating-example} for more examples and details.

\textbf{AUC Maximization.} Area Under the ROC Curve (AUC) is a widely used measure for binary classification. It has been shown that optimizing AUC is equivalent to a minimax problem once auxiliary variables $a, b, v \in \Rbb$ are introduced \citep{ying2016stochastic}:
\begin{align*}
\min_{\theta, a, b}\max_{v}\Big\{	F(\theta,a,b,v) = \Ebb_\zbf[f(\theta, a, b, v;\zbf)]\Big\}.
\end{align*}
Differential privacy has been applied to learn private classifiers by optimizing AUC \citep{wang2021differentially}.
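For reference, the integrand has the following explicit form for a linear scoring function, as we recall it from \citet{ying2016stochastic}. The notation $\zbf = (x, y)$ with $y \in \{\pm 1\}$ and $p = \Pbb[y = 1]$ is introduced here only for this illustration, and the cited reference should be consulted for the precise statement:
\begin{align*}
f(\theta,a,b,v;\zbf) ={}& (1-p)(\theta^\top x - a)^2\,\mathbb{I}_{[y=1]} + p(\theta^\top x - b)^2\,\mathbb{I}_{[y=-1]} \\
& + 2(1+v)\,\theta^\top x\,\big(p\,\mathbb{I}_{[y=-1]} - (1-p)\,\mathbb{I}_{[y=1]}\big) - p(1-p)v^2.
\end{align*}
For fixed $v$, this $f$ is a sum of convex quadratics and a linear term in $(\theta, a, b)$, and for fixed $(\theta,a,b)$ it is a linear term minus $p(1-p)v^2$ in $v$. Hence it is convex in $(\theta,a,b)$ and, whenever $p\in(0,1)$, strongly concave in $v$.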

\textbf{Generative Adversarial Networks.} Originally proposed in \citet{goodfellow2014generative}, a GAN can in general be written as a minimax problem between a generator network $G_\vbf$ and a discriminator network $D_\wbf$:
\begin{align*}
\min_{\wbf}\max_{\vbf} \mathbb{E}[f(\wbf,\vbf;\zbf,\xi)] \!=\! \mathbb{E}_{\zbf} [D_\wbf(\zbf)] \!-\! \mathbb{E}_{\xi} [D_\wbf(G_\vbf(\xi))].
\end{align*}
DP-SGDA and its variants were employed to train differentially private GANs by \citet{xie2018differentially}. Recently, differential privacy has been successfully applied to private data generation within the GAN framework \citep{jordon2018pate, beaulieu2019privacy}.

% \vspace{-6pt}
\subsection{Related Work}
% \vspace{-6pt}
% \bigskip\medskip
% \noindent\textbf{1.1\;\; Related Work}

%\medskip

Below  we briefly discuss some related work. 

\noindent{\bf Convergence analysis for SGDA.} It is a classical result that SGDA can achieve a convergence rate   $\Ocal(1/\sqrt{T})$ in the convex and concave case \citep{nedic2009subgradient,nemirovski2009robust} where $T$ is the number of iterations. For the nonconvex-(strongly)-concave case,  the work of  \citet{lin2020gradient} shows the local convergence of SGDA if  the stepsizes $\eta_{\wbf,t}$ and  $\eta_{\vbf,t}$ are chosen to be appropriately different. Other important studies consider variants of SGDA and prove their local  convergence for the nonconvex case. Such algorithms include nested algorithms  \citep{rafique2021weakly} for weakly-convex-weakly-concave problems, multi-step GDA  \citep{nouiehed2019solving} under the one-sided PL condition,  epoch-wise SGDA  \citep{yan2020optimal}, and  stochastic recursive SGDA \citep{luo2020stochastic} for  nonconvex-strongly-concave problems, to mention but a few.  
 

\noindent{\bf Stability and generalization of non-private SGD and SGDA.} %Investigating the statistical generalization of  optimization algorithms typically involves the trade-off between the estimation (generalization) error, i.e. the difference between the population risk and the empirical one,  and optimization error (convergence rate). 
The studies of \citet{hardt2016train,charles2018stability,kuzborskij2018data} use uniform stability \citep{bousquet2002stability} to derive the generalization of non-private SGD for the convex and smooth case, while the convex and nonsmooth case was established by \citet{bassily2020stability,lei2020fine}. The nonconvex case under the PL condition was considered by \citet{charles2018stability,lei2021sharper}. The stability and generalization of SGDA for minimax problems were studied by \citet{lei2021stability} in different forms for convex and nonconvex, smooth and nonsmooth cases, and by \citet{farnia2021train} with a focus on the smooth cases.

{\bf DP-SGD and DP-SGDA.} DP-SGD was shown to attain the optimal excess population risk $\Ocal\big({1}/{\sqrt{n}} + {\sqrt{d\log (1/\delta)}}/{(n\epsilon)}\big)$ in \citet{bassily2019private,bassily2020stability,WLYZ,wang2020differentially} for the convex case. For nonconvex objectives, \citet{wang2019differentially} studied DP Gradient Langevin Dynamics, and \citet{zhang2021private} studied a multi-stage variant of DP-SGD assuming weak quasi-convexity and the PL condition. In \citet{xie2018differentially,zhang2018differentially}, DP-SGDA and its variants, together with clipping techniques, were employed to train differentially private GANs, which showed promising results in applications. However, no utility analysis was given there. \citet{boob2021optimal} focused on the noisy stochastic extragradient method with DP constraints for minimax problems in the convex-concave and smooth setting and provided its utility analysis using variational inequality (VI) and stability approaches. 
% \citep{zhou2020private} considered adaptive private gradient-based methods including DP RMSProp and DP Adam. They can only guarantee to find stationary points of the population risk with the rate $\tilde{\Ocal}(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon})$.


% \textbf{Differentially Private GANs.} Releasing the generator as a substitute for the original training data distribution entails privacy risks. \citep{hitaj2017deep} introduced an active inference attack model that can reconstruct training samples from the generated ones. \citep{chen2020gan} studied membership inference attack against deep generative models. \citep{xie2018differentially} initiated the work on differentially private GANs, where they proposed to add Gaussian noise towards the parameters of the discriminator only, as the parameters of the generator satisfies differential privacy guarantee by the post-processing property.

% \textbf{Two-time-scale Update Rule and Alternating Gradient Descent Ascent.} There is a vast literature studying the convergence of SGDA for (strongly)-convex-(strongly)-concave minimax optimization problems with equal stepsizes.  \citep{heusel2017gans} showed that two-time-scale SGDA converges under mild assumptions to a stationary local Nash equilibrium. \citep{gidel2018variational} showed that both SGDA and AGDA fail to converge if equal stepsizes are used in general unconstrained bilinear games. \citep{gidel2019negative} showed that SGDA algorithms with constant stepsizes may fail to converge for bilinear smooth games, while AGDA may not. 

% \vspace{-6pt}
\section{Problem  Formulation }\label{sec:preliminaris}
% \vspace{-6pt}
In this section, we introduce necessary assumptions, notations and the DP-SGDA algorithm.
% \vspace{-6pt}
\subsection{Assumptions and Notations}\label{sec:assumption}
% \vspace{-6pt}
Firstly, we introduce necessary  assumptions and notations. A function $h: \Wcal \rightarrow \Rbb$ is said to be convex if,  for all $\wbf, \wbf' \in \Wcal$, there holds $h(\wbf) \geq h(\wbf')+ \langle\nabla h(\wbf'), \wbf - \wbf'\rangle$ where $\nabla$ is the gradient operator and $\langle \cdot, \cdot\rangle$ is the inner product. Let $\|\cdot\|_2$ denote the Euclidean norm. We say $h$ is $\rho$-strongly-convex if $h - \frac{\rho}{2}\|\wbf\|_2^2$ is convex, $h$ is concave if $-h$ is convex, and $\rho$-strongly-concave if $-h-\frac{\rho}{2}\|\wbf\|_2^2$ is convex. Let $[n]:=\{1,2,\ldots,n\}$.

\begin{definition}
Let $h: \Wcal \times \Vcal \rightarrow \Rbb$ be a function. We say $h$ is convex-concave if, for any $\vbf \in\Vcal$, the function $\wbf \mapsto h(\wbf,\vbf)$ is convex and, for any $\wbf \in\Wcal$, the function $\vbf \mapsto h(\wbf,\vbf)$ is concave.
\end{definition}
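As a simple illustration (the matrix $M$ and the constant $\rho$ are introduced only for this example), take $\Wcal = \Rbb^{d_1}$, $\Vcal = \Rbb^{d_2}$, $M \in \Rbb^{d_1\times d_2}$, $\rho \geq 0$, and
\begin{align*}
h(\wbf,\vbf) = \frac{\rho}{2}\|\wbf\|_2^2 + \langle \wbf, M\vbf\rangle - \frac{\rho}{2}\|\vbf\|_2^2.
\end{align*}
For fixed $\vbf$, the map $\wbf\mapsto h(\wbf,\vbf)$ is a convex quadratic plus an affine term, and for fixed $\wbf$, the map $\vbf\mapsto h(\wbf,\vbf)$ is an affine term minus a convex quadratic, so $h$ is convex-concave. When $\rho>0$, it is moreover $\rho$-strongly-convex in $\wbf$ and $\rho$-strongly-concave in $\vbf$, while $\rho = 0$ gives the bilinear game $\langle\wbf, M\vbf\rangle$.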

\begin{assumption}[\textbf{A1}]\label{ass:lipschitz}
The function $f$ is said to be Lipschitz continuous if there exist $G_\wbf, G_\vbf > 0$ such that, for any $\wbf, \wbf' \in \Wcal$, $\vbf, \vbf' \in \Vcal$ and $\zbf \in \Zcal$, 
$|f(\wbf, \vbf; \zbf) - f(\wbf', \vbf; \zbf)| \leq  G_\wbf \|\wbf - \wbf'\|_2$ and    
$|f(\wbf, \vbf; \zbf) - f(\wbf, \vbf'; \zbf)| \leq  G_\vbf \|\vbf - \vbf'\|_2.$ We denote $G = \max\{G_\wbf, G_\vbf\}$.
\end{assumption}

\begin{assumption}[\textbf{A2}]\label{ass:bounded-variance}
For a uniformly drawn index $j\in [n]$, the gradients $\nabla_\wbf f(\wbf, \vbf; \zbf_j)$ and $\nabla_\vbf f(\wbf, \vbf; \zbf_j)$ have variances bounded by $B_\wbf$ and $B_\vbf$, respectively. We let $B=\max\{B_\wbf, B_\vbf\}$.
\end{assumption}

\begin{assumption}[\textbf{A3}]\label{ass:smooth}
The function $f$ is said to be smooth if it is continuously differentiable and there exists a constant $L > 0$ such that for any $\wbf,\wbf'\in \Wcal$, $\vbf, \vbf'\in \Vcal$ and $\zbf \in \Zcal$,
\begin{align*}
\left\|\!\begin{pmatrix}\!\nabla_\wbf f(\wbf, \vbf; \zbf) \!-\! \nabla_\wbf f(\wbf', \vbf'; \zbf)\!\\
\!\nabla_\vbf f(\wbf, \vbf; \zbf) \!-\! \nabla_\vbf f(\wbf', \vbf'; \zbf)\!
\end{pmatrix}\!\right\|_2 \!\leq\! L\! \left\|\!\begin{pmatrix}
\!\wbf \!-\! \wbf'\\
\!\vbf \!-\! \vbf'
\end{pmatrix}\!\right\|_2.
\end{align*}
% The function $f$ it is said to be smooth if it is continuously differentiable and there exist constants $L_{11}, L_{12}, L_{21}, L_{22} > 0$ such that for any $\wbf,\wbf'\in \Wcal, \vbf, \vbf'\in \Vcal$ and $\zbf \in \Zcal$
% \begin{align*}
% \|\nabla_\wbf f(\wbf, \vbf; \zbf) - \nabla_\wbf f(\wbf', \vbf; \zbf)\|_2 \leq & L_{11} \|\wbf - \wbf'\|_2 \\   
% \|\nabla_\wbf f(\wbf, \vbf; \zbf) - \nabla_\wbf f(\wbf, \vbf'; \zbf)\|_2 \leq & L_{12} \|\vbf - \vbf'\|_2 \\ 
% \|\nabla_\vbf f(\wbf, \vbf; \zbf) - \nabla_\vbf f(\wbf', \vbf; \zbf)\|_2 \leq & L_{21} \|\wbf - \wbf'\|_2 \\
% \|\nabla_\vbf f(\wbf, \vbf; \zbf) - \nabla_\vbf f(\wbf, \vbf'; \zbf)\|_2 \leq & L_{22} \|\vbf - \vbf'\|_2
% \end{align*}
\end{assumption}

We also require the \PL (PL) condition. 
\begin{definition}[\citep{polyak1964gradient}]\label{ass:pl}
A function $h: \Wcal \rightarrow \Rbb$ satisfies the PL condition if there exists a constant $\mu>0$ such that, for any $\wbf \in \Wcal$, 
$ \frac{1}{2}\|\nabla h(\wbf)\|_2^2 \geq  \mu(h(\wbf) - \min_{\wbf' \in \Wcal} h(\wbf')).$
\end{definition}
We refer to \citet{karimi2016linear} for a nice discussion of this condition and other general conditions that allow the global convergence of gradient descent.
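As a quick sanity check of the definition (an illustration with $\Wcal = \Rbb^{d_1}$ and a constant $\rho > 0$ chosen only for this example), consider $h(\wbf) = \frac{\rho}{2}\|\wbf\|_2^2$. Then $\nabla h(\wbf) = \rho\wbf$ and $\min_{\wbf'\in\Wcal} h(\wbf') = 0$, so
\begin{align*}
\frac{1}{2}\|\nabla h(\wbf)\|_2^2 = \frac{\rho^2}{2}\|\wbf\|_2^2 = \rho\Big(h(\wbf) - \min_{\wbf'\in\Wcal} h(\wbf')\Big),
\end{align*}
i.e., the PL condition holds with $\mu = \rho$. The condition does not require convexity: \citet{karimi2016linear} discuss the one-dimensional example $h(w) = w^2 + 3\sin^2(w)$, which is nonconvex but satisfies the PL condition.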
 
% \vspace{-6pt}
\subsection{DP-SGDA Algorithm}\label{sec:algorithm}
% \vspace{-6pt}
We now move on to the definition of differential privacy and the description of  DP-SGDA.  Differential privacy was introduced by \citet{dwork2006calibrating,dwork2014algorithmic}. We say that two datasets $\S,\S'$ are neighboring datasets if they differ by at most one example.

\begin{algorithm}[ht!]
\caption{Differentially Private Stochastic Gradient Descent Ascent (DP-SGDA) Method\label{alg:dp-sgda}}
\begin{algorithmic}[1]
\STATE {\bf Inputs:} data $\S = \{\zbf_i: i \in [n]\}$, privacy budget $\epsilon, \delta$, number of iterations $T$, learning rates $\{\eta_{\wbf,t}, \eta_{\vbf,t}\}_{t=1}^T$, and initialize $(\wbf_0, \vbf_0)$
\STATE Compute noise parameters $\sigma_\wbf$ and $\sigma_\vbf$ based on Eq. \eqref{eq:sigma-sigma}
\FOR{$t=1$ to $T$}
\STATE Sample a mini-batch $I_t = \{i_t^1, \cdots, i_t^m\in [n]\}$ uniformly with replacement
\STATE Sample independent noises $\xi_t \sim \Ncal(0, \sigma_\wbf^2 I_{d_1})$ and $\zeta_t \sim \Ncal(0, \sigma_\vbf^2 I_{d_2})$
\STATE
$
\!\wbf_{t\!+\!1} \!=\! \Pi_{\Wcal}\!\Big(\!\wbf_t \!-\! \eta_{\wbf,t} \!\Big(\frac{1}{m}\!\sum_{j\!=\!1}^m \!\nabla_\wbf \!f(\wbf_t, \vbf_t; \zbf_{i_t^j}) \!+\! \xi_t\!\Big)\!\Big)
$
\STATE
$
\vbf_{t\!+\!1} \!=\! \Pi_{\Vcal}\!\Big(\!\vbf_t \!+\! \eta_{\vbf,t} \!\Big(\!\frac{1}{m}\!\sum_{j=1}^m\!\nabla_\vbf \!f(\wbf_t, \vbf_t; \zbf_{i_t^j}) \!+\! \zeta_t\!\Big)\!\Big)
$
\ENDFOR
\STATE {\bf Outputs:}\!\! $(\bar{\wbf}_T, \bar{\vbf}_T)\! =\! \frac{1}{T}\displaystyle\sum_{t=1}^T (\wbf_t, \vbf_t)$ or $({\wbf}_T, {\vbf}_T)$
\end{algorithmic}
\end{algorithm} 
\begin{definition}[Differential Privacy]
A (randomized) algorithm $A$ is called $(\epsilon,\delta)$-differentially private (DP) if, for all neighboring datasets $\S, \S'$ and for all events $O$ in the output space of $A$, the following holds % \shu{Miss definition of $\S^{\prime}$}
\[
\Pbb[A(\S) \in O] \leq e^\epsilon	\Pbb[A(\S') \in O] + \delta.
\]	
\end{definition}
Our aim is to design a randomized algorithm satisfying $(\epsilon, \delta)$-DP which solves the empirical minimax problem: 
\begin{equation}\label{eq:ESPP} \min_{\wbf\in \Wcal}\max_{\vbf\in \Vcal} \Big\{F_\S(\wbf,\vbf) = \frac{1}{n}\sum_{i=1}^n f(\wbf,\vbf; \zbf_i)\Big\}. 
\end{equation}
Notice that in the standard ERM problem, which involves minimization only with respect to $\wbf$, DP-SGD \citep{wu2017bolt,song2013stochastic,bassily2019private,wang2020differentially} uses gradient perturbation at each iteration. Specifically, at each iteration of this algorithm, a stochastic gradient estimated from a random subset (mini-batch) of $\S$ is perturbed by Gaussian noise, and the model parameter is then updated based on this noisy gradient. 

In the same spirit, DP-SGDA \citep{xie2018differentially,zhang2018differentially} adds Gaussian noise at each iteration to the stochastic gradient mapping $(g_{\wbf,t}, g_{\vbf,t}) = (\frac{1}{m}\sum_{j=1}^m\nabla_\wbf f(\wbf_t, \vbf_t; \zbf_{i_t^j}), \frac{1}{m}\sum_{j=1}^m\nabla_\vbf f(\wbf_t, \vbf_t; \zbf_{i_t^j}))$, where the indices $i_t^j$ of the examples $\zbf_{i_t^j}$ are taken from the mini-batch $I_t$. Then, the primal variable $\wbf$ is updated by gradient descent based on the noisy gradient $g_{\wbf,t} +\xi_t$ and the dual variable $\vbf$ is updated by gradient ascent based on the noisy gradient $g_{\vbf,t} +\zeta_t$. The pseudo-code for DP-SGDA is given in Algorithm \ref{alg:dp-sgda}. The noise levels $\sigma_\wbf, \sigma_\vbf$ are given by \eqref{eq:sigma-sigma} in Section \ref{sec:results}, where they are chosen to guarantee $(\epsilon, \delta)$-DP. The notations $\Pi_{\Wcal}(\cdot)$ and $ \Pi_{\Vcal}(\cdot)$ denote the projections onto $\Wcal$ and $\Vcal$, respectively. 
From now on, the notation $A$ denotes the DP-SGDA algorithm and its  output is denoted by  $A(\S) = (A_\wbf(\S), A_\vbf(\S)).$
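To see why the noise levels scale with $G_\wbf$ and $G_\vbf$, the following sketch bounds the sensitivity of the mini-batch gradient. It assumes, as is standard under \textbf{(A1)} for differentiable $f$, that $\|\nabla_\wbf f(\wbf,\vbf;\zbf)\|_2 \leq G_\wbf$ and $\|\nabla_\vbf f(\wbf,\vbf;\zbf)\|_2 \leq G_\vbf$ for all arguments, and that the neighboring datasets $\S, \S'$ differ only in the example with index $k$, which appears exactly once in $I_t$. Writing $g'_{\wbf,t}$ for the mini-batch gradient computed on $\S'$ with the same indices and $\zbf'_k$ for the replaced example, we get
\begin{align*}
\|g_{\wbf,t} - g'_{\wbf,t}\|_2 = \frac{1}{m}\big\|\nabla_\wbf f(\wbf_t,\vbf_t;\zbf_k) - \nabla_\wbf f(\wbf_t,\vbf_t;\zbf'_k)\big\|_2 \leq \frac{2G_\wbf}{m},
\end{align*}
and analogously $\|g_{\vbf,t} - g'_{\vbf,t}\|_2 \leq 2G_\vbf/m$. If the index $k$ is drawn $r$ times (sampling is with replacement), these bounds are multiplied by $r$.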

 
% \vspace{-6pt}
\subsection{Measures of Utility}\label{sec:measure}
% \vspace{-6pt}
Since the model $A(\S)$ is only trained based on the training data $\S$, its empirical behavior as measured by $F_\S$ may not generalize well on test data.  Our goal is to investigate the statistical behavior of  $A(\S)$ on the test data in terms of some population risk. However, unlike the standard statistical learning theory (SLT) setting where there is only a minimization of $\wbf$, we have different measures of population risk due to the minimax structure \citep{zhang2021generalization,lei2021stability}. 
Let $\Ebb[\cdot]$ denote the expectation with respect to the randomness of  algorithm $A$ and data $S$. We are particularly  interested in the following metrics.  

\begin{definition}[Weak Primal-Dual (PD) Risk]
The weak primal-dual population risk of $A(\S)$,  denoted by $\triangle^w(A_\wbf(\S), A_\vbf(\S))$,  is defined as
$$\max_{\vbf\in\Vcal}\Ebb\big[F(A_\wbf(\S),\vbf)\big]\!-\!\min_{\wbf\in\Wcal}\Ebb\big[F(\wbf,A_\vbf(\S))\big].$$
The corresponding weak PD empirical risk, denoted by $\triangle^w_S(A_\wbf(\S), A_\vbf(\S))$,  is defined as
\[
\max_{\vbf\in\Vcal}\Ebb\big[F_S(A_\wbf(\S),\vbf)\big]\!-\!\min_{\wbf\in\Wcal}\Ebb\big[F_S(\wbf,A_\vbf(\S))\big].
\]
\end{definition}
% \begin{definition}[Strong Primal-Dual Risk]
% The strong PD population risk of a model $({\wbf},{\vbf})$ is defined as%~\citep{zhang2020generalization}
% \[
% \triangle^s({\wbf},{\vbf})=\sup_{\vbf'\in\Vcal}F({\wbf},\vbf')-\inf_{\wbf'\in\Wcal}F(\wbf',{\vbf}).
% \]
% The strong PD empirical risk of $({\wbf},{\vbf})$ is defined as
% \[
% \triangle^s_\\S({\wbf},{\vbf})=\sup_{\vbf'\in\Vcal}F_\S({\wbf},\vbf')-\inf_{\wbf'\in\Wcal}F_\S(\wbf',{\vbf}).
% \]
% We refer to $\triangle^s({\wbf},{\vbf})-\triangle^s_\S({\wbf},{\vbf})$ as the strong PD generalization error of the model $({\wbf},{\vbf})$.
% \end{definition}
\begin{definition}[Primal Risk] The primal population risk of $A(\S)$ is given by $R(A_\wbf(\S))=\max_{\vbf\in\Vcal}F(A_\wbf(\S),\vbf)$, and the primal empirical risk is given by $R_S(A_\wbf(\S))=\max_{\vbf\in\Vcal}F_S(A_\wbf(\S),\vbf)$. The excess primal population risk is defined as
$$
\Ebb\big[R(A_\wbf(\S))-\min_{\wbf\in\Wcal}R(\wbf)\big].
$$ 
The corresponding excess primal empirical risk is then
$$
\Ebb\big[R_S(A_\wbf(\S))-\min_{\wbf\in\Wcal}R_S(\wbf)\big].
$$ 
\end{definition}

% The strong PD risk is often referred to as the expected duality gap in the optimization community. We have $\triangle^w(A_\wbf(\S), A_\vbf(\S)) \leq \triangle^s(A_\wbf(\S), A_\vbf(\S))$ by applying Jensen's inequality. 
Meanwhile, the strong PD risk is defined as $\triangle^s({\wbf},{\vbf})=\Ebb\big[\sup_{\vbf'\in\Vcal}F({\wbf},\vbf')-\inf_{\wbf'\in\Wcal}F(\wbf',{\vbf})\big]$. We have $\triangle^w(A_\wbf(\S), A_\vbf(\S)) \leq \triangle^s(A_\wbf(\S), A_\vbf(\S))$ by applying Jensen's inequality. However, when $F$ is strongly-convex-strongly-concave, the distance from the model $(A_\wbf(\S), A_\vbf(\S))$ to the true saddle point $(\wbf^*, \vbf^*) \in \arg\min_{\wbf\in \Wcal}\max_{\vbf \in \Vcal} F(\wbf, \vbf)$ can be bounded by the weak PD population risk, i.e.,\ $\Ebb[\|A_\wbf(\S) - \wbf^*\|_2^2 + \|A_\vbf(\S) - \vbf^*\|_2^2] \leq \Ocal(\triangle^w(A_\wbf(\S), A_\vbf(\S)))$. For certain problems, such as the learning problem for Markov decision processes in Appendix \ref{sec:motivating-example}, it suffices to bound the weak PD risk. The primal risk is more meaningful when one is concerned with the risk with respect to the primal variable, as in the AUC maximization problem.
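The comparison between the weak and strong PD risks can be spelled out as follows. For every fixed $\vbf \in \Vcal$ we have $F(A_\wbf(\S),\vbf) \leq \sup_{\vbf'\in\Vcal}F(A_\wbf(\S),\vbf')$; taking expectations and then the maximum over $\vbf$ on the left-hand side gives the first inequality below, and the second follows in the same way:
\begin{align*}
\max_{\vbf\in\Vcal}\Ebb\big[F(A_\wbf(\S),\vbf)\big] &\leq \Ebb\Big[\sup_{\vbf'\in\Vcal}F(A_\wbf(\S),\vbf')\Big],\\
\min_{\wbf\in\Wcal}\Ebb\big[F(\wbf,A_\vbf(\S))\big] &\geq \Ebb\Big[\inf_{\wbf'\in\Wcal}F(\wbf',A_\vbf(\S))\Big].
\end{align*}
Subtracting the second line from the first recovers the bound of the weak PD risk of $A(\S)$ by its strong PD risk.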
% The corresponding empirical risks are defined by replacing $F$ with $F_S$. As we shall see in the following section, bounding the empirical risks of DP-SGDA serves as an important step before bounding the population risks.

% \begin{definition}[Uniform Argument Stability]
% Let $S = \{\zbf_1,\cdots,\zbf_n\}$ and $S' = \{\zbf'_1,\cdots,\zbf'_n\}$ be two neighborhood datasets that differ only in one single example. For any $\gamma \in (0,1)$, $\Acal$ is called $\epsilon_{\text{stab}}$-UAS with probability $1-\gamma$ if for any neighborhood datasets $S$ and $S'$,
% \[
% \Pbb_\Acal[\|\Acal(S) - \Acal(S')\|_2 > \epsilon_{\text{stab}}] \leq \gamma.
% \]	
% \end{definition}

 
% \vspace{-6pt}
\section{Main Results}\label{sec:results}
% \vspace{-6pt}

In this section, we present our main theoretical results for DP-SGDA. For the privacy guarantee, we leverage the moments accountant method \citep{abadi2016deep}, which yields a tight bound on the privacy loss of adaptive Gaussian mechanisms with amplification by subsampling. Below we summarize a specific version of this method that suffices for our purpose.%\footnote{In our case we use uniform sampling on each iteration, as opposed to the Poisson sampling in \citep{abadi2016deep}; however, it is possible to verify that similar moment estimates lead to our stated result \citep{wang2019subsampled}}.



\begin{theorem}\label{thm:moments-accountant-privacy}Let \textbf{(A1)} hold true. Then, 
there exist constants $c_1, c_2$ and $c_3$ so that given the mini-batch size $m$ and total iterations $T$, for any $\epsilon < c_1 m^2T/n^2$, Algorithm \ref{alg:dp-sgda} is $(\epsilon, \delta)$-differentially private for any $\delta > 0$ if we choose
\begin{equation}\label{eq:sigma-sigma}
\sigma_\wbf \!=\! \frac{c_2 G_\wbf \sqrt{T\log(1/\delta)}}{n\epsilon},\, \sigma_\vbf \!=\! \frac{c_3 G_\vbf \sqrt{T\log(1/\delta)}}{n\epsilon}.
\end{equation}
\end{theorem}
 
The proof of Theorem \ref{thm:moments-accountant-privacy} is given in Appendix \ref{sec:proof-privacy}.
% \yiming{Do we prove this theorem? what do you mean by $\sigma$?  does $\sigma$ mean the same for primal and dual variables? What is $ \Delta(\Acal)$ in our case?} 

\begin{remark}\label{rem:choice-of-param}
In practice, given the privacy budget $\epsilon, \delta$ and parameters $m, T$, the constants $c_2, c_3$ and hence $\sigma_\wbf, \sigma_\vbf$ can be found by grid search \citep{abadi2016deep}. Here we provide a set of parameters that satisfies the conditions in that reference and in Theorem \ref{thm:moments-accountant-privacy}. Specifically, choosing $\epsilon \leq 1$, $\delta\leq 1/n^2$ and $m = \max(1, n\sqrt{\epsilon/(4T)})$ yields the explicit noise levels $
\sigma_\wbf = \frac{8 G_\wbf \sqrt{T\log(1/\delta)}}{n\epsilon}, \sigma_\vbf = \frac{8 G_\vbf \sqrt{T\log(1/\delta)}}{n\epsilon}.$
\end{remark}
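
As a rough numerical illustration of Remark \ref{rem:choice-of-param}, consider the hypothetical values $n = 10^4$, $T = n$ and $\epsilon = 1$ (chosen only for this example and not taken from our experiments), with $\delta = 1/n^2 = 10^{-8}$ and $\log$ read as the natural logarithm. Then
\begin{align*}
m &= n\sqrt{\epsilon/(4T)} = 10^4\sqrt{1/(4\cdot 10^4)} = 50,\\
\frac{\sigma_\wbf}{G_\wbf} &= \frac{8\sqrt{T\log(1/\delta)}}{n\epsilon} = \frac{8\sqrt{10^4\cdot 8\log 10}}{10^4} \approx 0.34,
\end{align*}
and likewise $\sigma_\vbf \approx 0.34\, G_\vbf$. Under these assumed values, the per-coordinate noise level is about one third of the corresponding Lipschitz constant.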

\begin{remark}\label{rem:different-sensitivity}
Our Algorithm \ref{alg:dp-sgda} allows the application of independent noises $\xi_t, \zeta_t$ with different $\sigma_\wbf, \sigma_\vbf$, respectively. In  \citet{boob2021optimal}, a uniform $\sigma$ is used (Theorem 5.4 or 7.4 there) for both primal and dual variables. In many examples, the primal and dual gradients $\nabla_\wbf f(\wbf_t, \vbf_t,\zbf_{i_t^j}), \nabla_\vbf f(\wbf_t, \vbf_t,\zbf_{i_t^j})$ enjoy different Lipschitz constants ($\ell_2$-sensitivity).  Therefore, our treatment leads to a more delicate way of calibrating the variances of the Gaussian noises. As we shall see in the experiments in Section \ref{sec:exp}, this treatment enables Algorithm \ref{alg:dp-sgda} to achieve better performance.
\end{remark}


In the subsequent subsections, we present the main contribution of this paper, i.e., the utility bounds of DP-SGDA for the convex-concave and nonconvex-strongly-concave cases, respectively.

% \vspace{-6pt}
\subsection{Convex-Concave Case}\label{sec:convex} 
% \vspace{-6pt}
In this subsection,  we present the utility bound of DP-SGDA for  the convex-concave case in terms of the weak PD risk of the output $(\bar{\wbf}_T,\bar{\vbf}_T)$ of Algorithm \ref{alg:dp-sgda}. 

\begin{theorem}\label{thm:sgda-utility}
Assume the function $f$ is convex-concave. Assume $\Wcal$ and $\Vcal$ are bounded so that $\max_{\wbf\in \Wcal}\|\wbf\|_2 \leq D_\wbf$ and $\max_{\vbf\in \Vcal}\|\vbf\|_2 \leq D_\vbf$, and let $D = \max\{D_\wbf, D_\vbf\}$. Let the stepsizes $\eta_{\wbf,t} = \eta_{\vbf,t} = \eta$ for all $t \in [T]$ with some $\eta > 0$. If either of the following conditions holds,
\begin{enumerate}[topsep=-1pt, partopsep=-1pt]
\item[a)] Assumptions \textbf{(A1)} and \textbf{(A3)} hold true and we choose $T \asymp n$ and $\eta \asymp 1/\big(\max\{\sqrt{n}, \sqrt{d\log(1/\delta)}/\epsilon\}\big)$, 
\item[b)] or Assumption  \textbf{(A1)} holds true and we choose $T \asymp n^2$ and $\eta \asymp 1/\big(n\max\{\sqrt{n}, \sqrt{d\log(1/\delta)}/\epsilon\}\big)$,
\end{enumerate}
then Algorithm \ref{alg:dp-sgda} satisfies
\begin{align*}
\triangle^w(\bar{\wbf}_T,\bar{\vbf}_T) = \Ocal\Big( \max\Big\{\frac{1}{\sqrt{n}}, \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big\}\Big).
\end{align*}
\end{theorem}

The detailed proof can be found in Appendix \ref{sec:proof-convex}.  The proof mainly relies on the concept of stability \citep{bousquet2002stability,charles2018stability,hardt2016train,kuzborskij2018data}.  Specifically,   the weak PD population risk  can be decomposed as follows:  
\begin{align*}\label{eq:weak-err-decomp}
\triangle^w(\bar{\wbf}_T, \bar{\vbf}_T)  = & \triangle^w(\bar{\wbf}_T, \bar{\vbf}_T) - \triangle^w_\S(\bar{\wbf}_T, \bar{\vbf}_T)\\
& + \triangle^w_\S(\bar{\wbf}_T, \bar{\vbf}_T), \numberthis
\end{align*}
where the term $\triangle^w(\bar{\wbf}_T, \bar{\vbf}_T) - \triangle^w_\S(\bar{\wbf}_T, \bar{\vbf}_T)$ is the generalization error and $\triangle^w_\S(\bar{\wbf}_T, \bar{\vbf}_T)$ is the optimization error.


The estimation for the optimization error can be conducted by standard techniques \citep{nemirovski2009robust}. We give a self-contained proof in Appendix \ref{sec:cc-opt}. The generalization error is estimated using  a concept of weak stability \citep{lei2021stability}. Specifically, we say the randomized algorithm $A$ is  {\em $\varepsilon$-weakly-stable} if,  for any neighboring sets $\S, \S'$ differing at one single datum,  there holds \begin{align*} & \sup_\zbf\big(\!\sup_{\vbf \in \Vcal}\Ebb_{A}[f(A_\wbf(\S), \vbf; \zbf) - f(A_\wbf(\S'), \vbf; \zbf)]\\  &  + \sup_{\wbf \in \Wcal}\Ebb_{A}[f(\wbf, A_\vbf(\S); \zbf) \!\!-\!\! f(\wbf, A_\vbf(\S'); \zbf)]\big) \!\!\leq \varepsilon.\end{align*}
We know from \citet{lei2021stability} that  $\varepsilon$-weak-stability implies 
    $\triangle^w(A_{\wbf}(\S),A_{\vbf}(\S))-\triangle^w_\S(A_{\wbf}(\S),A_{\vbf}(\S))\leq\varepsilon.$


In Appendix \ref{sec:cc-gen}, we prove the weak stability of DP-SGDA (i.e.\ Algorithm \ref{alg:dp-sgda}) for both smooth and nonsmooth cases. Plugging the estimates of the optimization error and the generalization error into \eqref{eq:weak-err-decomp} yields the bound in Theorem \ref{thm:sgda-utility}. %One can find detailed proofs  in Appendix \ref{sec:proof-convex}.  
We end this subsection with some remarks. 

% \begin{remark}
% The same optimal utility bound $\Ocal(\max\{\frac{1}{\sqrt{n}}, \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\})$ was claimed in \citep{boob2021optimal} in terms of the VI gap (Theorem 5.4 there). By the monotonicity assumption, their results imply the same bound on the primal-dual risk $\triangle(A_\wbf(\S), A_\vbf(\S)) = \Ebb[\max_\vbf F(A_\wbf(\S), \vbf) - \min_\wbf F(\wbf, A_\vbf(\S))]$ of the minimax problem \eqref{eq:SSP}. Such measure of generalization seems to be stronger than ours since
% $ 
% \triangle^w(A_\wbf(\S), A_\vbf(\S)) \leq \triangle(A_\wbf(\S), A_\vbf(\S)).$
% However, the discussion between stability and generalization there is not rigorous. Indeed, borrowing their notations, equation (4.3) from their paper, i.e., 
% $
% \Ebb_{\beta_i}\langle F(u; \beta_i), \Acal(\S) - u\rangle \leq \Ebb_{\beta_j}\langle F(u; \beta_j), \Acal(\S) - u\rangle\\
% + M \Ebb_{\beta'_j}[\|\Acal(\S_j^i) - \Acal(\S^j)\|],$
% only holds when $u$ is independent of $\beta_j$. It does not hold when (4.3) is applied in (4.5) with $\hat{u}_i = \arg\max_{u \in \Wcal} \Ebb_{\beta_i}\langle F(u; \beta_i), \Acal(\S) - u\rangle$ since it depends on $\beta_j$.
% \end{remark}

\begin{remark}\label{rem:optimality}
The utility bound $\Ocal\Big(\max\Big\{\frac{1}{\sqrt{n}}, \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big\}\Big)$ is optimal for convex-concave minimax problems. A lower bound of the same order has been established in the convex ERM setting \citep{bassily2014private, bassily2019private, feldman2020private}, where the measure of utility is given by $\Ebb[F(A_\wbf(S)) - \min_{\wbf \in \Wcal} F(\wbf)]$. Here we slightly abuse the notation to indicate $F$ as the population risk and $A_\wbf(S)$ as the algorithm for the ERM problem. Since convex minimization is the special case of the convex-concave minimax problem in which the dual variable is constant, this lower bound also applies to our setting.
\end{remark}
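
To see which of the two terms in the bound of Remark \ref{rem:optimality} dominates, one can use the following elementary equivalence, which we include only as a reading aid:
\begin{align*}
\frac{1}{\sqrt{n}} \geq \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}
\iff \sqrt{n}\,\epsilon \geq \sqrt{d\log(1/\delta)}
\iff n \geq \frac{d\log(1/\delta)}{\epsilon^2}.
\end{align*}
Hence, when $n \geq d\log(1/\delta)/\epsilon^2$, the bound is of order $1/\sqrt{n}$ and the privacy constraint does not change the order of the rate. Otherwise the privacy term $\sqrt{d\log(1/\delta)}/(n\epsilon)$ dominates.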

\begin{remark}\label{rem:compare-extragradient}
The same optimal utility was claimed in \citet{boob2021optimal}. Yet our results offer two theoretical gains over theirs. First, when the smoothness assumption holds, Part a) in our Theorem \ref{thm:sgda-utility} shows the optimal utility with $T = \Ocal(n)$ iterations and $\Ocal(n^{3/2})$ gradient computations by Remark \ref{rem:choice-of-param}, while their single-loop algorithm (Algorithm 1 there) requires $\Ocal(n^2)$ gradient computations in their Theorem 5.4. They further improved the gradient complexity to $\Ocal(n^{3/2}\log(n))$ in Theorem 7.4, which, however, requires an extra inner-loop subroutine (Algorithm 2 there). Second, we also derive the same optimal bound under only the Lipschitz continuity assumption for the nonsmooth case, which was not addressed in \citet{boob2021optimal}.
\end{remark}
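
The gradient count quoted in Remark \ref{rem:compare-extragradient} can be checked by a short calculation. The sketch below assumes the parameter choice of Remark \ref{rem:choice-of-param}, takes $T = n$ exactly for simplicity (the theorem only requires $T \asymp n$), and ignores the rounding of $m$ to an integer. Each iteration evaluates $m$ per-example gradients, so for $n\epsilon \geq 4$ we have
\begin{align*}
m = n\sqrt{\frac{\epsilon}{4T}} = \frac{\sqrt{n\epsilon}}{2},
\qquad
mT = \frac{n^{3/2}\sqrt{\epsilon}}{2} \leq \frac{n^{3/2}}{2},
\end{align*}
where the last step uses $\epsilon \leq 1$. When $n\epsilon < 4$ the maximum in the definition of $m$ gives $m = 1$ and $mT = n$, which is again at most $n^{3/2}$.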

% \footnote{The bound was claimed in terms of the strong PD risk, which seems to be stronger than ours. However, their proof is not rigorous. See Appendix \ref{sec:proof-bug} for detail.}. 


% \begin{theorem}\label{thm:convex-concave-opt}
% Suppose \textbf{(A1)} holds, and $F_S$ is convex-concave. Assume $\Wcal$ and $\Vcal$ are bounded so that $\max_{\wbf\in \Wcal}\|\wbf\|_2 \leq D_\wbf$, $\max_{\vbf\in \Vcal}\|\vbf\|_2 \leq D_\vbf$. Let the stepsizes $\eta_{\wbf,t} = \eta_{\vbf,t} = \eta$, $t \in [T]$ for some $\eta > 0$. If we choose $T \asymp n$ and $\eta \asymp D/\big(G\max\{\sqrt{n}, \sqrt{d\log(1/\delta)}/\epsilon\}\big)$, then Algorithm \ref{alg:dp-sgda} satisfies %\yunwen{$\eta$ should be $\eta^{\wbf}_t$?}
% \[
% \Ebb_A[\triangle^s_S(\bar{\wbf}_T, \bar{\vbf}_T)] = \Ocal\Big(\max\Big\{\frac{1}{\sqrt{n}}, \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big\}\Big)
% \]
% \end{theorem}
% \vspace{-6pt}
\subsection{Nonconvex-Strongly-Concave Case}\label{sec:nonconvex-strongly-concave}
% \vspace{-6pt}
We proceed to the case when $f$ is nonconvex-strongly-concave. In this case, we present utility bounds of DP-SGDA in terms of the primal excess risk, i.e.,  $R(\wbf_T) - \min_{\wbf\in \Wcal} R(\wbf)$,  where $\wbf_T$ is the last iterate  of Algorithm \ref{alg:dp-sgda}. Generally speaking, a saddle point may not always exist without the convexity assumption. Since our goal in this paper is to find global optima, we assume that a saddle point of the empirical minimax problem exists, i.e., there exists $(\hat{\wbf}_\S, \hat{\vbf}_\S)$ such that,   for any $\wbf\in \Wcal$ and $\vbf\in \Vcal$,
\begin{align*}
F_\S(\hat{\wbf}_\S, \vbf) \leq F_\S(\hat{\wbf}_\S, \hat{\vbf}_\S) \leq F_\S(\wbf, \hat{\vbf}_\S). 
\end{align*}

To estimate the primal excess risk,  we define $R^*_\S = \min_{\wbf\in\Wcal} R_\S(\wbf), \text{ and } R^* = \min_{\wbf\in\Wcal} R(\wbf).$ Then, for any $\wbf^* \in \arg\min_\wbf R(\wbf)$ we have the error decomposition:
\begin{align*}\label{eq:err-decomp}
\Ebb[R(\wbf_T) & \!-\! R^*] \!=\!  \Ebb[R(\wbf_T) \!\!-\!\! R_\S(\wbf_T)] \!+\! \Ebb[R_\S(\wbf_T) \!\!-\!\! R_\S^*]\\
& +  \Ebb[R_\S^* \!-\! R_\S(\wbf^*)] \!+\! \Ebb[R_\S(\wbf^*) \!-\! R(\wbf^*)]\\
\leq & \Ebb[R(\wbf_T) \!-\! R_\S(\wbf_T)] \!+\! \Ebb[R_\S(\wbf^*) \!-\! R(\wbf^*)]\\
& + \Ebb[R_\S(\wbf_T) \!-\! R_\S^*],\numberthis
\end{align*}
where the last inequality follows from the fact that $R_\S^* - R_\S(\wbf^*)\le 0 $ since $R_\S^* = \min_{\wbf\in \Wcal}R_\S(\wbf)$. The term $\Ebb[R_\S(\wbf_T) - R_\S^*]$ is the {\em optimization error} which characterizes the discrepancy between the primal empirical risk of an output of Algorithm \ref{alg:dp-sgda} and the least possible one. The term $\Ebb[R(\wbf_T) - R_\S(\wbf_T)]  + \Ebb[R_\S(\wbf^*) - R(\wbf^*)]$ is called the {\em generalization error} which measures the discrepancy  between the primal population risk and the empirical one. The estimations for these two errors are described as follows. 

\textbf{Optimization Error.} The next theorem characterizes the primal empirical risk of DP-SGDA under the PL-SC assumption.

\begin{theorem}\label{thm:sgda-primal-opt}
Assume Assumptions \textbf{(A1)} and \textbf{(A2)} hold true,  and the function $F_\S(\wbf, \cdot)$ is $\rho$-strongly concave and $F_\S(\cdot, \vbf)$ satisfies $\mu$-PL condition. Assume $\Vcal$ is bounded. Let $\kappa = L/\rho$.  If we choose $\eta_{\wbf,t} \asymp \frac{1}{\mu t}$ and $\eta_{\vbf,t} \asymp \frac{\kappa^{2.5}}{\mu^{1.5} t^{2/3}}$, then $$\Ebb[R_\S(\wbf_{T+1}) - R_\S^*] = \Ocal\Big(\frac{\kappa^{3.5}}{\mu^{2.5}}\Big(\frac{1/m + d(\sigma_\wbf^2 + \sigma_\vbf^2)}{T^{2/3}}\Big)\Big).$$
\end{theorem}
We provide the proof of  Theorem \ref{thm:sgda-primal-opt} in Appendix \ref{sec:sgda-primal-opt}. In the non-private setting, i.e., $\sigma_\wbf=\sigma_\vbf=0$, Theorem \ref{thm:sgda-primal-opt} implies that the convergence rate in terms of the primal empirical risk is of the order  $\Ocal(\frac{\kappa^{3.5}}{\mu^{2.5} T^{2/3}}),$ which is a new result even in the non-private case as far as we are aware.  

In \citet{lin2020gradient}, the local convergence of SGDA in the non-private case was proved in terms of the metric $\Ebb_{\tau}[\|\nabla R_\S(\wbf_\tau)\|_2^2]$ where $\tau$ is chosen uniformly at random from the set $\{1, 2, \ldots, T\}.$  Our analysis is much more involved since it proves the global convergence of the last iterate $\wbf_T$.  Our main idea is to prove the coupled recursive inequalities for two terms, i.e., 
$a_t = R_\S(\wbf_t) - R_\S^*$ and $b_t = \|\vbf_t - \hat{\vbf}_\S(\wbf_t)\|_2^2$ where $ \hat{\vbf}_\S(\wbf_t)= \arg\max_{\vbf\in \Vcal} F_\S(\wbf_t, \vbf)$,  and  then  carefully derive the convergence rate for $a_t+ \lambda_t b_t$ by choosing $\lambda_t$ appropriately. The convergence rate and its proof may be of independent interest. One can find more detailed arguments in Appendix \ref{sec:sgda-primal-opt}. 

% \begin{theorem}
% For SGDA, if $\eta_{\wbf,t} = \Ocal(\frac{1}{t})$ and $\eta_{\vbf,t} = \Ocal(\frac{1}{t^{2/3}})$, then for any $\lambda > 0$, the iterates $\{\wbf_t, \vbf_t\}_{t \in [T]}$ satisfies the following inequality
% \begin{align*}
% \Ebb[R_S(\Wbf_T) - R_S(\wbf^*)] + \lambda \Ebb[\|\vbf^*(\wbf_T) - \vbf_T\|_2^2] = \Ocal(\frac{1}{T^{1/3}}). 
% \end{align*}
% \end{theorem}


\textbf{Generalization Error.} We present the bound for the generalization error which is proved again using the stability approach.

We begin with a discussion of the saddle points. While the saddle point $(\hat{\wbf}_\S, \hat{\vbf}_\S)$ may not be unique,  $\hat{\vbf}_\S$ must be unique if $F_\S(\wbf, \vbf)$ is strongly-concave in $\vbf$ (see Proposition \ref{lem:unique-v} in Appendix \ref{sec:proof-nonconvex}).  Therefore, we can define $\pi_\S(\wbf)$ as the projection of $\wbf$ onto the set of saddle points ${\Omega}_\S = \bigl\{\hat{\wbf}_\S:   (\hat{\wbf}_\S, \hat{\vbf}_\S)  \in \arg\min_{\wbf\in \Wcal}\max_{\vbf\in \Vcal}F_\S(\wbf, \vbf)\bigr\} =\bigl\{\hat{\wbf}_\S:  \hat{\wbf}_\S \in \arg\min_{\wbf\in \Wcal}F_\S(\wbf, \hat{\vbf}_\S)\bigr\} $.

Recall that $\wbf_T$ is the iterate of DP-SGDA at time $T$ based on the training data $\S$. Likewise, we denote by $\wbf'_T$ the iterate at time $T$ based on the training set $\S'$, which differs from $\S$ in a single datum. Since there may be multiple saddle points, we need the following critical assumption for estimating the generalization error.  
\begin{assumption}[\textbf{A4}]\label{ass:unique-projection}
For the (randomized) algorithm DP-SGDA,   assume that $\pi_{\S'}(\pi_\S (\wbf_T)) = \pi_{\S'} (\wbf'_T)$ for any neighboring sets $\S$ and $ \S'.$    
\end{assumption}  
 
Assumption \textbf{(A4)} was introduced in \citet{charles2018stability} for studying the stability of SGD in  the non-convex case, which only involves the minimization over $\wbf$. In our case, \textbf{(A4)} holds true if either the saddle point is unique (e.g., $F_\S$ is strongly-convex-strongly-concave) or the two sets of saddle points based on $\S$ and $\S'$, i.e., $\Omega_\S$  and $\Omega_{\S'}$, do not differ too much.  Since our algorithm satisfies $(\epsilon, \delta)$-DP, the distributions of $\wbf_T$ and $\wbf'_T$ generated from two neighboring sets $S$  and $S'$ are ``close'', which suggests that $\sup_{S,S'}\|\pi_{S'}(\pi_S(\wbf_T)) - \pi_{S'}(\wbf'_T)\|_2$ can be small. Proving such a statement remains an interesting open problem.

Now we can state the results on the generalization error. 

\begin{theorem}[Generalization Error]\label{thm:sgda-primal-gen}
Assume Assumptions \textbf{(A1)}, \textbf{(A3)} and \textbf{(A4)}  hold true, and  assume  the function $f(\wbf, \cdot; \zbf)$ is $\rho$-strongly concave and $F_\S(\cdot, \vbf)$ satisfies $\mu$-PL condition. Let $\kappa = L/\rho$. If $\Ebb[R_\S(\wbf_{T}) - R_\S^*] \leq \varepsilon_T$, then
$$\!\Ebb[R(\wbf_T) \!-\! R_\S(\wbf_T)]  \!\leq \!  (1\!+\!\kappa)G_\wbf\Big(\sqrt{\frac{\varepsilon_T}{2\mu}} \!+\! \frac{1}{n}\sqrt{\frac{G_\wbf^2}{4\mu^2}\! + \!\frac{G_\vbf^2}{\rho\mu}} \Big), 
$$
and 
$$
\Ebb[R_\S(\wbf^*) - R(\wbf^*)] \leq  \frac{4G_\vbf^2}{\rho n}.$$
\end{theorem}
The proof of Theorem \ref{thm:sgda-primal-gen} is provided in Appendix \ref{sec:sgda-primal-gen}. 


\begin{remark} The  generalization error bounds given in Theorem \ref{thm:sgda-primal-gen} indicate that if the optimization error $\Ebb[R_\S(\wbf_{T}) - R_\S^*] $ is small, then the generalization error will be small.  This is consistent with the observation in the stability and generalization analysis of SGD \citep{charles2018stability,hardt2016train,lei2021sharper} for minimization problems, in the sense that ``optimization can help generalization''. 
\end{remark}


We can derive the following utility bound for DP-SGDA by combining the results in Theorems \ref{thm:sgda-primal-gen} and  \ref{thm:sgda-primal-opt}. 

\begin{theorem}\label{thm:utility-nonconvex}
Under the same assumptions as in Theorem \ref{thm:sgda-primal-gen}, if we choose $T \asymp n$, $\eta_{\wbf,t} \asymp \frac{1}{\mu t}$ and $\eta_{\vbf,t} \asymp \frac{\kappa^{2.5}}{\mu^{1.5} t^{2/3}}$, then 
$$
\Ebb[R(\wbf_{T+1}) - R^*] = \Ocal\Big(\frac{\kappa^{2.75}}{\mu^{1.75}}\Big(\frac{1}{n^{1/3}} + \frac{\sqrt{d\log(1/\delta)}}{n^{5/6}\epsilon}\Big)\Big).$$
\end{theorem}

The proof can be found in Appendix \ref{sec:agda-utility}. 
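
The rate in Theorem \ref{thm:utility-nonconvex} can be traced heuristically as follows. This is only a sketch: it treats $G_\wbf$ and $G_\vbf$ as constants, uses $\kappa \geq 1$ and $1/m \leq 1$, ignores the index shift between $\wbf_T$ and $\wbf_{T+1}$, and leaves the rigorous argument to the appendix. With $T \asymp n$, \eqref{eq:sigma-sigma} gives $\sigma_\wbf^2 + \sigma_\vbf^2 \asymp \log(1/\delta)/(n\epsilon^2)$, so Theorem \ref{thm:sgda-primal-opt} yields
\begin{align*}
\varepsilon_T = \Ocal\Big(\frac{\kappa^{3.5}}{\mu^{2.5}}\Big(\frac{1}{n^{2/3}} + \frac{d\log(1/\delta)}{n^{5/3}\epsilon^2}\Big)\Big).
\end{align*}
Substituting this into the first bound of Theorem \ref{thm:sgda-primal-gen} and using $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b}$, the leading term becomes
\begin{align*}
(1+\kappa)G_\wbf\sqrt{\frac{\varepsilon_T}{2\mu}} = \Ocal\Big(\frac{\kappa^{2.75}}{\mu^{1.75}}\Big(\frac{1}{n^{1/3}} + \frac{\sqrt{d\log(1/\delta)}}{n^{5/6}\epsilon}\Big)\Big),
\end{align*}
since $\kappa\sqrt{\kappa^{3.5}/\mu^{3.5}} = \kappa^{2.75}/\mu^{1.75}$. The remaining contributions to \eqref{eq:err-decomp}, namely the optimization error $\varepsilon_T$ itself and the $\Ocal(1/n)$ terms, decay faster in $n$.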

 
%\begin{remark}Consider the nonconvex minimization problem $\min_{\wbf \in \Wcal} \widetilde{R}(\wbf) = \Ebb_{\zbf \in \Dcal}[\tilde{f}(\wbf; \zbf)]$ and its empirical counterpart $\widetilde{R}_\S(\wbf)$. Since the minimization objectives $\widetilde{R}(\wbf)$ and $\widetilde{R}_\S(\wbf)$ can be treated as $R(\wbf)$ and $R_\S(\wbf)$, respectively, when $\Vcal$ is a one-point set, our bound naturally extends to $\Ebb[\widetilde{R}(\wbf_T) - \min_{\wbf\in \Wcal}\widetilde{R}(\wbf)]$. This is the first-ever-known excess population for nonconvex minimization problem under the differential privacy constraint and the PL condition. It is worth mentioning that \citep{zhang2021private} obtained the excess population risk $\Ocal(\frac{1}{\sqrt{\theta\mu n}}+\frac{d\log(1/\delta)}{\theta^2\mu n^2\epsilon^2})$ but they require both $\mu$-PL condition and $\theta$-weakly-quasi-convex assumption.  \yiming{the purpose of this remark to explain our results are no good or just review the related work.  If it is related work, we already mentioned previously.}\end{remark}


% the best-known utility bound is $\Ebb_{\tau}[\|\nabla \widetilde{R}(\wbf_\tau)\|_2^2] = \Ocal\Big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big)$ where $\tau$ is choosing uniformly at random from $[T]$ \citep{zhou2020private}. In terms of population risk, \citep{wang2019differentially} provided the only bound $\Ebb[\widetilde{R}(\wbf_T) - \min_{\wbf\in \Wcal}\widetilde{R}(\wbf)] = \Ocal\Big(\frac{d\log(1/\delta)}{\log(n)\epsilon^2}\Big)$.

% \vspace{-6pt}
\section{Experiments}\label{sec:exp}
% \vspace{-6pt}
In this section, we evaluate the performance of DP-SGDA by taking AUC maximization as an example. Due to space limitations, we present the most significant information and results of our experiments, while more detailed information and additional results are given in Appendices \ref{sec:add-details} and \ref{sec:additional-exp}. 
% \vspace{-6pt}
\subsection{Experimental Settings}\label{sec:setting}
% \vspace{-6pt}
\textbf{Baseline Model.} We perform experiments on the problem of AUC maximization with the least square loss to evaluate the DP-SGDA algorithm in linear and non-linear settings (two-layer multilayer perceptron (MLP)). In this case, AUC maximization can be formulated as
 \begin{align*}
\min_{\theta\in \Theta} \Ebb_{\zbf,\zbf'}[(1 - h(\theta;\xbf) + h(\theta;\xbf'))^2|y=1, y'=-1], \end{align*}
where $h: \Theta\times \Rbb^d\rightarrow\Rbb$ is the scoring function. As shown in \citet{ying2016stochastic}, it  is equivalent to a minimax problem:
\begin{align*}
\min_{\wbf=(\theta, a, b)}\max_{\vbf}	\Ebb_\zbf[f(\theta, a, b, \vbf;\zbf)],
\end{align*}
where $f = (1-p)(h(\theta;\xbf) - a)^2\Ibb[y=1] + p(h(\theta;\xbf) - b)^2\Ibb[y=-1] + 2(1+\vbf)(ph(\theta;\xbf)\Ibb[y=-1] - (1 - p)h(\theta;\xbf)\Ibb[y=1]) - p(1-p)\vbf^2$ and $p = \Pbb[y=1]$.

When $h$ is a linear function, the AUC learning objective above is convex-strongly-concave. On the other hand, when $h$ is an MLP, it becomes a nonconvex-strongly-concave minimax problem. In addition, following \citet{liu2019stochastic}, we use Leaky ReLU as the activation function for the MLP. It was shown in their paper that the empirical AUC objective satisfies the PL condition with this choice of $h$. Unless otherwise stated, we use $256$ hidden units in the MLP and a mini-batch size of $64$ during training.    
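
The strong concavity in $\vbf$ can be made explicit by a short calculation, sketched here under the assumption that $0 < p < 1$ and that $\vbf$ ranges over a set containing the maximizer. Writing $F(\wbf,\vbf) = \Ebb_\zbf[f(\theta, a, b, \vbf;\zbf)]$ and using $\Pbb[y=1] = p$, the terms of $f$ that involve $\vbf$ give
\begin{align*}
F(\wbf,\vbf) = {} & C(\wbf) - p(1-p)\vbf^2 + 2p(1-p)(1+\vbf)\\
& \times\big(\Ebb[h(\theta;\xbf)\,|\,y=-1] - \Ebb[h(\theta;\xbf)\,|\,y=1]\big),
\end{align*}
where $C(\wbf)$ collects the terms independent of $\vbf$ (a symbol introduced only for this sketch). Hence $\partial^2 F/\partial \vbf^2 = -2p(1-p)$, so $F(\wbf,\cdot)$ is strongly concave with parameter $2p(1-p)$, and setting the derivative to zero gives the maximizer $\Ebb[h(\theta;\xbf)\,|\,y=-1] - \Ebb[h(\theta;\xbf)\,|\,y=1]$. The same second derivative holds for each $f(\wbf,\cdot;\zbf)$, because the quadratic term $-p(1-p)\vbf^2$ does not depend on $\zbf$.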

\textbf{Datasets and Evaluation Metrics.}
Our experiments are based on three popular datasets, namely ijcnn1 \citep{chang2011libsvm}, MNIST \citep{lecun1998gradient}, and Fashion-MNIST \citep{xiao2017fashion} that have been used in previous studies. For MNIST and Fashion-MNIST, following \citet{gao2013one, ying2016stochastic}, we transform their classes into  binary classes by randomly partitioning the data into two groups, each with an equal number of classes. 
% Furthermore, the features in each dataset are first normalized to [0,1] and then be normalized using their mean and standard deviation. 
For ijcnn1, we randomly split its original training set into new training ($80\%$) and testing ($20\%$) sets. For MNIST and Fashion-MNIST, we use their original training and testing sets. For each method, the reported performance is obtained by averaging the AUC scores on the test set over $5$ random seeds (which control the initial $\wbf$ and $\vbf$, the sampling and the noise generation).

\begin{table*}[th!]
% \captionsetup{font=footnotesize}
\centering
\setlength\tabcolsep{2.5pt}
{
\begin{tabular}{|c|cc|c|cc|c|cc|c|}
\hline
Dataset & \multicolumn{3}{c|}{ijcnn1}         & \multicolumn{3}{c|}{MNIST}         & \multicolumn{3}{c|}{Fashion-MNIST}         \\ \hline
\multirow{2}{*}{Algorithm} & \multicolumn{2}{c|}{Linear} &  MLP  & \multicolumn{2}{c|}{Linear} &   MLP    & \multicolumn{2}{c|}{Linear} &    MLP   \\ \cline{2-10} 
             &    NSEG       &    DP-SGDA       &    DP-SGDA   &    NSEG       &    DP-SGDA       &    DP-SGDA       &    NSEG       &    DP-SGDA       &    DP-SGDA       \\ \hline
Original &    92.191 &  92.448 &  96.609  & 93.306  & 93.349  &99.546  & 96.552 &    96.523    &   98.020   \\ \hline
 \makecell{$\epsilon$=0.1}     & 90.106    &  91.110   &  92.763 &   91.247  &  91.858 &  97.878&  95.446  &  95.468   & 95.692\\ \cline{1-10} 
                           \makecell{$\epsilon$=0.5}    &  90.346 & 91.357   &  95.840 &   91.324 &  92.058 &  98.656 &  95.530&  95.816    & 96.988   \\ \cline{1-10} 
                            \makecell{$\epsilon$=1}     & 90.355 & 91.371 &  96.167 &   91.330 &  92.070 & 98.705&  95.534 &  95.834  & 97.102\\ \cline{1-10} 
                           \makecell{$\epsilon$=5}   & 90.363 & 91.383 &  96.294 &  91.334  &  92.078 & 98.742 & 95.538&   95.848 & 97.198\\ \cline{1-10} 
                             \makecell{$\epsilon$=10}  & 90.363& 91.386 &  96.297  &   91.334 &  92.080 &  98.747& 95.539&    95.850  & 97.213  \\ \hline
\end{tabular}
% \vspace{-0.4em}
\caption{\it Comparison of AUC performance of NSEG and DP-SGDA (Linear and MLP settings) on three datasets with different $\epsilon$ and $\delta$=1e-6. ``Original'' means that no noise is added in the algorithms ($\epsilon=\infty$).}
\label{tab:general_performance_partial}
}
% \vspace{-0.4em}
\end{table*}



\begin{figure*}[t]
\begin{subfigure}[t]{0.32\linewidth}
    \includegraphics[width=\linewidth]{figs/noise_size_delta-6.pdf}
    % \caption{}
\end{subfigure}%
    \hfill%
\begin{subfigure}[t]{0.32\linewidth}
    \includegraphics[width=\linewidth]{figs/diff_hidden_units_eps1_delta-6.pdf}
    % \caption{}
\end{subfigure}
    \hfill%
\begin{subfigure}[t]{0.32\linewidth}
    \includegraphics[width=\linewidth]{figs/diff_batch_size_eps1_delta-6.pdf}
    % \caption{}
\end{subfigure}
\caption{\em  (a) Comparison of $\sigma$ for NSEG and DP-SGDA (Linear setting) on three datasets with different $\epsilon$ and $\delta$=1e-6. (b) Comparison of AUC performance for SGDA and DP-SGDA in MLP settings on three datasets with different numbers of hidden units, $\epsilon$=1 and $\delta$=1e-6. (c) Comparison of AUC performance for DP-SGDA (Linear and MLP settings) on three datasets with different batch sizes, $\epsilon$=1 and $\delta$=1e-6.}
% \vspace{-12pt}
\label{fig:sigma}
\end{figure*}

\textbf{Privacy Budget Settings.} In the experiments, we set up five privacy levels from small to large: $\epsilon\in\{0.1, 0.5, 1, 5, 10\}$. We also consider three different values of $\delta$ from $\{\mathrm{1e\!-\!4}, \mathrm{1e\!-\!5}, \mathrm{1e\!-\!6}\}$. Due to space limitations, we only report the performance when $\delta=\mathrm{1e\!-\!6}$. More results can be found in Appendix \ref{sec:additional-exp}. To estimate the Lipschitz constants $G_\wbf$ and $G_\vbf$ (in Theorem \ref{thm:moments-accountant-privacy}), we first run the algorithms without adding noise. Then we calculate the maximum gradient norms of the AUC loss w.r.t.\ $\wbf$ and $\vbf$ and assign them as $G_\wbf$ and $G_\vbf$, respectively. Based on these parameters, we calculate the noise parameter $\sigma$ using autodp\footnote{\url{https://github.com/yuxiangw/autodp}}, which is widely used in existing works \citep{wang2019subsampled}.

\textbf{Compared Algorithms.} \citet{boob2021optimal} is the only existing paper that considers differential privacy in the convex-concave minimax problem. Therefore, we use their single-loop NSEG algorithm as our baseline method on the AUC optimization under the linear setting.
% on the other hand, we provide the first algorithm and theoretical guarantee for the differential privacy in the nonconvex-strongly-concave minimax problem. There are no existing methods that we can compare on the AUC optimization under the non-linear setting. 
% \vspace{-6pt}
\subsection{Results}\label{experiments:results}
% \vspace{-6pt}
We first report the trade-off between utility and privacy for DP-SGDA. We then follow the experimental design of \citet{abadi2016deep} to study the effect of two parameters: the number of hidden units and the mini-batch size.

\textbf{General AUC Performance vs Privacy.} The general performance of all algorithms under linear and MLP settings of AUC optimization is shown in Table \ref{tab:general_performance_partial}. Since the standard deviation of the AUC performance is within $[0, 0.1\%]$ and the differences between algorithms are very small, we only report the average AUC performance. First, without adding noise to the gradients, the NSEG method and our DP-SGDA method have similar performance in the linear case. 
% However, it should be mentioned that our DP-SGDA is more efficient than NSEG during the training because it uses extra gradient method. 
Furthermore, DP-SGDA with the MLP model outperforms the linear models on all datasets. This is because non-linear models have greater expressive power and can therefore learn more information from the features than linear models. Second, after adding noise to the gradients, the AUC performance of all models decreases on all datasets. However, as the privacy budget $\epsilon$ increases, the AUC performance increases. The reason is that $\epsilon$ and $\sigma$ have opposite trends according to equation \eqref{eq:sigma-sigma}. The relation between $\epsilon$ and the AUC score is also consistent with Theorem \ref{thm:sgda-utility} and Theorem \ref{thm:utility-nonconvex}.
% In addition, as the $\epsilon$ increases (from $\epsilon=5$ to $\epsilon=10$), we can find the AUC performance does not change too much. This is because the $\sigma$ value is very close in these settings.
Third, to verify our statement in Remark \ref{rem:different-sensitivity},
% as we discussed in Remark \ref{rem:different-sensitivity}, the NSEG method uses  the same $\sigma$ for the primal and dual gradients. Their noise should be larger than ours because we use different Lipschitz constants for the primal ($\wbf$) and dual ($\vbf$) gradients. 
% To verify this phenomenon, 
we compare the $\sigma$ values from NSEG and DP-SGDA on all datasets in Figure \ref{fig:sigma}(a). From the figure, it is clear that the $\sigma$ from NSEG is larger than ours in all $\epsilon$ settings, since it is calibrated based on the gradients' sensitivity with respect to both $\wbf$ and $\vbf$. In fact, the sensitivity w.r.t.\ $\vbf$ is small, as it is a one-dimensional variable for AUC maximization. Therefore, NSEG overestimates the noise that needs to be added to $\vbf$. From Table \ref{tab:general_performance_partial}, we observe that our DP-SGDA achieves a better AUC score than NSEG under the same privacy budget.

\textbf{Different Hidden Units.} 
In DP-SGDA under the MLP setting, the number of hidden units is one of the most important factors affecting the model performance. Therefore, we compare the AUC performance with respect to different numbers of hidden units in Figure \ref{fig:sigma}(b). With a small number of hidden units, the model suffers from poor generalization capability. Using a large number of hidden units makes it easier for the model to fit the training set. For SGDA (non-private) training, it is often helpful to use a large number of hidden units, as long as the model does not overfit. In agreement with this intuition, we find that the model performance improves with increasing hidden units in Figure \ref{fig:sigma}(b). However, for DP-SGDA training, more hidden units increase the sensitivity of the gradients, which leads to more noise being added at each update. Therefore, in contrast to the non-private setting, we find that the AUC performance decreases when the number of hidden units increases.   

\textbf{Different Mini-Batch Size.} 
From Theorem \ref{thm:moments-accountant-privacy} and Theorem \ref{thm:sgda-primal-opt}, we find mini-batch size can influence the Gaussian noise variances $\sigma_\wbf^2$ and $\sigma_\vbf^2$ as well as the convergence rate. Selecting the mini-batch size must balance two conflicting objectives.  On one hand, a small mini-batch size may lead to sub-optimal performance. On the other hand, for large batch sizes, the added noise has a smaller relative effect. Therefore, we show the AUC score for DP-SGDA with different mini-batch sizes in Figure \ref{fig:sigma}(c). The experimental results show that the mini-batch size has a relatively large impact on the AUC performance when the mini-batch size is small. 

% \textbf{Different Projection Settings.} 


% \vspace{-6pt}
\section{Conclusion}\label{sec: conclusion}
% \vspace{-6pt}
In this paper, we have used algorithmic stability to conduct utility analysis of the DP-SGDA algorithm for minimax problems under DP constraints.  For the convex-concave setting, we proved that   DP-SGDA can attain an optimal rate $\mathcal{O}(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n \epsilon})$ in terms of the weak primal-dual population risk while providing $(\epsilon, \delta)$-DP for both smooth and nonsmooth cases. For the nonconvex-strongly-concave case, assuming that the empirical risk satisfies the PL condition we proved the excess primal population risk  of DP-SGDA can achieve a utility bound  $\Ocal\bigl(\frac{1}{n^{1/3}} + \frac{\sqrt{d\log(1/\delta)}}{n^{5/6}\epsilon}\bigr)$.  Experiments on three benchmark datasets  illustrate the effectiveness of DP-SGDA. 


For future work, it would be interesting to improve the utility bound for the nonconvex-strongly-concave setting. It also remains unclear to us how to establish the utility bound for DP-SGDA when gradient clipping techniques are enforced at each iteration. Finally, it would also be interesting to evaluate the performance of DP-SGDA on other motivating examples such as GAN, MDP and robust optimization.

\begin{acknowledgements}
The work is supported by SUNY-IBM AI Alliance Research and NSF grants  (IIS-1816227, IIS-2008532, IIS-2103450, IIS-2110546 and DMS-2110836).  The authors would also like to thank Dr. Guzm\'{a}n and Dr. Boob for helpful discussions on differential privacy for minimax problems and for pointing out a gap in the proof of Lemma \ref{lem:sgda-opt-gap} in the Appendix  in an earlier  version of the paper. 
\end{acknowledgements}

\bibliography{yang_122}

\end{document}

