\section{Methodology}
\label{sec:method}
We now describe our data release algorithms, which both satisfy differential privacy and yield asymptotically optimal solutions to the linear regression task.
We start with the first algorithm, \textit{\methodonelong{} (\methodoneshort{})}, which directly applies the Gaussian mechanism when releasing the data and then de-biases the Hessian matrix when training the model.
However, the de-biasing step may require inverting a matrix with small eigenvalues, which severely hurts the performance of the learned model.
We therefore propose a novel dataset release algorithm, \textit{\methodtwolong{} (\methodtwoshort{})}, that does not apply the Gaussian mechanism directly to the data. The model learned from the corresponding released public dataset is also guaranteed to be asymptotically optimal and, more importantly, avoids the problem of small eigenvalues.

\subsection{De-biased Gaussian Mechanism (\methodoneshort)}
\label{sec:dgm}
\methodonelong{} (\methodoneshort{}) consists of a dataset release algorithm and a corresponding training algorithm. \autoref{alg:dgm} gives an overview; we describe both parts next.

\begin{algorithm}[t]
\textbf{Dataset Release}
\begin{algorithmic}[1]
\State \textbf{Input:} $D=\left[D^1, \cdots, D^m\right], \varepsilon, \delta$.
\For{$j=1, \cdots, m$}
	\State Party $j$ computes $\left(D^{\methodoneupper}\right)^j:= D^j + R^j $, where $R^j\in \bbR^{n\times d_j}$ is a random Gaussian matrix whose entries are sampled i.i.d. from $\calN\left(0, 4d_{\rm max}\cdot \sigma_{\varepsilon, \delta}^2\right)$.
\EndFor
\State \textbf{Return:} $D^{\methodoneupper}:=\left[\left(D^{\methodoneupper}\right)^1, \cdots, \left(D^{\methodoneupper}\right)^m\right]$.
\end{algorithmic}
\textbf{Training Algorithm}
\begin{algorithmic}[1]
\State \textbf{Input:} $D^{\methodoneupper}, \varepsilon, \delta$
\State $[X^{\methodoneupper}, Y^{\methodoneupper}] = D^{\methodoneupper}$
\State Compute the de-biased Hessian matrix $\hat{H}_n^{\methodoneupper} := \frac{1}{n}\left(X^{\methodoneupper}\right)^\top X^{\methodoneupper} - { 4d_{\rm max}\sigma_{\varepsilon, \delta}^2\cdot I}$
\State $\hat{\bw}_n^{\methodoneupper{}} := \left(\hat{H}_n^{\methodoneupper}\right)^{-1}\left(\frac{1}{n}\left(X^{\methodoneupper}\right)^\top Y^{\methodoneupper}\right)$.
\State \textbf{Return:} $\hat{\bw}_n^{\methodoneupper{}}$.
\end{algorithmic}

\caption{\methodoneshort{}}
\label{alg:dgm}
\end{algorithm}


\paragraph{Dataset release algorithm.} 
Each party directly applies the Gaussian mechanism to its own dataset $D^j$ ($j=1, \cdots, m$) to satisfy differential privacy.
Consider two neighboring data matrices $D^j$ and $\left(D^j\right)'$ that differ in exactly one row, with row index $i$.
By \autoref{ass:bound}, the sensitivity of the data matrix $D^j$ is bounded as
$$
\left\lVert D^j - \left(D^j\right)' \right\rVert = \left\lVert D^j_i - \left(D^j_i\right)' \right\rVert  \leq 2\sqrt{d_j} \leq 2\sqrt{d_{\rm max}}.
$$ 
Each party then independently adds a Gaussian noise matrix $R^j$ to $D^j$.
The entries of $R^j$ are sampled i.i.d. from the Gaussian distribution $\mathcal{N} (0, 4d_{\rm max}\sigma_{\varepsilon, \delta}^2)$.

The dataset release algorithm meets the privacy constraints in \autoref{sec:problem_setting}. No random matrix $B$ is shared among different parties. \autoref{lem:gm} guarantees that $\left(D^{\methodoneupper{}}\right)^j$ is $(\varepsilon, \delta)$-differentially private w.r.t. $D^j$ for any $0<\varepsilon\leq 1, \delta>0$.

\paragraph{Training algorithm.} Given the dataset released through the above algorithm, an asymptotically optimal linear regression solution exists. Denote the feature matrix and the label vector of the private joint dataset by $[X, Y]=D$, and those of the public joint dataset by $[X^{\methodoneupper}, Y^{\methodoneupper}] = D^{\methodoneupper}$.
Further define $R:=\left[R^1, \cdots, R^m\right]\in \bbR^{n\times (d+1)}$ and split $R$ into $R_X$ and $R_Y$ representing the additive noise to $X$ and $Y$ respectively.

Consider the ordinary least squares solution for the public data $X^{\methodoneupper}$ and $Y^{\methodoneupper}$, whose explicit form is:
\begin{equation}
\label{eq:ols_gm_pub}
    \left(\left(X^{\methodoneupper}\right)^\top X^{\methodoneupper}\right)^{-1}\left(X^{\methodoneupper}\right)^\top Y^{\methodoneupper}.
\end{equation}
We compare this with our target solution $
\bw^*=\left(\bbE_{(\bx, y)\sim \calP}\left[\bx\bx^\top\right]\right)^{-1}\bbE_{(\bx, y)\sim \calP}\left[\bx\cdot y\right]
$. We can prove that $\plim_{n\to\infty} \frac{1}{n}\left(X^{\methodoneupper}\right)^\top Y^{\methodoneupper} = \mathbb{E}_{(\bx, y)\sim \calP}\left[\bx\cdot y\right]$ by the concentration of bounded random variables and of the multivariate normal distribution. Nevertheless, there is a gap between $\plim_{n\to\infty} \frac{1}{n}\left(X^{\methodoneupper}\right)^\top X^{\methodoneupper}$ and $\mathbb{E}_{(\bx, y)\sim \calP}\left[\bx\bx^\top\right]$:
\begin{align*}
&\plim_{n\to\infty} \frac{1}{n}\left(X^{\methodoneupper}\right)^\top X^{\methodoneupper}\\
 &= \plim_{n\to\infty}\frac{1}{n}\left(X^\top X+ X^\top R_X + R_X^\top X + R_X^\top R_X \right)\\
&= \mathbb{E}_{(\bx, y)\sim \calP}\left[\bx\bx^\top\right] + { 4d_{\rm max} \sigma_{\varepsilon, \delta}^2\cdot I},
\end{align*}
where the last equality again holds by the concentration of bounded random variables and of the multivariate normal distribution.
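As a brief sketch of where the extra term comes from, assume (as in the release algorithm) that $R_X$ is independent of $X$ and has i.i.d. zero-mean entries with variance $4d_{\rm max}\sigma_{\varepsilon, \delta}^2$. Writing $(\cdot)_{ab}$ for the $(a,b)$ entry of a matrix, a notation introduced only for this illustration, the noise terms satisfy
\begin{align*}
\bbE\left[\tfrac{1}{n}\left(X^\top R_X\right)_{ab}\right] &= \frac{1}{n}\sum_{i=1}^{n} \bbE\left[X_{ia}\right]\bbE\left[(R_X)_{ib}\right] = 0,\\
\bbE\left[\tfrac{1}{n}\left(R_X^\top R_X\right)_{ab}\right] &= \frac{1}{n}\sum_{i=1}^{n} \bbE\left[(R_X)_{ia}(R_X)_{ib}\right]\\
&= \begin{cases} 4d_{\rm max}\sigma_{\varepsilon, \delta}^2, & a=b,\\ 0, & a\neq b. \end{cases}
\end{align*}
The cross terms therefore vanish in expectation, while $\frac{1}{n}R_X^\top R_X$ has expectation $4d_{\rm max}\sigma_{\varepsilon, \delta}^2\cdot I$; the concentration arguments mentioned above turn these expectations into limits in probability.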
To remove the bias ${ 4d_{\rm max} \sigma_{\varepsilon, \delta}^2\cdot I}$, we revise the solution in \autoref{eq:ols_gm_pub} to $\hat{\bw}^{\methodoneupper}_n$, defined as\newline
\resizebox{\linewidth}{!}{
 \begin{minipage}{\linewidth}
\begin{align*}
\label{eq:debias_ols_gm_pub}
    	 \left(\frac{1}{n}\left(X^{\methodoneupper}\right)^\top X^{\methodoneupper}- { 4d_{\rm max}\sigma_{\varepsilon, \delta}^2\cdot I}\right)^{-1} \left(\frac{1}{n}\left(X^{\methodoneupper}\right)^\top Y^{\methodoneupper}\right).
\end{align*}
\end{minipage}
}

The first factor estimates the inverse of the Hessian matrix $\mathbb{E}_{(\bx, y)\sim \calP}\left[\bx\bx^\top\right]$; we denote it by $(\hat{H}_n^{\methodoneupper})^{-1}$. The asymptotic optimality of the solution $ \hat{\bw}^{\methodoneupper}_n$ is implied by the theorem below, whose proof is in the Appendix.
\begin{theorem}
\label{thm:dgm_utility}
	When $\beta\leq c$ for some constant $c$ that depends on $\sigma_{\varepsilon, \delta}$, $d$, and $\calP$, but is independent of $n$, we have
	$$
	\bbP\left[ \lVert \hat{\bw}^{\methodoneupper}_n - \bw^* \rVert > \beta \right] < \exp\left(-\tilde{O}\left( \beta^2 \frac{n}{\sigma_{\varepsilon, \delta}^4d^2d_{\rm max}^2} \right)\right).
	$$
\end{theorem}

\paragraph{Problem of small eigenvalues.}
The expectation of $\hat{H}^{\methodoneupper}_n$ is a positive definite matrix under \autoref{ass:non-singular}, but a realization of $\hat{H}^{\methodoneupper}_n$ itself is not guaranteed to be positive definite.
With some probability, it has small eigenvalues, which can make its inverse blow up.
In our experiments (\autoref{sec:exp}), we find that $\hat{H}^{\methodoneupper}_n$ suffers from small eigenvalues even when $n$ is as large as $10^6$. As a result, the model utility is much worse than the theoretical guarantee suggests. This motivates us to design the second algorithm.
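
To make this failure mode concrete, the following standard linear-algebra identities (stated as an illustration, not as a result of this paper) relate the spectrum of the de-biased matrix to that of the noisy Gram matrix. Here $\lambda_{\min}(\cdot)$ and $\lambda_i(\cdot)$ denote the smallest and the $i$-th eigenvalue of a symmetric matrix, and $\lVert\cdot\rVert_2$ the spectral norm; this notation is introduced only for this remark.
\begin{align*}
\lambda_{\min}\left(\hat{H}_n^{\methodoneupper}\right) &= \lambda_{\min}\left(\tfrac{1}{n}\left(X^{\methodoneupper}\right)^\top X^{\methodoneupper}\right) - 4d_{\rm max}\sigma_{\varepsilon, \delta}^2,\\
\left\lVert \left(\hat{H}_n^{\methodoneupper}\right)^{-1} \right\rVert_2 &= \frac{1}{\min_i \left\lvert \lambda_i\left(\hat{H}_n^{\methodoneupper}\right) \right\rvert}.
\end{align*}
Subtracting $4d_{\rm max}\sigma_{\varepsilon, \delta}^2\cdot I$ shifts every eigenvalue down by the same amount, so a finite-sample fluctuation of the noisy Gram matrix can leave an eigenvalue close to zero or even negative, and the norm of the inverse grows accordingly.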

\subsection{Random Mixing prior to Gaussian Mechanism (\methodtwoshort)} 
\label{sec:rmgm}
\begin{algorithm}[t]
\textbf{Dataset Release}
\begin{algorithmic}[1]
\State \textbf{Input:} $D=\left[D^1, \cdots, D^m\right], \varepsilon, \delta, k$.
\State The first party pre-generates a $k\times n$ random matrix $B$ whose entries are sampled \textit{i.i.d.} from the distribution that takes the value $1$ with probability $1/2$ and $-1$ with probability $1/2$. The first party then sends the sampled matrix $B$ to all parties.
\For{$j=1, \cdots, m$}
\State Party $j$ computes $\left(D^{\methodtwoupper}\right)^j:= BD^j/\sqrt{k} + R^j $, where $R^j$ is a $k\times d_j$ random matrix whose entries are sampled \textit{i.i.d.} from the normal distribution $ \calN\left(0, 4d_{\rm max}\sigma_{\varepsilon, \delta}^2\right)$.
\EndFor
\State \textbf{Return:} $D^{\methodtwoupper}:=\left[\left(D^{\methodtwoupper}\right)^1, \cdots, \left(D^{\methodtwoupper}\right)^m\right]$.
\end{algorithmic}
\textbf{Training Algorithm}
\begin{algorithmic}[1]
\State \textbf{Input:} $D^{\methodtwoupper{}}, \varepsilon, \delta$
\State $[X^{\methodtwoupper}, Y^{\methodtwoupper}]=D^{\methodtwoupper{}}$
\State Compute the ordinary least square solution\newline $\hat{\bw}_n^{\methodtwoupper} := \left(\left(X^{\methodtwoupper}\right)^\top X^{\methodtwoupper}\right)^{-1}\left(X^{\methodtwoupper}\right)^\top Y^{\methodtwoupper}.$
\State \textbf{Return:} $\hat{\bw}_n^{\methodtwoupper}$.
\end{algorithmic}

\caption{\methodtwoshort{}}
\label{alg:randproj}
\end{algorithm}

In the dataset release stage of the previous method, we add the Gaussian noise $R$ directly to the data. To guarantee DP, the norm of the noise has to be of the same order (in $n$) as the norm of the data matrix $D$: both $D$ and $R$ have norm $\Theta(\sqrt{n})$. Consequently, in the training stage the additive noise $R$ does not diminish relative to the data matrix $X$ as $n\to\infty$, and we have to subtract $4d_{\rm max}\sigma_{\varepsilon, \delta}^2\cdot I$ from $\frac{1}{n}\left(X^{\methodoneupper{}}\right)^\top X^{\methodoneupper{}}$ to remove its effect and obtain the optimal model weights. This subtraction is the problematic part that causes training instability (small eigenvalues in the Hessian matrix).

Instead, we can avoid this subtraction in the training stage by imposing relatively smaller noise in the data release stage. If the data release stage is designed
so that the additive noise is of smaller order in $n$ than $D$, the learner no longer needs the problematic de-biasing step in the training stage.

\autoref{alg:randproj} shows the full details of Random Mixing prior to Gaussian Mechanism for Ordinary Least Squares (\methodtwoshort{}). We now explain the design of the data release and training algorithms based on the above insights.

\paragraph{Dataset release algorithm.} 
Suppose $\bb$ is an $n$-dimensional vector in $\{-1, 1\}^{n}$. 
For any two neighboring datasets $D^j$ and $\left(D^j\right)'$ that differ at row index $i$, the sensitivity of $\bb^\top D^j$ is \newline
\resizebox{\linewidth}{!}{
 \begin{minipage}{\linewidth}
\begin{align*}
\left\lVert \bb^\top D^j - \bb^\top \left(D^j\right)'\right\rVert = \left\lVert D_i^j - \left(D^j_i\right)'\right\rVert \leq 2\sqrt{d_j}\leq 2\sqrt{d_{\rm max}}.
\end{align*}
\end{minipage}
}
Moreover, when $B\in\{-1, 1\}^{k\times n}$, $BD^j/\sqrt{k}$ has sensitivity $2\sqrt{d_{\rm max}}$ as well. 
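
The claim for $BD^j/\sqrt{k}$ can be checked directly. The short derivation below assumes that $\lVert\cdot\rVert$ denotes the Frobenius norm for matrices (our reading; the norm is not specified here), and writes $\Delta := D^j_i - \left(D^j_i\right)'$ for the only nonzero row of $D^j - \left(D^j\right)'$ and $B_{li}$ for the $(l,i)$ entry of $B$; both symbols are introduced only for this illustration. Row $l$ of $B\left(D^j - \left(D^j\right)'\right)$ equals $B_{li}\Delta$, so
\begin{align*}
\left\lVert \frac{B\left(D^j - \left(D^j\right)'\right)}{\sqrt{k}} \right\rVert^2
&= \frac{1}{k}\sum_{l=1}^{k} \left\lVert B_{li}\,\Delta \right\rVert^2
= \frac{1}{k}\sum_{l=1}^{k} B_{li}^2 \left\lVert \Delta \right\rVert^2\\
&= \left\lVert \Delta \right\rVert^2 \leq 4d_j \leq 4d_{\rm max},
\end{align*}
where the last equality uses $B_{li}^2=1$. Taking square roots recovers the sensitivity bound $2\sqrt{d_{\rm max}}$; the factor $1/\sqrt{k}$ is exactly what keeps the sensitivity from growing with the number of mixed rows $k$.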

We now introduce the data release algorithm. Suppose all parties share a random matrix $B\in\{-1, 1\}^{k\times n}$ whose entries are sampled \textit{i.i.d.} from the distribution that takes the value $1$ with probability $1/2$ and $-1$ with probability $1/2$. We then define the local computation for each party $j$:
$$\left(D^{\methodtwoupper}\right)^j:= BD^j/\sqrt{k} + R^j,$$
where $R^j$ is a $k\times d_j$ random matrix whose entries are sampled \textit{i.i.d.} from the normal distribution $ \calN\left(0, 4d_{\rm max}\sigma_{\varepsilon, \delta}^2\right)$. The Gaussian mechanism guarantees that, for any fixed $B\in\left\{1, -1\right\}^{k\times n}$, $\left(D^{\methodtwoupper}\right)^j$ is $(\varepsilon, \delta)$-differentially private \textit{w.r.t.} the dataset $D^j$ for $0<\varepsilon\leq 1, \delta>0$.

Importantly, the additive noise $R^j$ is now small relative to $BD^j/\sqrt{k}$.
The order of $\lVert R^j \rVert$ is $\Theta(\sqrt{k})$, while the order of
$\left\lVert BD^j/\sqrt{k} \right\rVert \approx \lVert D^j \rVert$ is $\Theta(\sqrt{n})$ (by the JL Lemma).
If we set $k = o(n)$, the additive noise compared to the original data matrix $D$ will diminish as $n\to \infty$.
This implies that the standard ordinary least squares solution on the public dataset $[X^{\methodtwoupper}, Y^{\methodtwoupper}]$ converges to the optimal solution $\bw^*$ without any de-biasing subtraction.

\paragraph{Training algorithm.}
Given the feature matrix $X^{\methodtwoupper}$ and the label vector $Y^{\methodtwoupper}$ from the released dataset, we show that the vanilla ordinary least squares solution
$$\hat{\bw}_n^{\methodtwoupper} := \left(\left(X^{\methodtwoupper}\right)^\top X^{\methodtwoupper}\right)^{-1}\left(X^{\methodtwoupper}\right)^\top Y^{\methodtwoupper}$$
is asymptotically optimal, i.e., $\plim_{n\to\infty}\hat{\bw}_n^{\methodtwoupper{}} = \bw^*$.

To prove the above asymptotic optimality, we show that $\plim_{n\to \infty}\frac{1}{n}\left(X^{\methodtwoupper}\right)^\top X^{\methodtwoupper} = \bbE_{(\bx, y)\sim\calP}\left[\bx\bx^\top\right]$ and $\plim_{n\to \infty}\frac{1}{n}\left(X^{\methodtwoupper}\right)^\top Y^{\methodtwoupper} = \bbE_{(\bx, y)\sim\calP}\left[\bx\cdot y\right]$; together, the two limits imply the optimality.

Define $R=\left[R^1, \cdots, R^m\right]\in \bbR^{k\times  (d+1)}$ and split $R$ into $R_X$ and $R_Y$, representing the additive noise on $BX/\sqrt{k}$ and $BY/\sqrt{k}$ respectively.
Because $\plim_{n\to\infty}\frac{1}{n}X^\top X=\bbE_{(\bx, y)\sim \calP}\left[\bx\bx^\top\right]$, it is sufficient to show $\plim_{n\to\infty}\left(\frac{1}{n}\left(X^{\methodtwoupper}\right)^\top X^{\methodtwoupper} - \frac{1}{n}X^\top X\right)=\mathbf{0}$. We decompose $\frac{1}{n}\left(X^{\methodtwoupper}\right)^\top X^{\methodtwoupper} - \frac{1}{n}X^\top X$ as follows:
\begin{align*}
    &\underbrace{\frac{1}{n}\left(X^\top \frac{B^\top B}{k}X - X^\top X\right)}_{\text{\autoref{lem:jl}}} + \underbrace{\frac{1}{n}\left(X^\top \frac{B^\top}{\sqrt{k}} R_X + R_X^\top \frac{B}{\sqrt{k}}X\right)}_{\text{cvg. of gauss. dist.}}\\
	&\quad + \underbrace{\frac{1}{n}R_X^\top R_X}_{\text{$k =o(n)$}}.
\end{align*}
We informally show how each term converges to $\mathbf{0}$ as $n\to\infty$:
\begin{enumerate}[leftmargin=*,nosep]
	\item $\frac{1}{n}\left(X^\top \frac{B^\top B}{k}X - X^\top X\right)$. If $k\to\infty$ as $n\to\infty$, the convergence is directly implied by \autoref{lem:jl}.
	\item $\frac{1}{n}\left(X^\top \frac{B^\top}{\sqrt{k}} R_X + R_X^\top \frac{B}{\sqrt{k}}X\right)$. Properties of the normal distribution guarantee the approximation $ \frac{B^\top}{\sqrt{k}} R_X\approx R_X'$, where $R_X'\in \bbR^{n\times d}$ is a Gaussian matrix whose entries are distributed as $\mathcal{N}  \left(0, 4d_{\rm max}\sigma_{\varepsilon, \delta}^2\right)$. Then $\left\lVert\frac{1}{n}X^\top \frac{B^\top}{\sqrt{k}} R_X\right\rVert \approx \lVert\frac{1}{n}X^\top R_X'\rVert = O\left(\frac{1}{\sqrt{n}}\right)$. 
	\item $\frac{1}{n}R_X^\top R_X$. If $k\to\infty$ as $n\to\infty$, $\frac{1}{k} R_X^\top R_X$ converges to $4d_{\rm max} \sigma_{\varepsilon, \delta}^2\cdot I$. Hence $\frac{1}{n}R_X^\top R_X = \frac{k}{n}\cdot\frac{1}{k}R_X^\top R_X \approx \frac{k}{n}\cdot 4d_{\rm max}\sigma_{\varepsilon, \delta}^2\cdot I$, which converges to $\mathbf{0}$ as $n\to \infty$ when $k = o(n)$.
\end{enumerate}
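As a sketch of the intuition behind the first term (not a substitute for \autoref{lem:jl}), recall that the entries of $B$ are independent with zero mean and unit variance, and assume $B$ is drawn independently of $X$. Writing $B_{la}$ for the $(l,a)$ entry of $B$, a notation introduced only for this illustration, we have for $a, b\in\{1,\cdots,n\}$
\begin{align*}
\bbE\left[\left(\frac{B^\top B}{k}\right)_{ab}\right] = \frac{1}{k}\sum_{l=1}^{k}\bbE\left[B_{la}B_{lb}\right] = \begin{cases} 1, & a=b,\\ 0, & a\neq b. \end{cases}
\end{align*}
Thus $\bbE\left[X^\top \frac{B^\top B}{k} X\right] = X^\top X$ for any fixed $X$, so the first term has zero mean, and a larger $k$ averages over more independent rows of $B$.
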
Note that the above convergence relies on a proper choice of $k$.
There is a trade-off: a larger $k$ leads to a better convergence rate for the first term, but a worse rate for the third term, which captures the diminishing of the additive noise. The following theorem states the asymptotic rate:

\begin{theorem}
\label{thm:rmgm_utility}
	When $\beta\leq c$ for some constant $c$ that depends on $d$ and $\calP$, but is independent of $\sigma_{\varepsilon, \delta}$ and $n$, we have \newline
	\resizebox{\linewidth}{!}{
  \begin{minipage}{\linewidth}
	\begin{align*}
	&\bbP\left[ \lVert \hat{\bw}^{\methodtwoupper{}}_n - \bw^* \rVert > \beta \right] < \\  
	& \exp\left(- O\left(\min\left\{ \frac{k\beta^2}{d^2}, \frac{n\beta}{kdd_{\rm max}\sigma_{\varepsilon, \delta}^2}, \frac{n^{1/2}\beta}{dd_{\rm max}^{1/2} \sigma_{\varepsilon, \delta}}\right\}\right) + \tilde{O}(1) \right).
	\end{align*}
	  \end{minipage}
}
	If we choose $k=O\left(\frac{n^{1/2}d^{1/2}}{d_{\rm max}^{1/2}\sigma_{\varepsilon, \delta}}\right)$, then
	\begin{align*}
	&\bbP\left[ \lVert \hat{\bw}^{\methodtwoupper{}}_n - \bw^* \rVert > \beta \right] < \\
	&\exp\left(- \frac{n^{1/2}\beta}{d^{3/2}d_{\rm max}^{1/2}\sigma_{\varepsilon, \delta}} \cdot O\left(\min\left\{ 1, \beta \right\}\right) + \tilde{O}(1) \right).
	\end{align*}
\end{theorem}
In the theorem, $k$ is selected to balance $ \frac{k\beta^2}{d^2}$ and $\frac{n\beta}{kdd_{\rm max}\sigma_{\varepsilon, \delta}^2}$. 
To achieve the optimal rate for $f(\beta)$ with any fixed $\beta$, the optimal $k$ is chosen as $O\left(\frac{n^{1/2}d^{1/2}}{d_{\rm max}^{1/2}\sigma_{\varepsilon, \delta}}\right)$.
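
For reference, the second bound in \autoref{thm:rmgm_utility} can be recovered by direct substitution. The computation below takes $k=\frac{n^{1/2}d^{1/2}}{d_{\rm max}^{1/2}\sigma_{\varepsilon, \delta}}$, that is, it sets the constant hidden in the $O(\cdot)$ to one purely for illustration. The three terms in the minimum become
\begin{align*}
\frac{k\beta^2}{d^2} &= \frac{n^{1/2}\beta^2}{d^{3/2}d_{\rm max}^{1/2}\sigma_{\varepsilon, \delta}},\\
\frac{n\beta}{kdd_{\rm max}\sigma_{\varepsilon, \delta}^2} &= \frac{n^{1/2}\beta}{d^{3/2}d_{\rm max}^{1/2}\sigma_{\varepsilon, \delta}},\\
\frac{n^{1/2}\beta}{dd_{\rm max}^{1/2} \sigma_{\varepsilon, \delta}} &= d^{1/2}\cdot\frac{n^{1/2}\beta}{d^{3/2}d_{\rm max}^{1/2}\sigma_{\varepsilon, \delta}}.
\end{align*}
Since $d\geq 1$, the third term is never smaller than the second, so the minimum equals $\frac{n^{1/2}\beta}{d^{3/2}d_{\rm max}^{1/2}\sigma_{\varepsilon, \delta}}\cdot\min\left\{1, \beta\right\}$, which matches the exponent in the theorem.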

\textbf{Comparison with \methodoneshort.}
The near-zero eigenvalue issue is solved since $\left(X^{\methodtwoupper}\right)^\top X^{\methodtwoupper}\succeq \mathbf{0}$ holds naturally by its definition.
Moreover, although the convergence rate in $n$ is sacrificed, the dependence on $d$, $d_{\rm max}$, and $\sigma_{\varepsilon, \delta}$ is much improved.
In \autoref{sec:exp} we show that \methodtwoshort{} outperforms \methodoneshort{} on both synthetic datasets even when $n$ is as large as $3\times 10^6$.
