\section{Experimental Evaluation}
\label{sec:exp}

\begin{figure*}[t!]
\centering
\includegraphics[width=\linewidth]{figs/cdf_figs.pdf}
\caption{$\mathbb{P}\left[\lVert \hat{\bw}_n - \bw^* \rVert > \beta\right]$ and $\mathbb{E}\left[\lVert \hat{\bw}_n - \bw^* \rVert\right]$ as the dataset size $n$ increases for different algorithms when $\varepsilon=1.0, 0.3, 0.1$. For all pairs $(\varepsilon, \beta)$ except the two most extreme cases $(0.3, 0.1)$ and $(0.1, 0.1)$, \methodtwoshort{} shows the asymptotic tendency $\lim_{n\to\infty}\mathbb{P}\left[\lVert \hat{\bw}_n^{\methodtwoupper{}} - \bw^* \rVert > \beta\right]=0$. \methodoneshort{} does not show such a tendency even when the training set size $n$ is as large as $3\times 10^6$.}
\label{fig:cvg_dis}
\end{figure*}

\begin{figure*}[t!]
\centering
\includegraphics[width=0.9\linewidth]{figs/eigenvalues.pdf}
\caption{Scatter plots of the $\ell_2$ distance versus the minimum absolute eigenvalue of the Hessian matrix. The left figure is for the synthetic dataset when $n=10^6$ and $\varepsilon=1.0$. The right figure is for the \textit{Insurance} dataset when $\varepsilon=1.0$. Each point corresponds to a different random seed for \methodoneshort{} and \methodtwoshort{}. Both figures show that the Hessian matrix in \methodoneshort{} is more likely to have small eigenvalues, which in turn lead to a large distance $\lVert \hat{\bw}_n - \bw^* \rVert_2$.}
\label{fig:eigenvalue}
\end{figure*}

In this section, we evaluate \methodoneshort{} and \methodtwoshort{} on both synthetic and real-world datasets.
Our experiments on the synthetic dataset are designed to verify the theoretical asymptotic results in \autoref{sec:method} by increasing the training set size $n$.
We further assess the algorithms' performance on five real-world datasets, four from the UCI Machine Learning Repository\footnote{\url{https://archive-beta.ics.uci.edu/ml/datasets}}~\citep{Dua:2019} and one from Kaggle.

\subsection{Experiment Set-up}
\paragraph{Algorithm set-up.} We evaluate both \methodoneshort{} and \methodtwoshort{}. For $k$ in \methodtwoshort{},
we set $k = \frac{\sqrt{n}}{\sigma_{\varepsilon, \delta}}$ in the synthetic dataset experiments and select the best $k$ from $\{10^2, 3\times 10^2, 10^3, 3\times 10^3, 10^4\}$ in the real-world dataset experiments. Because computing the Hessian inverse is numerically unstable, as mentioned earlier, we add a small term $\lambda \cdot I$ with $\lambda=10^{-5}$ to all Hessian matrices.
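
As a quick illustration of the scale implied by the rule for $k$ (a sketch only; whether $k$ is rounded to an integer is not specified here), the product $k\,\sigma_{\varepsilon, \delta}=\sqrt{n}$ takes the values
$$
    \sqrt{10^4}=100, \qquad \sqrt{10^6}=1000, \qquad \sqrt{3\times 10^6}\approx 1732
$$
for the smallest, an intermediate, and the largest training set size used in the synthetic experiments. For the regularization, if a symmetric Hessian matrix $\hat{H}_n$ has eigenvalues $\mu_1, \dots, \mu_d$ (notation introduced here for illustration), then
$$
    \hat{H}_n + \lambda I \quad \text{has eigenvalues} \quad \mu_1+\lambda, \dots, \mu_d+\lambda .
$$
Hence, when $\hat{H}_n$ is positive semi-definite, the smallest eigenvalue after the shift is at least $\lambda=10^{-5}$; when some $\mu_i$ is negative, the shift alone does not bound $|\mu_i+\lambda|$ away from zero.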

\paragraph{Baseline.} In addition, we consider the following baselines to help quantify the performance of the proposed algorithms.
\begin{itemize}[leftmargin=*,nosep]
	\item \textit{OLS}: The closed-form solution of linear regression on the training data $(X, Y)$. As the non-private solution, it serves as an upper bound on the performance of the private algorithms.
	\item \textit{\methodthreelong (\methodthreeshort)}: It uses the same data release algorithm as \methodoneshort{} but a different training algorithm. Given a dataset $(X^{\methodthreeupper}, Y^{\methodthreeupper})$ released by the Gaussian mechanism, \methodthreeshort{} outputs the vanilla ordinary least squares solution $\hat{\bw}_n^{\methodthreeupper{}}=\left(\left(X^{\methodthreeupper}\right)^\top X^{\methodthreeupper}\right)^{-1}\left(X^{\methodthreeupper}\right)^\top Y^{\methodthreeupper}$. In other words, it is \methodoneshort{} without debiasing during training.
\end{itemize}

\paragraph{Evaluation metric.} 
In the experiments on synthetic datasets, we estimate the probability that the $\ell_2$ distance between the model weights $\hat{\bw}_n$ produced by each algorithm or baseline and the ground truth model weights $\bw^*$ exceeds a threshold $\beta$:
$$
    \bbP\left( \left\lVert \hat{\bw}_n - \bw^* \right\rVert > \beta \right).
$$
We also evaluate the expectation of the $\ell_2$ distance between the weights for different algorithms:
$$
\bbE\left\lVert \hat{\bw}_n - \bw^* \right\rVert.
$$
If an algorithm is asymptotically optimal, we expect both $\bbP\left( \left\lVert \hat{\bw}_n - \bw^* \right\rVert > \beta \right)$ and $\bbE\left\lVert \hat{\bw}_n - \bw^* \right\rVert$ to converge to $0$ as $n$ increases.
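
In practice, both quantities can be approximated by Monte Carlo averages over independent runs. As a sketch, assume $R$ runs with different random seeds that produce weights $\hat{\bw}_n^{(1)}, \dots, \hat{\bw}_n^{(R)}$ (the symbols $R$, $\hat{\bw}_n^{(r)}$, $\hat{p}_\beta$ and $\hat{e}$ are notation introduced here for illustration). Natural estimators are then
$$
    \hat{p}_\beta = \frac{1}{R}\sum_{r=1}^{R} \mathbf{1}\left\{ \left\lVert \hat{\bw}_n^{(r)} - \bw^* \right\rVert > \beta \right\},
    \qquad
    \hat{e} = \frac{1}{R}\sum_{r=1}^{R} \left\lVert \hat{\bw}_n^{(r)} - \bw^* \right\rVert .
$$
Note that $\hat{p}_\beta$ only takes values that are multiples of $1/R$, so probabilities far below $1/R$ cannot be resolved by such an estimate.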

For the experiments on real-world datasets, we evaluate the learned models $\hat{\bw}_n$ by the mean squared loss on the test set.
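
For concreteness, writing the test set as $\{(\bx_i, y_i)\}_{i=1}^{n_{\rm test}}$ (notation introduced here for illustration), the mean squared loss of a linear model is commonly computed as
$$
    \frac{1}{n_{\rm test}}\sum_{i=1}^{n_{\rm test}} \left( \hat{\bw}_n^\top \bx_i - y_i \right)^2 .
$$
This is the standard definition and is given as an assumption about the exact form used; in particular, whether an intercept term or a constant factor such as $1/2$ is included is not specified in this section.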

\subsection{Evaluation on Synthetic Datasets}
\paragraph{Data generation.} 
We set the feature dimension to $d=10$.
Each weight of the ground truth linear model $\bw^*$ is independently sampled from the uniform distribution between $-1/d$ and $1/d$.
A single data point $(\bx, y)$ is sampled as follows: each feature value in $\bx$ is independently sampled from the uniform distribution between $-1$ and $1$, and the label $y$ is computed as $\left(\bw^*\right)^\top\bx$. The two assumptions on the data distribution $\calP$, \autoref{ass:bound} and \autoref{ass:non-singular}, can be verified to hold.
Moreover, we use $6$ parties in total: $5$ of them hold $2$ attributes each and the remaining one holds $1$ attribute.
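
The verification of the two assumptions can be sketched as follows, supposing that \autoref{ass:bound} asks for bounded features and labels and that \autoref{ass:non-singular} asks for a non-singular second-moment matrix (this reading is inferred from the assumption names; the exact statements are those given where the assumptions are introduced). Writing $x_j$ and $w^*_j$ for the $j$-th entries of $\bx$ and $\bw^*$, we have $|x_j|\le 1$ and $|w^*_j|\le 1/d$, so
$$
    \lVert \bx \rVert_2 \le \sqrt{d},
    \qquad
    |y| = \left|\left(\bw^*\right)^\top \bx\right| \le \sum_{j=1}^{d} |w^*_j|\,|x_j| \le d\cdot\frac{1}{d} = 1 .
$$
Moreover, the features are independent with zero mean and variance $\frac{(1-(-1))^2}{12}=\frac{1}{3}$, which gives
$$
    \bbE_{\bx}\left[ \bx\bx^\top \right] = \frac{1}{3}\, I ,
$$
a matrix whose eigenvalues all equal $1/3 > 0$.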

\paragraph{Results.} 
We vary the training set size $n\in\{10^4, 3\times10^4, 10^5, 3\times 10^5,10^6,3\times10^6\}$ and privacy budget $\varepsilon\in\{1, 0.3, 0.1\}$ with fixed $\delta=10^{-5}$.
We estimate $\mathbb{P}\left[\lVert \hat{\bw}_n - \bw^* \rVert > \beta\right]$ and $\mathbb{E}\lVert \hat{\bw}_n - \bw^* \rVert$ for the different algorithms over $1000$ random seeds.
\autoref{fig:cvg_dis} shows how $\mathbb{P}\left[\lVert \hat{\bw}_n - \bw^* \rVert > \beta\right]$ and $\mathbb{E}\lVert \hat{\bw}_n - \bw^* \rVert$ of each algorithm change as the training set size $n$ increases.

Regarding the two baselines, the OLS solution, which is not subject to any privacy constraint, stays close to the ground truth $\bw^*$: its $\mathbb{P}\left[\lVert \hat{\bw}_n - \bw^* \rVert > \beta\right]$ is $0$ for all $\beta$.
In contrast, $\mathbb{P}\left[\lVert \hat{\bw}_n - \bw^* \rVert > \beta\right]$ of \methodthreeshort{} remains mostly unchanged as $n$ increases.
In particular, $\mathbb{P}\left[\lVert \hat{\bw}_n - \bw^* \rVert > 0.1\right]$ stays at $1$ for all $n$.
Such results are expected from the limiting behavior of \methodthreeshort{}: $\plim_{n\to\infty} \frac{1}{n}\left(X^{\methodthreeupper}\right)^\top X^{\methodthreeupper} = \bbE_{\bx}\left[ \bx\bx^\top \right] + 4d_{\rm max}\sigma_{\varepsilon, \delta}^2\cdot I$, which contains a non-diminishing bias term $4d_{\rm max}\sigma_{\varepsilon, \delta}^2\cdot I$.
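
To get a feeling for the size of this bias on the synthetic data, here is a sketch under one extra assumption that is not stated above: that the cross term also converges, with $\plim_{n\to\infty} \frac{1}{n}\left(X^{\methodthreeupper}\right)^\top Y^{\methodthreeupper} = \bbE_{\bx}\left[ \bx\bx^\top \right]\bw^*$ (for instance, because the added noise is zero-mean and independent of the data). Writing $c = 4d_{\rm max}\sigma_{\varepsilon, \delta}^2$ as a shorthand introduced here, the limit of the \methodthreeshort{} solution would be
$$
    \plim_{n\to\infty} \hat{\bw}_n^{\methodthreeupper{}} = \left( \bbE_{\bx}\left[ \bx\bx^\top \right] + c\, I \right)^{-1} \bbE_{\bx}\left[ \bx\bx^\top \right] \bw^* .
$$
For independent features that are uniform on $[-1, 1]$, each feature has zero mean and variance $1/3$, so $\bbE_{\bx}\left[ \bx\bx^\top \right] = \frac{1}{3} I$ and
$$
    \plim_{n\to\infty} \hat{\bw}_n^{\methodthreeupper{}} = \frac{1}{1+3c}\, \bw^*,
    \qquad
    \left\lVert \plim_{n\to\infty} \hat{\bw}_n^{\methodthreeupper{}} - \bw^* \right\rVert = \frac{3c}{1+3c}\, \lVert \bw^* \rVert .
$$
Under these assumptions the limiting solution is a shrunken copy of $\bw^*$, and the gap depends only on $c$, so it does not decrease with $n$ for fixed $\varepsilon$ and $\delta$.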

Next, we compare \methodoneshort{} and \methodtwoshort{}. \methodtwoshort{} outperforms \methodoneshort{} in both the convergence of the probability $\mathbb{P}\left[\lVert \hat{\bw}_n - \bw^* \rVert > \beta\right]$ (the first three panels of \autoref{fig:cvg_dis}) and the expected distance $\mathbb{E}\left[\lVert \hat{\bw}_n - \bw^* \rVert\right]$ (the last panel of \autoref{fig:cvg_dis}).
\methodtwoshort{} shows the asymptotic tendency for all values of $\beta$ when $\varepsilon=1.0$.
Although \methodoneshort{} theoretically has a better rate in $n$ than \methodtwoshort{}, $n=3\times 10^6$ is not large enough for \methodoneshort{} to show the asymptotic tendency.

\begin{table*}[t!]
\centering
\resizebox{\linewidth}{!}{
\begin{tabular}{c|c|c|ccc|ccc|ccc}
\toprule
\multirow{3}{*}{\textbf{Dataset}} & \multirow{3}{*}{\textbf{Statistics}} &\multicolumn{10}{c}{\textbf{Method}} \\
\cmidrule{3-12}
&   &   \multirow{2}{*}{OLS} & \multicolumn{3}{c|}{$\varepsilon=1.0$} & \multicolumn{3}{c|}{$\varepsilon=0.3$} & \multicolumn{3}{c}{$\varepsilon=0.1$} \\
 &  &  &  \methodonebrev{}     &   \methodtwobrev{}    &  \methodthreebrev{}    &  \methodonebrev{}     &   \methodtwobrev{}    &  \methodthreebrev{}   &   \methodonebrev{}     &   \methodtwobrev{}    &  \methodthreebrev{}    \\
 \midrule
Insurance & $n=1070, ~d=9, ~m=5$ & $0.008$ & $0.7015$ & $\mathbf{0.0791}$ & $0.0805$ & $0.7550$ & $\mathbf{0.0782}$ & $0.0850$ & $0.7263$ & $\mathbf{0.0793}$ & $0.0832$\\
Bike & $n=13903, ~d=13, ~m=5$ & $0.017$ & $0.8105$ & $\mathbf{0.0581}$ & $0.0691$ & $0.9080$ & $0.0711$ & $\mathbf{0.0703}$ & $0.8792$ & $\mathbf{0.0700}$ & $0.0707$\\
Superconductor & $n=17010, ~d=81, ~m=10$  & $0.009$ & $0.9794$ & $\mathbf{0.0659}$ & $0.0670$ & $1.0075$ & $0.0707$ & $\mathbf{0.0704}$ & $0.9220$ & $\mathbf{0.0699}$ & $0.0704$ \\
GPU & $n=193280, ~d=14, ~m=5$ & $0.007$ & $0.6953$ & $\mathbf{0.0137}$ & $0.0158$ & $0.7843$ & $\mathbf{0.0160}$ & $\mathbf{0.0160}$ & $0.7822$ & $0.0165$ & $\mathbf{0.0160}$ \\
Music Song & $n=412276, ~d=90, ~m=10$  & $0.011$ & $1.0167$ & $\mathbf{0.0202}$ & $0.7194$ & $1.6462$ & $\mathbf{0.1039}$ & $0.7479$ & $1.5583$ & $\mathbf{0.5654}$ & $0.7508$\\
 \bottomrule
\end{tabular}
}
\caption{Mean squared losses on real-world datasets. \methodtwoshort{} achieves the lowest loss in most settings (12 out of 15, counting one tie).}
\label{tab:real_world_result}
\end{table*}

\methodoneshort{} performs even much worse than \methodthreeshort{}; its output is almost a random guess.
This is caused by the small eigenvalue issue discussed in \autoref{sec:method}.
To illustrate it, \autoref{fig:eigenvalue}~(a) shows a scatter plot, where the $x$-axis is the minimum absolute eigenvalue of the Hessian matrix $\hat{H}_n$ and the $y$-axis is the distance $\lVert \hat{\bw}_n - \bw^* \rVert$ between the learned solution and the optimal solution.
Each point corresponds to a different random seed for \methodoneshort{} and \methodthreeshort{} when $n=10^6$ and $\varepsilon=1.0$.
The distance $\lVert \hat{\bw}_n - \bw^* \rVert$ and the minimum absolute eigenvalue of $\hat{H}_n$ are strongly negatively correlated: the smaller the eigenvalue, the larger the distance.
In a fraction of the runs, the minimum absolute eigenvalue of \methodoneshort{} is smaller than $10^{-2}$ and the corresponding $\lVert \hat{\bw}_n - \bw^* \rVert$ is larger than $10$.
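
The link between small eigenvalues and large errors can be sketched as follows, assuming that the estimator takes the form $\hat{\bw}_n = \hat{H}_n^{-1}\hat{\mathbf{g}}_n$ for some vector $\hat{\mathbf{g}}_n$ (a simplified form assumed here for illustration; the exact estimator is the one defined in \autoref{sec:method}). For a symmetric invertible $\hat{H}_n$ with eigenvalues $\mu_1, \dots, \mu_d$ and a unit eigenvector $\mathbf{v}_i$ associated with $\mu_i$ (notation introduced here),
$$
    \left\lVert \hat{H}_n^{-1} \right\rVert_2 = \frac{1}{\min_i |\mu_i|},
    \qquad
    \left\lVert \hat{H}_n^{-1} \mathbf{v}_i \right\rVert_2 = \frac{1}{|\mu_i|} .
$$
Any perturbation of $\hat{\mathbf{g}}_n$ along the eigenvector of the smallest absolute eigenvalue is therefore amplified by $1/\min_i|\mu_i|$, which exceeds $100$ as soon as $\min_i|\mu_i| < 10^{-2}$.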

Overall, \methodtwoshort{} has the best empirical performance across the various settings of $\varepsilon$ and $n$ on the synthetic data: its asymptotic optimality is verified, and it consistently outperforms the two other private algorithms when $n$ is large enough. Although \methodoneshort{} seems to have a stronger theoretical guarantee in terms of the rate in $n$, its poor empirical performance has two causes: (1) small eigenvalues occur due to the design of the training algorithm; (2) an extremely large $n$ is necessary to show the asymptotic optimality due to the worse rates in $d$, $d_{\rm max}$ and $\sigma_{\varepsilon, \delta}$.

\subsection{Evaluation on Real-World Datasets}
\paragraph{Dataset.}
We experiment with five datasets: 
\begin{itemize}[leftmargin=*,nosep]
    \item \textit{Insurance}~\citep{lantz2019machine}: predicting the insurance premium from features such as age, BMI, and expenses.
    \item \textit{Bike}~\citep{fanaee2014event}: predicting the count of rental bikes from features such as season, holiday, etc.
    \item \textit{Superconductor}~\citep{hamidieh2018data}: predicting critical temperature from chemical features.
    \item \textit{GPU}~\citep{ballester2019sobol, nugteren2015cltune}: predicting the running time of multiplying two $2048\times 2048$ matrices using a GPU OpenCL SGEMM kernel with varying parameters.
    \item \textit{Music Song}~\citep{Bertin-Mahieux2011}: predicting the release year of a song from audio features.
\end{itemize}
We split each original dataset into training and test sets with a ratio of $4:1$.
The number of training examples $n$, the number of features $d$ and the number of parties $m$ are listed in Table \ref{tab:real_world_result}. The attributes are evenly distributed among the parties. All features and labels are normalized into $[0, 1]$.
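
One common way to map values into $[0, 1]$ is min-max scaling; the specific normalization used is not stated here, so the following is an assumption given for illustration. For a feature or label column with values $v_1, \dots, v_N$, minimum $v_{\rm min}$ and maximum $v_{\rm max} > v_{\rm min}$ (notation introduced here),
$$
    v_i' = \frac{v_i - v_{\rm min}}{v_{\rm max} - v_{\rm min}} \in [0, 1] .
$$
With labels in $[0, 1]$, the constant prediction $1/2$ incurs a squared error of at most $1/4$ on every example, which gives a simple reference scale when reading mean squared losses.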

\paragraph{Results.}
For each dataset, we evaluate OLS and the three differentially private algorithms by the mean squared loss on the test split.
Table \ref{tab:real_world_result} shows the results for $\varepsilon\in\{0.1, 0.3, 1.0\}$ and $\delta=10^{-5}$.
The loss of \methodoneshort{} is usually much larger than that of the other methods, and \methodtwoshort{} achieves the lowest loss in most cases (12 out of 15, counting one tie).
Moreover, \autoref{fig:eigenvalue}~(b) shows that \methodoneshort{} suffers from the small eigenvalue problem in the real-world dataset experiments as well.
These results are consistent with the results on the synthetic dataset.
We therefore recommend \methodtwoshort{} as a practical solution for privately releasing a dataset and building linear regression models.
