%!TEX root=../aistats_main.tex
\section{Introduction}
\begin{figure}[t!]
\includegraphics[width=\linewidth]{figs/motivated_example}
\caption{An illustration of how data is distributed in the health care example. The clinics serve the same set of patients but hold different attributes, such as blood test results, CT images, and the degree of liver cancer.}
\label{fig:motivating_example}
\end{figure}

The machine learning community has greatly benefited from open and public datasets~\citep{chapelle2011yahoo, real2017youtube, fast2017long, klcw20}. Unfortunately, privacy concerns significantly limit the feasibility of releasing many rich and useful datasets to the public, especially in privacy-sensitive domains such as health care, finance, and government. This restriction considerably slows down research in those areas, as well as machine learning research in general, given that many of today's algorithms are data-hungry. Recently, legal and moral concerns about protecting individual privacy have become even greater. Many jurisdictions have imposed strict regulations on the usage and release of sensitive data, e.g., CCPA~\citep{ccpa}, HIPAA~\citep{act1996health}, and GDPR~\citep{GDPR}. The tension between protecting privacy and promoting research leaves the community, as well as many ML practitioners, in a dilemma.

\textit{Differential privacy} (DP) 
\citep{dwork2011firm, dwork2006calibrating, dwork2014algorithmic, sheffet2017differentially, lee2019synthesizing, xu2017dppro, kenthapadi2012privacy} has been shown to be a promising direction for releasing datasets while protecting individual privacy. DP provides a formal definition of privacy that regulates the trade-off between two conflicting goals: protecting sensitive information and maintaining data utility.
In a DP data release mechanism, the shared dataset is a function of the aggregate of all private samples, and the DP guarantee regulates how difficult it is for anyone to infer the attributes or identity of any individual sample.
With high probability, the public data would be barely affected if any single sample were replaced. 
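
Concretely, in the standard $(\epsilon,\delta)$ formulation~\citep{dwork2014algorithmic}, and with notation introduced here only for illustration, a randomized mechanism $\mathcal{M}$ is $(\epsilon,\delta)$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in a single sample and every measurable set $S$ of outputs,
\[
\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta .
\]
Smaller values of $\epsilon$ and $\delta$ give a stronger guarantee, typically at the cost of utility.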

Despite ongoing progress in DP data release, most prior work focuses on the single-party setting, which assumes that only one party releases datasets to the public. However, in many real-world scenarios, multiple parties own data relevant to each other and want to collectively share the data as a whole with the public. For example, in the health care domain, some patients may visit multiple clinics for specialized treatments (Figure~\ref{fig:motivating_example}), and each clinic only has access to its own attributes (e.g., blood tests and CT images) collected from the patients. For the same set of patients, the attributes combined from all clinics can be more useful for training models. In general, the multi-party setting assumes that multiple parties own disjoint sets of attributes (features or labels) belonging to the same group of data subjects (e.g., patients).

One straightforward approach to releasing data in the multi-party setting is to combine the data from all parties in a centralized place (e.g., one of the data owners or a third party) and then release it using a private single-party data release approach. However, in a privacy-sensitive organization such as a clinic, sending data to another party is often prohibited by policy. An alternative approach is to let each party individually release its own data to the public by adding sample-wise Gaussian noise, so that ML practitioners can combine the released data to train models. However, models trained on data combined in this way show significantly lower utility than models trained on non-private data (confirmed by the experiments in Section~\ref{sec:exp}). To bridge this utility gap, we propose new algorithms specifically designed for the multi-party setting.

In summary, we study DP data release in the multi-party setting, where parties publicly share attributes of the same data subjects through a DP mechanism.
The released data protects the privacy of all data subjects and can be accessed by the public, including any party involved. To this end, we propose the following two differentially private algorithms, both based on the Gaussian mechanism~\citep{dwork2014algorithmic} in the context of linear regression.
First, in \textit{\methodonelong (\methodoneshort)}, each party adds Gaussian noise directly to its data. 
Using the public data, the learner can remove a calculated bias from the Hessian matrix.
However, we show that bias removal introduces a small-eigenvalue problem.
Hence, we propose a second method, \textit{\methodtwolong (\methodtwoshort)}.
A random Bernoulli projection matrix is shared with all parties, and each party uses it to project its data along the sample dimension before adding Gaussian noise.
We prove that both algorithms produce solutions that asymptotically converge to the optimal (i.e., non-private) solutions as the dataset size increases.
Through extensive experiments on both synthetic and real-world datasets, we show that the latter method is consistent with the theoretical claims and outperforms the first method, which naively adapts the Gaussian mechanism.
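
To make the two ideas concrete, the following is a rough sketch under simplifying assumptions that are ours and not necessarily the exact construction analyzed later. Let $X$ denote the $n \times d$ matrix obtained by concatenating the parties' attributes column-wise, let $\sigma$ be a noise scale assumed to be calibrated to the privacy budget, and let $Z$ be a noise matrix with i.i.d.\ $\mathcal{N}(0,\sigma^2)$ entries independent of $X$. If the released data is $\tilde{X} = X + Z$, then
\[
\mathrm{E}\bigl[\tilde{X}^\top \tilde{X}\bigr] = X^\top X + n\sigma^2 I_d ,
\]
so $\tilde{X}^\top \tilde{X} - n\sigma^2 I_d$ is an unbiased estimate of $X^\top X$. The subtraction shifts every eigenvalue of the positive semidefinite matrix $\tilde{X}^\top \tilde{X}$ down by $n\sigma^2$, so the corrected matrix can have eigenvalues close to zero or even negative, which makes its inverse unstable. For the projection-based variant, suppose (again purely as an assumption) that $B$ is a $k \times n$ matrix with i.i.d.\ entries equal to $\pm 1/\sqrt{k}$ with probability $1/2$ each, drawn independently of the noise, and that the released data is $\tilde{X} = BX + Z$ with $Z$ now of size $k \times d$. Since $\mathrm{E}[B^\top B] = I_n$, taking the expectation over both $B$ and $Z$ gives
\[
\mathrm{E}\bigl[\tilde{X}^\top \tilde{X}\bigr] = X^\top X + k\sigma^2 I_d ,
\]
so the second-moment matrix is preserved in expectation up to a term that scales with the projected dimension $k$ rather than with $n$. How $k$ and $\sigma$ should be chosen to meet a given privacy guarantee is outside the scope of this sketch.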