# Secure Distributed DPHelmet

This archive contains the code to reproduce the `DP_SVM_SGD` and `DP_Softmax_SLP_SGD` accuracy numbers for the ICLR submission "Distributed DPHelmet: Differentially Private Non-interactive Convex Blind Averaging".

As we rely on output perturbation, we add multivariate Gaussian noise to the trained models after running `SVM_SGD` and `Softmax_SLP_SGD`. Hence, we provide the code that we used for training `SVM_SGD` (see `SingleInvocationDPHelmet.py`) and `Softmax_SLP_SGD` (see `SingleInvocationDPHelmet_softmax.py`). As discussed in the section `Multivariate Gaussian noise calibration`, we ran `dphelmet_tight_adp.py` to get tight ($\varepsilon$, $\delta$)-DP bounds for the non-subsampled multivariate Gaussian mechanism. For subsampled Gaussian mechanisms, as used in DP-FL, we ran `privacy_buckets_dpsgd.py` instead.
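As a rough illustration of output perturbation, the sketch below adds independent Gaussian noise to every coordinate of an already trained weight vector. It is not taken from `SingleInvocationDPHelmet.py`. The function name `perturb_weights` and the parameter `noise_std` are hypothetical, and the sketch assumes the noise scale has already been calibrated elsewhere.

```python
import random


def perturb_weights(weights, noise_std, seed=None):
    """Return a noisy copy of `weights` (hypothetical helper, not the repo's API).

    Each coordinate receives independent N(0, noise_std^2) noise, which
    together forms a spherical multivariate Gaussian. Choosing `noise_std`
    for a target privacy budget is assumed to happen elsewhere.
    """
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, noise_std) for w in weights]


# Usage: perturb a toy 3-dimensional weight vector.
trained = [0.5, -1.2, 3.0]
private = perturb_weights(trained, noise_std=0.1, seed=0)
```

Because the noise is added only once, after training, the training routine itself stays unchanged.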

## Requirements
* We used Python 3.x.
* We used Tensorflow 2.x.

## Instructions

Run `extract_embeddings.py` first, which creates two embedding files: `code_space.npy` and `labels.npy`.
Afterward, run `SingleInvocationDPHelmet.py` to train the distributed `DP_SVM_SGD`. This generates a file `tests_dphelmet_<datetime>.csv` listing all experiment results, i.e., the accuracies and macro F1-scores after noise. To generate Figures 3, 4, and 5 of the paper (DPHelmet variant only), refer to `example_viz.py`. Note that this part requires adaptation depending on the experiment configuration in use.

### Resources / Runtimes
Expect this algorithm to take some time. In addition to the CIFAR-10 dataset (~162MB), it has to download about 3.3GB for the pre-trained model.
Extracting the embeddings is highly dependent on your GPU resources and takes about an hour with good resources. Try lowering the batch size if too much GPU-RAM is consumed.

Running the cross-validation search of distributed DP-Helmet requires some additional CPU resources. Expect about 10min for one parameter configuration (for 1000 users and 50 data points each).
The current default is 48 hyperparameter configurations with 12 runs each. Thus, we highly recommend parallelizing the search by adjusting `N_PROCESSES`.
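To get a feeling for suitable `N_PROCESSES` values, the following back-of-the-envelope sketch estimates the wall-clock time of the search. The text above does not say whether the ~10 min figure covers a single run or all 12 runs of a configuration. The sketch therefore treats every (configuration, run) pair as one job lasting `minutes_per_job`, which is an assumption to adjust. The function name is hypothetical.

```python
import math


def estimated_wall_clock_hours(n_configs=48, runs_per_config=12,
                               minutes_per_job=10.0, n_processes=1):
    """Rough estimate assuming equally long jobs and perfect parallel scaling."""
    jobs = n_configs * runs_per_config
    # Jobs are processed in "waves" of at most n_processes at a time.
    waves = math.ceil(jobs / n_processes)
    return waves * minutes_per_job / 60.0


# Usage: compare a sequential search with one using 8 worker processes.
sequential_hours = estimated_wall_clock_hours(n_processes=1)
parallel_hours = estimated_wall_clock_hours(n_processes=8)
```

If the ~10 min figure already includes all 12 runs, set `runs_per_config=1` to obtain the corresponding estimate.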

The last part of the cross-validation search (where the noise is added) additionally consumes a few hours depending on the number of parameters (this part is not parallelized).

## Reconstructing DP-FL

For DP-FL, we are generous and assume a noise overhead of only sqrt(#users), as we are not aware of any techniques (short of SMPC) that achieve less than sqrt(#users) noise overhead.
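One way to see where a sqrt(#users) factor can come from: if each user adds independent Gaussian noise of standard deviation `sigma` locally, the sum of those noise terms has standard deviation `sigma * sqrt(#users)`. The sketch below checks this numerically. It illustrates the general fact only and is not part of the DP-FL code. All names are hypothetical.

```python
import math
import random
import statistics


def summed_noise_std(n_users, sigma):
    """Theoretical std of the sum of n_users independent N(0, sigma^2) draws."""
    return sigma * math.sqrt(n_users)


def empirical_summed_noise_std(n_users, sigma, trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    sums = [sum(rng.gauss(0.0, sigma) for _ in range(n_users))
            for _ in range(trials)]
    return statistics.pstdev(sums)


# Usage: with 1000 users, the overhead factor is sqrt(1000), roughly 31.6.
overhead_1000 = summed_noise_std(1000, 1.0)
empirical_100 = empirical_summed_noise_std(100, 1.0)
```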

We did not automate the DP-FL pipeline, but it only uses standard tools. The following steps reconstruct the DP-FL results.

1. Extract the CIFAR-10 embeddings via `extract_embeddings.py`.
2. Download and install opacus v0.15.0 (e.g., via `pip3 install opacus==0.15.0`).
3. Run our DP-FL code (modify the `sigma` hyperparameter for different privacy budgets). Results are saved at `run_results_***.npy`.

        python3 dpsgd_cifar10_opacus.py

4. Compute a tighter privacy budget by running the provided privacy buckets program.

        git clone https://github.com/sommerda/privacybuckets.git
        python3 privacy_buckets_dpsgd.py


## Multivariate Gaussian noise calibration

For the `DP_SVM_SGD` and `DP_Softmax_SLP_SGD` classification accuracy results, we obtain tight ($\varepsilon$, $\delta$)-DP bounds with `dphelmet_tight_adp.py`.

The tool provides tight sequential composition bounds for one-dimensional output perturbation mechanisms. A d-dimensional spherical multivariate Gaussian distribution (i.e., with a diagonal covariance matrix $\sigma^2 I$) can be represented as the product distribution (i.e., the joint distribution) of d independent and identical 1-dimensional Gaussian distributions. Hence, the leakage of the spherical multivariate Gaussian mechanism is the same as that of the sequential composition of d 1-dimensional Gaussian mechanisms.
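The product-distribution argument can be checked numerically. The sketch below compares the privacy loss (log-density ratio) of a spherical multivariate Gaussian under two different means with the sum of the per-coordinate 1-dimensional privacy losses. It only illustrates the identity and does not reproduce what `dphelmet_tight_adp.py` computes. All function names are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, sigma):
    """Log-density of the 1-dimensional N(mean, sigma^2) at x."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mean) ** 2 / (2 * sigma ** 2))


def spherical_log_pdf(x, mean, sigma):
    """Log-density of the d-dimensional N(mean, sigma^2 * I) at x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return (-0.5 * d * math.log(2 * math.pi * sigma ** 2)
            - sq_dist / (2 * sigma ** 2))


def privacy_loss_joint(x, mean_a, mean_b, sigma):
    """Log-ratio of the two d-dimensional densities at output x."""
    return spherical_log_pdf(x, mean_a, sigma) - spherical_log_pdf(x, mean_b, sigma)


def privacy_loss_composed(x, mean_a, mean_b, sigma):
    """Sum of the d one-dimensional log-ratios, as in sequential composition."""
    return sum(gaussian_log_pdf(xi, a, sigma) - gaussian_log_pdf(xi, b, sigma)
               for xi, a, b in zip(x, mean_a, mean_b))


# Usage: both views yield the same privacy loss for an arbitrary output x.
x = [0.3, -1.1, 2.0]
mean_a = [0.0, 0.0, 0.0]
mean_b = [0.5, -0.2, 0.1]
joint = privacy_loss_joint(x, mean_a, mean_b, sigma=1.5)
composed = privacy_loss_composed(x, mean_a, mean_b, sigma=1.5)
```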

To reproduce our results, please follow these steps:

1. Run our TightADP script:

        python3 dphelmet_tight_adp.py
