# Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning

This repository contains the code to reproduce the experiments in *Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning*.

This repository builds on the code of [On the Efficacy of Differentially Private Few-shot Image Classification](https://arxiv.org/pdf/2302.01190.pdf).

## Dependencies
This code requires the following:
* Python 3.8 or greater
* PyTorch 1.11 or greater (most of the code is written in PyTorch)
* opacus 1.3 or greater
* optuna 3.0 or greater
* TensorFlow 2.8 or greater (for reading VTAB datasets)
* TensorFlow Datasets 4.5.2 or greater (for reading VTAB datasets)
* statsmodels (for fitting the linear regression model)
* pandas and numpy

## Source Code Libraries
In this codebase, we rely on the following open source code libraries, some of which we have modified:
- TIMM (for the PyTorch ViT-B implementation): Copyright 2020 Ross Wightman https://github.com/rwightman/pytorch-image-models
- Big Transfer (for the R-50 implementation): Copyright 2020 Google LLC https://github.com/google-research/big_transfer
- TensorFlow Privacy (for the LiRA implementation): Copyright 2022, The TensorFlow Authors https://github.com/tensorflow/privacy
- cambridge-mlg/dp-few-shot (for the caching of features) https://github.com/cambridge-mlg/dp-few-shot
- privacytrustlab/ml_privacy_meter (for RMIA): https://github.com/privacytrustlab/ml_privacy_meter

## GPU Requirements
The experiments in the paper were run on CPU (Head experiments) or on an NVIDIA V100 GPU with 40 GB (FiLM experiments).

## Installation for LiRA on ViT-B/R-50 (Head) (Section 4)
### Installation
The following steps will take a considerable length of time and disk space.

1. Clone or download this repository.
2. Install the dependencies listed above.
3. The experiments use datasets obtained from [TensorFlow Datasets](https://www.tensorflow.org/datasets).
   The majority of these are downloaded and pre-processed upon first use. However, the
   [Resisc45](https://www.tensorflow.org/datasets/catalog/resisc45) dataset needs to be
   downloaded manually. Click on the links for details.
4. Switch to the ```src``` directory in this repo and download the BiT pretrained model:

   ```wget https://storage.googleapis.com/bit_models/BiT-M-R50x1.npz```
5. Copy the [timm folder](https://github.com/cambridge-mlg/dp-few-shot/tree/main/src/timm) to `section4/timm`.

## Cached head LiRA
It is more computationally efficient to cache the feature representations once and load them afterwards, so that only the final layer (the head) has to be trained.
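
To illustrate why caching helps, the following minimal sketch trains only a linear head on precomputed features. It is not the repository's training code: the function names are hypothetical, the head is a plain softmax regression written with NumPy, and the synthetic Gaussian features merely stand in for features loaded from the cache.

```python
import numpy as np

def train_linear_head(features, labels, n_classes, epochs=200, lr=0.5):
    """Fit a softmax-regression head on precomputed (cached) features."""
    n_examples, feature_dim = features.shape
    weights = np.zeros((feature_dim, n_classes))
    bias = np.zeros(n_classes)
    one_hot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss with respect to the logits.
        grad = (probs - one_hot) / n_examples
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict(features, weights, bias):
    """Return the predicted class index for each feature vector."""
    return (features @ weights + bias).argmax(axis=1)

# Assumption: synthetic stand-in for cached features (two separated blobs).
rng = np.random.default_rng(0)
features = np.concatenate(
    [rng.normal(-2.0, 1.0, (50, 8)), rng.normal(2.0, 1.0, (50, 8))]
)
labels = np.array([0] * 50 + [1] * 50)

weights, bias = train_linear_head(features, labels, n_classes=2)
accuracy = (predict(features, weights, bias) == labels).mean()
```

Because the feature extractor never runs inside the training loop, each of the many shadow models only costs a small linear fit like the one above.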

### Cache feature representations

Use `section4_training/feature_space_cache/map_to_feature_space.py` to map a dataset to its feature representations and save them. This only has to be done once per dataset. For example:

```
python3 -m feature_space_cache.map_to_feature_space \
    --feature_extractor vit-b-16 \
    --dataset cifar10 \
    --examples_per_class -1 \
    --download_path_for_tensorflow_datasets [PATH] \
    --feature_dim_path [feature_dim_path]
```

### Run LiRA on Head models

Use the functions in `section4_training/lira/run_lira.py` to load the data, train head models and generate intermediate LiRA data. For example:

```
python3 -m lira.run_lira \
    --record_l2_norms \
    --n_classes -1 \
    --data_seed 0 \
    --shots 16 \
    --target_epsilon -1 \
    --seed 0 \
    --feature_extractor vit-b-16 \
    --dataset cifar10 \
    --num_shadow_models 256 \
    --number_of_trials 20 \
    --data_path [data path] \
    -c [checkpoint_dir]
```
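
When running the command for several values of `--shots`, a small driver script can help. The sketch below only reuses the flags shown above; the helper name `build_lira_command`, the shot values other than 16, and the paths `data/features` and `checkpoints/...` are hypothetical placeholders. It prints the commands instead of executing them.

```python
import shlex

def build_lira_command(shots, data_path, checkpoint_dir):
    """Return the argument list for one `lira.run_lira` invocation."""
    return [
        "python3", "-m", "lira.run_lira",
        "--record_l2_norms",
        "--n_classes", "-1",
        "--data_seed", "0",
        "--shots", str(shots),
        "--target_epsilon", "-1",
        "--seed", "0",
        "--feature_extractor", "vit-b-16",
        "--dataset", "cifar10",
        "--num_shadow_models", "256",
        "--number_of_trials", "20",
        "--data_path", data_path,
        "-c", checkpoint_dir,
    ]

# Assumption: illustrative shot counts and paths, adjust to your setup.
for shots in (16, 32, 64):
    command = build_lira_command(shots, "data/features", f"checkpoints/shots_{shots}")
    print(shlex.join(command))  # dry run: show the command only
```

To actually launch the runs, the `print` call could be replaced with `subprocess.run(command, check=True)`.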


Use the functions in `section4_training/lira/process_lira.py` to process the intermediate LiRA data files.
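
For orientation, the per-example score of the standard online LiRA attack can be sketched as follows. This is a generic illustration under stated assumptions, not the code in `process_lira.py`: the function names are hypothetical, and the inputs are assumed to be the confidences on the true label from shadow models trained with (IN) and without (OUT) the target example.

```python
import math
from statistics import mean, stdev

def logit_scale(confidence, eps=1e-12):
    """Map a probability to the real line; clip to avoid log(0)."""
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def lira_score(target_confidence, in_confidences, out_confidences, min_sigma=1e-6):
    """Log-likelihood ratio of 'member' versus 'non-member' for one example."""
    observed = logit_scale(target_confidence)
    stats_in = [logit_scale(c) for c in in_confidences]
    stats_out = [logit_scale(c) for c in out_confidences]
    # Fit one Gaussian per hypothesis; floor sigma to avoid division by zero.
    sigma_in = max(stdev(stats_in), min_sigma)
    sigma_out = max(stdev(stats_out), min_sigma)
    return (gaussian_logpdf(observed, mean(stats_in), sigma_in)
            - gaussian_logpdf(observed, mean(stats_out), sigma_out))

# Assumption: made-up confidences from four IN and four OUT shadow models.
in_confidences = [0.95, 0.97, 0.99, 0.96]
out_confidences = [0.5, 0.6, 0.4, 0.55]
score_likely_member = lira_score(0.97, in_confidences, out_confidences)
score_likely_nonmember = lira_score(0.5, in_confidences, out_confidences)
```

A larger score means the target model's confidence looks more like that of shadow models that saw the example during training.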

### Run RMIA on Head models (Section 4.2)

We reuse the output of the LiRA training (including the logits) and use `section4_training/rmia/split_into_files.py` to build the folder structure (a separate folder per model) that RMIA requires.

We run the RMIA attack with the [code by the authors](https://github.com/privacytrustlab/ml_privacy_meter/tree/master/research/2024_rmia) at git hash `173d4ad`. An example config for RMIA is provided in `section4_training/rmia/example_config_rmia.yaml`. Follow the instructions in that repository to set up and run the attack.

## LiRA on R-50 FiLM (Section 4.3) 

Follow the instructions in the [repository](https://github.com/cambridge-mlg/dp-few-shot) of *On the Efficacy of Differentially Private Few-shot Image Classification* to train FiLM models.

## Prediction Model (Section 4.3)

Use the functions in `section4_prediction_model/predict_mia_dataset.py` to train a prediction model.

The dataframe with the data needs to include at least the following columns:
- `n_classes`, `shots` and `feature_extractor`
- `0.1`, `0.01`, `0.001`, `0.0001`, `0.00001`, each holding the TPR measured at the FPR given by the column name.
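
As a rough sketch of how one row with these columns could be assembled from attack scores, consider the example below. The helper names `tpr_at_fpr` and `build_row` are hypothetical, the scores are made up, and it is an assumption that the FPR column names are stored as the strings listed above.

```python
import math

# Assumption: column names are kept as strings. Note that str(0.00001)
# yields '1e-05' in Python, so the names are not derived from floats.
FPR_COLUMNS = ["0.1", "0.01", "0.001", "0.0001", "0.00001"]

def tpr_at_fpr(member_scores, nonmember_scores, fpr):
    """Fraction of members flagged when at most `fpr` of non-members are flagged."""
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of non-members that may be flagged within the FPR budget.
    allowed = math.floor(fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return 1.0
    # Flagging strictly above this score yields at most `allowed` false positives.
    threshold = ranked[allowed]
    return sum(score > threshold for score in member_scores) / len(member_scores)

def build_row(n_classes, shots, feature_extractor, member_scores, nonmember_scores):
    """Assemble one record with the columns the prediction model expects."""
    row = {"n_classes": n_classes, "shots": shots, "feature_extractor": feature_extractor}
    for name in FPR_COLUMNS:
        row[name] = tpr_at_fpr(member_scores, nonmember_scores, float(name))
    return row

# Assumption: toy scores, four members and ten non-members.
row = build_row(10, 16, "vit-b-16", [8.5, 9.5, 5, 12], list(range(10)))
```

A list of such rows can be turned into the required dataframe with `pandas.DataFrame(rows)`.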

## Individual MIA vulnerability (Section 4.4)

To produce Figure 6, set the path in `produce_fig6.py` to the LiRA outputs and then run the script.

## Comparison between empirical models and universal DP bounds (Section 4.5)

Run `produce_tab1.py` to produce Table 1.
