# Augmented Mixup

Source code for the paper *Augmented Mixup Procedure for Privacy-Preserving Collaborative Training*, by M. Plesa et al.

## System configuration

We performed development and measurements on two platforms, each with a different hardware accelerator: a system with [CUDA](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-04.html) support and a single Nvidia L4 GPU, and a laptop with the Apple M4 Pro chip using [MPS](https://docs.pytorch.org/docs/stable/notes/mps.html).

Both systems ran Python `3.10.17` and PyTorch `2.7.1`. All other Python packages needed to run the experiments are listed in [requirements.txt](./src/requirements.txt). The code should also work with newer versions of PyTorch.
```bash
cd src/
python3 -m pip install -r requirements.txt
```

> At the time of development, our CUDA system had the toolkit version=`12.8.1-1`.
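Before running the experiments, it can help to check how the local environment compares with the versions listed above. The following sketch uses only the standard library; the helper name `check_versions` and the `TESTED` table are illustrative and not part of the repository.

```python
import sys
from importlib import metadata

# Versions reported in this README for the systems we measured on.
TESTED = {"python": "3.10.17", "torch": "2.7.1"}


def check_versions() -> dict:
    """Return {name: (installed, tested)}; installed is None if the package is missing."""
    installed = {"python": ".".join(map(str, sys.version_info[:3]))}
    try:
        installed["torch"] = metadata.version("torch")
    except metadata.PackageNotFoundError:
        installed["torch"] = None
    return {name: (installed[name], TESTED[name]) for name in TESTED}


if __name__ == "__main__":
    for name, (found, tested) in check_versions().items():
        status = "same" if found == tested else "differs"
        print(f"{name}: installed={found} tested={tested} ({status})")
```

A "differs" result is informational only: newer PyTorch releases are expected to work, and some builds report a local suffix after the version number.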

## Augmented Mixup - Implementation

Our code is available in: [`src/augment.py`](./src/augment.py).

This codebase implements a privacy-preserving data augmentation technique inspired by mixup to create training samples for downstream tasks. It works according to the description given in the paper.
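For readers unfamiliar with mixup, the sketch below shows only the generic idea the method builds on: a new sample is formed as a convex combination of `k` existing ones. It is not the augmented procedure from the paper (see `src/augment.py` for that). The function name `mix_k`, the toy data, and the way the weights are drawn are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def mix_k(samples: Sequence[Sequence[float]], k: int, rng: random.Random) -> List[float]:
    """Blend k randomly chosen samples with random weights that sum to 1."""
    chosen = rng.sample(list(samples), k)
    # Draw positive weights and normalise them so they sum to 1.
    raw = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(raw)
    weights = [w / total for w in raw]
    dim = len(chosen[0])
    return [sum(w * x[i] for w, x in zip(weights, chosen)) for i in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[float(i), float(-i)] for i in range(10)]  # toy 2-D "embeddings"
    print(mix_k(data, k=4, rng=rng))
```

Because the weights are non-negative and sum to 1, every mixed coordinate stays within the range spanned by the selected samples.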

### Usage

```bash
python3 augment.py --help
usage: Singularization [-h] [-k KVAL] [--mf MF] [-m MODEL] [-d DATA] [-e EPOCHS] [-b] [--lr LR] [-v]

Singularization Mixup

options:
  -h, --help            show this help message and exit
  -k KVAL, --kval KVAL  The number of random samples that will be used for generating the mixup dataset.
  --mf MF               Multiplicative factor. It will be used to scale the radius when sampling synthetic embeddings. Check
                        `sample_k_orthogonal_vectors`
  -m MODEL, --model MODEL
                        The architecture of the neural network to be used for feature (embedding) extraction.
  -d DATA, --data DATA  The dataset used for generating samples and training.
  -e EPOCHS, --epochs EPOCHS
                        Number of training epochs.
  -b, --bench           Set a seed and other useful PyTorch settings for the best reproducibility of experiments.
  --lr LR               The learning rate that will be used during training.
  -v, --verbose         Do not save the logs to a file when performing the training and benchmark.
```

Run the script with the `-b` flag to enable the "deterministic mode", which is recommended for benchmarks. With the `-v` flag, no log files are generated and output is only printed to the console.
```bash
python3 augment.py --model resnet18 --data cifar10 --epochs 2 -k 4 --mf 1 -b -v
```

```bash
python3 augment.py --model resnet34 --data cifar100 --epochs 200 -k 6 --mf 8 -b
```
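The reproducibility settings applied by `-b` are defined in `augment.py`. As a rough illustration of what such a mode typically involves, the sketch below seeds Python's RNG and, if PyTorch is installed, its generators and cuDNN flags. The function name `seed_everything` and the default seed are hypothetical, and the actual settings in the script may differ.

```python
import random


def seed_everything(seed: int = 0) -> None:
    """Seed the common RNGs; the PyTorch part is skipped if torch is not installed."""
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return
    # Seeds the random number generators on all devices.
    torch.manual_seed(seed)
    # Prefer deterministic cuDNN kernels over auto-tuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


if __name__ == "__main__":
    seed_everything(42)
    first = random.random()
    seed_everything(42)
    print(first == random.random())  # same seed, same draw
```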

> Keep in mind that `select_optimal_device()` automatically detects the best device that can be used for training and inference.
```python
if __name__ == "__main__":
    # ========== CONFIGURATION ==========
    batch_size = 128
    device = select_optimal_device()
```

By default, a batch size of 128 is used across the board.
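The body of `select_optimal_device()` is not reproduced in this README. A minimal sketch of how such a helper could be written is shown below. The name `pick_device_name` is hypothetical, and the preference order CUDA, then MPS, then CPU is an assumption made for this sketch. It returns a device name rather than a `torch.device` and falls back to `"cpu"` when PyTorch is missing, so it can be run as a quick check.

```python
def pick_device_name() -> str:
    """Return "cuda", "mps" or "cpu", in that (assumed) order of preference."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not ship the MPS backend module at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device_name())
```

The returned string can be passed to `torch.device(...)` when PyTorch is available.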

⚠️ If the script is executed on a Metal (MPS) device, it is recommended to run it with `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`. See more details on this option [here](https://www.codegenes.net/blog/pytorch_mps_high_watermark_ratio/).
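One way to apply this setting without exporting it in the shell is to launch the script from a small Python wrapper. The sketch below is only an illustration: the helper names `mps_env` and `run_augment` are hypothetical. The variable is passed to a fresh process on the assumption that PyTorch reads it at start-up.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional


def mps_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and set the MPS high-watermark ratio to 0.0 in the copy."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
    return env


def run_augment(args: List[str]) -> None:
    """Run augment.py (from within src/) in a child process with the adjusted environment."""
    subprocess.run([sys.executable, "augment.py", *args], env=mps_env(), check=True)


# Example call, mirroring the first usage command above:
# run_augment(["--model", "resnet18", "--data", "cifar10", "--epochs", "2",
#              "-k", "4", "--mf", "1", "-b", "-v"])
```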

### Output and logging

Output logs are stored in the `src/logs/` directory, which is created automatically *behind the scenes*. Each log file has a unique ID (the UNIX timestamp at which `augment.py` was started), which makes it easier to tell apart multiple experiments run with the same parameters.
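As an illustration of this naming scheme, the sketch below creates the log directory if needed and derives a log ID from a UNIX timestamp. The function name `new_log_path` and the `<id>.log` file-name pattern are hypothetical; the actual layout is defined in `augment.py`.

```python
import time
from pathlib import Path
from typing import Optional


def new_log_path(log_dir: Path, started_at: Optional[float] = None) -> Path:
    """Return <log_dir>/<unix_timestamp>.log, creating log_dir if it is missing."""
    log_dir.mkdir(parents=True, exist_ok=True)
    # Use the current time unless an explicit start time is given.
    log_id = int(time.time() if started_at is None else started_at)
    return log_dir / f"{log_id}.log"
```

Two runs with identical parameters started at different seconds get different IDs, so their logs do not overwrite each other.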

Model checkpoints are saved to `src/checkpoints` after training.

The method
```python
def prepare_and_validate_features(
        feature_extractor_type: str,
        remove_last_k_layers: int,
        dataset_type: str,
        device: torch.device,
        use_cache: bool) -> Tuple[Tuple[str, str,], Tuple[str, str], int]:
```
from [`augment.py`](./src/augment.py) handles feature generation: it loads a standard dataset (MNIST, CIFAR10, or CIFAR100), applies the transforms, and passes the samples through a ResNet18 or ResNet34 backbone (only its feature extractor layers). To save time, these features can be *cached* locally and loaded before applying mixup and training. The cached data is stored in `src/features_cache`.
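The caching step follows a common load-or-compute pattern. The sketch below illustrates it with `pickle` as a stand-in for whatever serialization `augment.py` actually uses; the function name `load_or_compute` and its parameters are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_file: Path, compute: Callable[[], Any], use_cache: bool = True) -> Any:
    """Return cached data if present; otherwise compute it and store it."""
    if use_cache and cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    data = compute()  # e.g. the expensive feature extraction pass
    if use_cache:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        with cache_file.open("wb") as fh:
            pickle.dump(data, fh)
    return data
```

With `use_cache=False` the features are always recomputed and nothing is written to disk.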

## Original InstaHide

This repository also contains benchmark scripts based on existing work: [**InstaHide: Instance-hiding Schemes for Private Distributed Learning**](https://arxiv.org/abs/2010.02772). The two modes of processing private datasets, InstaHide **Inside** and InstaHide **Cross**, are covered in the following subsections.
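As a reminder of what the baseline does, InstaHide mixes a private image with `k - 1` other images and then flips the sign of each pixel at random. In Inside mode the other images come from the private dataset itself, while Cross mode also draws from a public dataset. The sketch below illustrates this on flat lists of floats. It is a simplification and not the code in the benchmark scripts: the function name `instahide_encode` is hypothetical, the coefficients are drawn naively, and the coefficient constraints of the original scheme are omitted.

```python
import random
from typing import List, Sequence


def instahide_encode(target: Sequence[float], others: Sequence[Sequence[float]],
                     k: int, rng: random.Random) -> List[float]:
    """Mix `target` with k-1 images from `others`, then apply a random sign mask."""
    picked = [list(target)] + rng.sample(list(others), k - 1)
    raw = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(raw)
    coeffs = [r / total for r in raw]  # mixing coefficients, sum to 1
    mixed = [sum(c * img[i] for c, img in zip(coeffs, picked)) for i in range(len(target))]
    # Random sign flip, applied independently to every pixel.
    return [rng.choice((-1.0, 1.0)) * v for v in mixed]
```

In this simplified form, the two modes differ only in what is passed as `others`: the remaining private images for Inside, or a pool that also includes public images for Cross.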

### InstaHide Inside Mode

The script [`instahide_inside.py`](./src/instahide_inside.py) is adapted from the original implementation by *Huang et al.* ([source code](https://github.com/Hazelsuko07/InstaHide/blob/master/train_inside.py)). It can be executed with most of the original flags. Example commands:
```bash
python instahide_inside.py --mode instahide --klam 4 --data cifar10 --epoch 1 --model ResNeXt29_2x64d
```
```bash
python instahide_inside.py --mode mixup --klam 4 --data cifar100 --epoch 3 --model nasnet
```

### InstaHide Cross Mode

The script [`instahide_cross.py`](./src/instahide_cross.py) is adapted from the same implementation by *Huang et al.* It can be executed with most of the original flags. Example commands:
```bash
python instahide_cross.py --mode instahide --klam 8 --data mnist --pair --epochs 2
```
```bash
python instahide_cross.py --mode instahide --klam 8 --data mnist --pair --epochs 2 --model nasnet
```

*Note that both `instahide_inside.py` and `instahide_cross.py` contain several modifications compared to their original form. These changes were made only for flexibility and do not alter the core functionality of the algorithms involved.*


## Security

To run the average distance experiment:

```bash
python attack_images.py
```

To run the `mf` selection experiment:

```bash
python attack_images_wt.py
```

*The results of the security experiments may vary due to the inherent randomness involved in selecting noise values for the algorithm and in choosing images for plotting.*