# v2 -- InstaHide Feature-Space Mixup Experiments

## Prerequisites

Some datasets are too large to bundle and must be available locally. Set these environment variables before running any script that uses them:

```bash
export CIFAR_5M_FULL_PATH="/path/to/cifar5m"   # directory containing part0.npz ... part5.npz
export IMAGENET_FULL_PATH="/path/to/imagenet"   # directory with parquet shards
```

Smaller datasets (`mnist`, `cifar10`, `cifar100`, `tiny-imagenet`) are downloaded automatically to `./data`.
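
To fail early when a path is missing, a pre-flight check along these lines can help. This is a minimal sketch, not code from the repository: the helper `resolve_dataset_dir` and the mapping `REQUIRED_DATASET_VARS` are hypothetical, and the `imagenet` key is an assumed dataset name. Only the two variable names come from this README.

```python
import os
from pathlib import Path

# Variable names are taken from this README; the mapping keys and the
# helper below are hypothetical and only illustrate the check.
REQUIRED_DATASET_VARS = {
    "cifar5m": "CIFAR_5M_FULL_PATH",
    "imagenet": "IMAGENET_FULL_PATH",
}


def resolve_dataset_dir(dataset, env=None):
    """Return the local directory for a large dataset, or raise a clear error."""
    env = os.environ if env is None else env
    var = REQUIRED_DATASET_VARS[dataset]
    value = env.get(var)
    if not value:
        raise RuntimeError(f"{var} is not set; export it before using '{dataset}'")
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"{var}={value!r} is not an existing directory")
    return path
```

Calling `resolve_dataset_dir("cifar5m")` before any data loading turns a late file-not-found failure into an immediate, readable error.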

## Scripts

### utils.py

Shared utilities used by all other scripts: device selection, deterministic seeding, dataset and transform loading, normalization helpers, noise generation, the classifier, feature-space mixup, and visualization.

### augment.py

Single-machine feature-space mixup training with a frozen ResNet backbone and a dense classifier. Supports resuming from a checkpoint via `--checkpoint`.

```bash
python augment.py --model resnet18 --dataset cifar10 --epochs 200 --bench
python augment.py --model resnet50 --dataset tiny-imagenet --epochs 100 --radius 631.28 --lr 0.001

# Resume from a saved checkpoint
python augment.py --model resnet18 --dataset cifar10 --epochs 200 --bench \
    --checkpoint checkpoints/best_instahide_classifier_resnet18_cifar10_200epochs.pth
```
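
For orientation, the core idea of mixing in feature space can be sketched as below. This is a generic pairwise mixup over precomputed feature vectors, written in plain Python so it runs without a deep-learning framework. It is an assumption-laden illustration, not the routine in `utils.py`: the function name `mixup_features`, the fixed weight `lam`, and the random-partner pairing are all hypothetical, and the role of `--radius` is not modeled.

```python
import random


def mixup_features(features, labels, num_classes, lam, rng):
    """Mix each feature vector with a randomly chosen partner (hypothetical helper).

    features: list of equal-length float lists, e.g. frozen-backbone outputs
    labels:   list of integer class ids, one per feature vector
    lam:      weight on the original sample, in [0, 1]
    Returns (mixed_features, soft_labels).
    """
    n = len(features)
    partners = [rng.randrange(n) for _ in range(n)]
    mixed, soft = [], []
    for i, j in enumerate(partners):
        # Convex combination of the two feature vectors.
        mixed.append(
            [lam * a + (1.0 - lam) * b for a, b in zip(features[i], features[j])]
        )
        # The label is mixed with the same weights, giving a soft target.
        target = [0.0] * num_classes
        target[labels[i]] += lam
        target[labels[j]] += 1.0 - lam
        soft.append(target)
    return mixed, soft
```

The dense classifier would then be trained on `mixed` against `soft`, while the backbone that produced `features` stays frozen.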

### efficient_augment.py

Same workflow as `augment.py`, but with EfficientNet (B0 or V2-S) as the backbone. Also supports resuming via `--checkpoint`.

```bash
python efficient_augment.py --dataset cifar10 --epochs 10 --v2
python efficient_augment.py --dataset cifar100 --epochs 20 --lr 0.01 --quick

# Resume from a saved checkpoint
python efficient_augment.py --dataset cifar5m --epochs 200 --v2 \
    --checkpoint checkpoints/best_instahide_classifier_effnet_v2_cifar5m_200epochs.pth
```

### federated_augment.py

Federated learning experiment that compares a single-party baseline (no mixup) against mixup-union across all parties, with the data split equally among parties.

```bash
python federated_augment.py --epochs 5 --bench --radius 1.0
```

### collaborative_training.py

Compares FedProx against mixup-union under Dirichlet non-IID splits, with a per-party radius derived from `--tau`. Supports size-skew experiments (one data-poor party via `--rho`) and a minimum party size guarantee (`--min-party-size`).

```bash
python collaborative_training.py --num-parties 10 --tau 1e-6 --epochs 5 --bench
python collaborative_training.py --num-parties 20 --tau 1e-3 --dirichlet-alpha 0.1 --mu 0.01

# Size-skew: party 0 gets rho=0.1 of the data (10x smaller than others)
python collaborative_training.py --num-parties 5 --tau 1e-6 --epochs 10 --bench \
    --rho 0.1 --poor-party 0 --min-party-size 50
```
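
A Dirichlet non-IID split is commonly built by drawing, for each class, a vector of party proportions from a symmetric Dirichlet distribution. The sketch below shows that common construction using only the standard library. It is an assumption about the general technique, not the script's actual code: `dirichlet_split` is a hypothetical name, and the `--rho` size skew and `--min-party-size` guarantee are not modeled.

```python
import random
from collections import defaultdict


def dirichlet_split(labels, num_parties, alpha, seed=0):
    """Partition sample indices across parties with label skew (hypothetical helper).

    For each class, party proportions are drawn from Dirichlet(alpha, ..., alpha).
    Smaller alpha gives more skewed, less IID splits.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    parties = [[] for _ in range(num_parties)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_parties)]
        # If every draw underflows to 0, fall back to 1.0 so the division is
        # defined; the last party then receives the whole class.
        total = sum(gammas) or 1.0
        # Turn proportions into cumulative cut points over this class's indices.
        cuts, acc = [], 0.0
        for g in gammas[:-1]:
            acc += g / total
            cuts.append(int(round(acc * len(idxs))))
        start = 0
        for party, end in enumerate(cuts + [len(idxs)]):
            parties[party].extend(idxs[start:end])
            start = end
    return parties
```

With `alpha=0.1`, most classes end up concentrated in a few parties; with large `alpha`, the split approaches a uniform one.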

### radius_approx.py

Estimates feature-space displacement (radius) induced by dataset-calibrated noise levels across datasets and tau values.

```bash
python radius_approx.py --backbone resnet18 --datasets cifar10 cifar100
python radius_approx.py --backbone resnet50 --datasets tiny-imagenet --taus 1e-2 1e-4 1e-6
```
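
Conceptually, the displacement is the distance between the features of a clean input and the features of its noisy copy. The sketch below illustrates one way to estimate it, under several assumptions that may not match the script: Gaussian noise with a given `sigma` (how `sigma` is calibrated from tau is not covered), the Euclidean distance, and a mean over samples and trials. `mean_feature_displacement` and `feature_fn` are hypothetical names; `feature_fn` stands in for a frozen backbone.

```python
import math
import random


def mean_feature_displacement(samples, feature_fn, sigma, trials=8, seed=0):
    """Average L2 distance between clean and noisy features (hypothetical helper).

    samples:    list of flat float lists (inputs)
    feature_fn: maps an input list to a feature list; stands in for a backbone
    sigma:      standard deviation of the additive Gaussian noise
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for x in samples:
        clean = feature_fn(x)
        for _ in range(trials):
            noisy_input = [v + rng.gauss(0.0, sigma) for v in x]
            total += math.dist(clean, feature_fn(noisy_input))
            count += 1
    return total / count if count else 0.0
```

Swapping the mean for a maximum or a high percentile gives a more conservative radius; which statistic the script reports is not stated here.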

### linear_attack.py

Linear reconstruction attack on mixup, assuming a known mixing graph and using total-variation (TV) and L2 priors. Reports SNR, SSIM, and LPIPS.

```bash
python linear_attack.py --datasets cifar10 --taus 1e-1 1e-3 1e-6 --sub_size 256
python linear_attack.py --datasets mnist cifar10 cifar100 --attack_steps 300 --seed 42
```
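
Of the three reported metrics, SNR is the simplest to state. The sketch below uses the usual definition, signal energy over reconstruction-error energy in decibels. Whether `linear_attack.py` uses exactly this definition (for example, with or without mean subtraction) is an assumption, and `snr_db` is a hypothetical name.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between flat pixel sequences (hypothetical helper)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0.0:
        return math.inf  # perfect reconstruction
    if signal == 0.0:
        return -math.inf  # all-zero reference with a nonzero error
    return 10.0 * math.log10(signal / noise)
```

A higher value means the reconstruction is closer to the original image, so from a privacy standpoint lower SNR under attack is better.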

### non_linear_attack.py

Non-linear (U-Net) reconstruction attack. Trains on a public dataset and evaluates zero-shot on CIFAR-10.

```bash
python non_linear_attack.py --taus "1e-1, 1e-3, 1e-6" --epochs 30 --seed 1137
python non_linear_attack.py --curated --taus "1e-2, 1e-4" --epochs 50
```

### eval_metrics.py

Evaluates saved U-Net checkpoints (SNR, SSIM, LPIPS) without retraining.

```bash
python eval_metrics.py --checkpoint unet_attack_figs/seed_1137/unet_tau0.01_seed1137.pt --seed 1137 --alpha 0.7 --tau 0.01
python eval_metrics.py --checkpoint-dir unet_attack_figs/seed_1137 --seed 1137 --alpha 0.7
```

### outliers.py

LPIPS-based outlier detection: measures how many mixup images are perceptually too similar to their originals.

```bash
# Analysis mode: sweep tau at fixed LPIPS threshold
python outliers.py --data cifar10 --subset_size 512 --analysis
python outliers.py --data cifar100 --subset_size 1024 --analysis --analysis_lpips_th 0.5

# Single-run mode: fixed tau and LPIPS threshold
python outliers.py --data cifar10 --tau 1e-6 --lpips_th 0.6 --subset_size 512

# With --inspect: also saves PNGs of sample pairs per threshold
python outliers.py --data cifar10 --tau 1e-6 --lpips_th 0.6 --subset_size 512 --inspect
```
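
The counting step behind this check can be sketched independently of LPIPS. In the example below, `distance` is any callable that returns a perceptual distance for an (original, mixed) pair; in practice that would be an LPIPS model, which is not included here. The name `outlier_rate`, the one-to-one pairing, and the strict `<` comparison are assumptions rather than the script's confirmed behavior.

```python
def outlier_rate(originals, mixed, distance, threshold):
    """Fraction of (original, mixed) pairs closer than threshold (hypothetical helper)."""
    if len(originals) != len(mixed):
        raise ValueError("originals and mixed must have the same length")
    if not originals:
        return 0.0
    flagged = sum(
        1 for o, m in zip(originals, mixed) if distance(o, m) < threshold
    )
    return flagged / len(originals)


def mean_abs_diff(a, b):
    """Stand-in distance for demonstration only; this is not LPIPS."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

A pair is flagged when the mixed image stays too close to its source, which suggests the mixing did not hide the original well at that tau.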

### generate_curated_tiny_imagenet.py

Creates a curated Tiny-ImageNet subset matching CIFAR-10 semantic classes. Edit paths in the script before running.

```bash
python generate_curated_tiny_imagenet.py
```

### imagenet_dataset.py

Parquet-backed ImageNet-1K dataset loaders (streaming and preloaded). Used internally by other scripts; not run directly.
