# ProgSyn Anonymous Code Release for Review
This repository is the anonymized code release for the ICLR 2024 submission entitled "Programmable Synthetic Data Generation".
Please follow the instructions below to reproduce our results. Note that this code is intended to be viewed only by reviewers, and only for review purposes. Please do not use or distribute it.

## Setup
Setting up the environment requires conda. If conda is not installed on your system, please follow the instructions [here](https://docs.conda.io/en/latest/miniconda.html). Once conda is installed, create the environment with the following command:
```
conda env create -f progsyn.yml
```
Before running any script or notebook in this repository, activate the above environment:
```
conda activate progsyn
```
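If a script fails with missing packages, a common cause is that the environment was not activated in the current shell. The following is a minimal sketch of a check, assuming the environment is named `progsyn` (as in the activation command above) and relying on the `CONDA_DEFAULT_ENV` variable that conda sets on activation. The helper names are hypothetical and not part of this repository.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if none is active."""
    env = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return env.get("CONDA_DEFAULT_ENV")


def is_progsyn_active(environ=None, expected="progsyn"):
    """Check whether the expected environment (assumed name: 'progsyn') is active."""
    return active_conda_env(environ) == expected


if not is_progsyn_active():
    print("Warning: run `conda activate progsyn` before using the scripts.")
```

For environments activated by path rather than by name, `CONDA_DEFAULT_ENV` may hold the full path, so the comparison above would need to be adapted.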

## Reproducing our Results from the Paper
The commands below reproduce and display all our results. The repository includes the pre-trained ProgSyn models, so the constrained experiments execute only the fine-tuning step, which saves a considerable amount of time. For the unconstrained experiments, ProgSyn is trained from scratch. If you also wish to train ProgSyn from scratch for the constrained experiments, execute the following command from the top of this directory:
```
rm -rf experiment_data/programmable_synthesizer_experiments
```
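Deleting the directory is irreversible. As an alternative, the sketch below renames the directory so that the pre-trained models can be restored later. The `.bak` suffix and the helper name `set_aside` are hypothetical choices, and the sketch assumes that the experiment scripts look only for the original directory name and ignore the renamed sibling.

```python
import shutil
from pathlib import Path

# Directory name taken from the command above.
PRETRAINED_DIR = Path("experiment_data/programmable_synthesizer_experiments")


def set_aside(directory, suffix=".bak"):
    """Rename `directory` to `<name><suffix>` next to it.

    Returns the backup path, or None if the directory does not exist.
    Raises FileExistsError rather than overwriting an earlier backup.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return None
    backup = directory.with_name(directory.name + suffix)
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.move(str(directory), str(backup))
    return backup


# Usage, from the top of this directory:
# set_aside(PRETRAINED_DIR)
```

To restore the pre-trained models, rename the backup directory back to its original name.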

### Non-Private Constraint Experiments on Adult
To reproduce all non-private constraint experiments on the Adult dataset, run:
```
python run_constraint_program_benchmark.py --dataset ADULT --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three_with_labels
python run_constraint_program_benchmark.py --dataset ADULT --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three_with_labels --baseline_mode
python run_constraint_program_rejection_sampling_baseline.py --dataset ADULT --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three_with_labels
```
The results can be viewed in the notebook: `constrained_benchmarks.ipynb`

### Private Constraint Experiments on Adult
To reproduce all private constraint experiments on the Adult dataset, run:
```
python run_constraint_program_benchmark.py --dataset ADULT --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three --epsilon 1.0
python run_constraint_program_benchmark.py --dataset ADULT --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three --epsilon 1.0 --baseline_mode
python run_constraint_program_rejection_sampling_baseline.py --dataset ADULT --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three --epsilon 1.0
```
The results can be viewed in the notebook: `constrained_benchmarks.ipynb`
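The constraint-experiment commands in this section and the previous one differ only in the script name, the workload, the optional `--epsilon` flag, and `--baseline_mode`. The following minimal sketch makes that structure explicit by generating the three commands for a given configuration. The helper name `constraint_commands` is hypothetical and not part of this repository; it only prints the commands and does not execute them.

```python
import shlex  # shlex.join requires Python 3.8 or newer

# Flags shared by all constraint-experiment commands listed above.
COMMON_FLAGS = ["--n_samples", "5", "--n_resamples", "5", "--random_seed", "42"]


def constraint_commands(dataset, workload, epsilon=None):
    """Build the ProgSyn, baseline-mode, and rejection-sampling commands."""
    flags = ["--dataset", dataset] + COMMON_FLAGS + ["--workload", workload]
    if epsilon is not None:
        # Only the private experiments pass a privacy budget.
        flags += ["--epsilon", str(epsilon)]
    benchmark = ["python", "run_constraint_program_benchmark.py"] + flags
    rejection = ["python", "run_constraint_program_rejection_sampling_baseline.py"] + flags
    return [benchmark, benchmark + ["--baseline_mode"], rejection]


# Dry run: print the private Adult commands without executing them.
for cmd in constraint_commands("ADULT", "all_three", epsilon=1.0):
    print(shlex.join(cmd))
```

Calling `constraint_commands("ADULT", "all_three_with_labels")` yields the non-private commands instead.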

### Stacking Constraints on Adult
To reproduce the results on stacking logical constraints on Adult, run:
```
python run_adult_ablation.py --option 2 --n_samples 5 --n_resamples 5 --random_seed 42
python run_adult_ablation.py --option 2 --n_samples 5 --n_resamples 5 --random_seed 42 --baseline_mode
```
The results can be viewed in the notebook: `adult_ablation.ipynb`

To reproduce the results on stacking constraints of different types on Adult, run:
```
python run_adult_ablation.py --option 1 --n_samples 5 --n_resamples 5 --random_seed 42
python run_adult_ablation.py --option 1 --n_samples 5 --n_resamples 5 --random_seed 42 --baseline_mode
```
The results can be viewed in the notebook: `adult_ablation.ipynb`

### Health Heritage Experiments
To reproduce the results on the logical constraints on Health Heritage, run:
```
python run_constraint_program_benchmark.py --dataset HealthHeritage --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three_with_labels
python run_constraint_program_benchmark.py --dataset HealthHeritage --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three_with_labels --baseline_mode
python run_constraint_program_rejection_sampling_baseline.py --dataset HealthHeritage --n_samples 5 --n_resamples 5 --random_seed 42 --workload all_three_with_labels
```
The results can be viewed in the notebook: `constrained_benchmarks.ipynb`

To reproduce Figure 2 and the corresponding accuracy result, run the notebook:
```
health_condition_entropy.ipynb
```

### Unconstrained Data Generation (tables in Appendix D)
To reproduce our results for non-private unconstrained data generation on Adult, run:
```
python train_programmable_synthesizers_non_dp.py --dataset ADULT --n_samples 5 --workload all_three_with_labels --random_seed 42
python run_non_dp_benchmark_evals.py --dataset ADULT --model ProgSyn --n_samples 5 --n_resamples 5 --workload all_three_with_labels --random_seed 42
```
The results can be viewed in the notebook: `unconstrained_benchmarks.ipynb`

To reproduce our results for non-private unconstrained data generation on Health Heritage, run:
```
python train_programmable_synthesizers_non_dp.py --dataset HealthHeritage --n_samples 5 --workload all_three_with_labels --random_seed 42
python run_non_dp_benchmark_evals.py --dataset HealthHeritage --model ProgSyn --n_samples 5 --n_resamples 5 --workload all_three_with_labels --random_seed 42
```
The results can be viewed in the notebook: `unconstrained_benchmarks.ipynb`

To reproduce our results for private unconstrained data generation on Adult, run:
```
python train_programmable_synthesizers_dp.py --dataset ADULT --n_samples 5 --workload all_three --single_setup --epsilon 1.0 --random_seed 42
python run_dp_benchmark_evals.py --dataset ADULT --algorithm ProgSyn --workload all_three --n_samples 5 --n_resamples 5 --random_seed 42
```
The results can be viewed in the notebook: `unconstrained_benchmarks.ipynb`

To reproduce our results for private unconstrained data generation on Health Heritage, run:
```
python train_programmable_synthesizers_dp.py --dataset HealthHeritage --n_samples 5 --workload all_three --single_setup --epsilon 1.0 --random_seed 42
python run_dp_benchmark_evals.py --dataset HealthHeritage --algorithm ProgSyn --workload all_three --n_samples 5 --n_resamples 5 --random_seed 42
```
The results can be viewed in the notebook: `unconstrained_benchmarks.ipynb`
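Each unconstrained experiment above is a training command followed by an evaluation command, so the evaluation should start only after training has finished successfully. Note also that the non-private evaluation script takes `--model ProgSyn`, while the private one takes `--algorithm ProgSyn`. The sketch below, with hypothetical helper names that are not part of this repository, builds both commands and runs them in order, stopping at the first failure.

```python
import shlex  # shlex.join requires Python 3.8 or newer
import subprocess


def unconstrained_commands(dataset, private=False):
    """Return [train, evaluate] argument lists as given in this section."""
    if private:
        train = ["python", "train_programmable_synthesizers_dp.py", "--dataset", dataset,
                 "--n_samples", "5", "--workload", "all_three", "--single_setup",
                 "--epsilon", "1.0", "--random_seed", "42"]
        evaluate = ["python", "run_dp_benchmark_evals.py", "--dataset", dataset,
                    "--algorithm", "ProgSyn", "--workload", "all_three",
                    "--n_samples", "5", "--n_resamples", "5", "--random_seed", "42"]
    else:
        train = ["python", "train_programmable_synthesizers_non_dp.py", "--dataset", dataset,
                 "--n_samples", "5", "--workload", "all_three_with_labels",
                 "--random_seed", "42"]
        evaluate = ["python", "run_non_dp_benchmark_evals.py", "--dataset", dataset,
                    "--model", "ProgSyn", "--n_samples", "5", "--n_resamples", "5",
                    "--workload", "all_three_with_labels", "--random_seed", "42"]
    return [train, evaluate]


def run_in_order(commands, runner=subprocess.run):
    """Run commands sequentially; check=True raises on a non-zero exit code,
    so evaluation never starts after a failed training run."""
    for cmd in commands:
        print("running:", shlex.join(cmd))
        runner(cmd, check=True)


# Usage, from the top of this directory:
# run_in_order(unconstrained_commands("HealthHeritage", private=True))
```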

### Reproducing All Results at Once
To reproduce all results at once, run:
```
chmod +x reproduce_all_results.sh
./reproduce_all_results.sh
```
The results can then be viewed in the corresponding notebooks.

## Datasets Used
The repository contains the raw data for the two datasets used: UCI Adult [1] and the Health Heritage Prize from Kaggle [2].

-----
[1] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.<br>
[2] Kaggle. Health heritage prize. https://www.kaggle.com/c/hhp, 2023. Accessed: May 17, 2023.