# ExpoTab

## Installation

Set up the environment with `uv`:
```
pip install uv
uv venv
source .venv/bin/activate
uv sync
```

Add more packages as needed:
```
uv add numpy
```

## Datasets

Download and process the baseline datasets with the following commands:
```
python download_dataset.py
python process_dataset.py
```

Classification:
- **Adult**: https://archive.ics.uci.edu/dataset/2/adult
- **Default** of Credit Card Clients: https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients
- **Stroke** Prediction: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
- **Shoppers** Purchasing Intention: https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset

Multiclass classification:
- **Diabetes** Readmissions: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008

Regression:
- **Beijing** PM2.5: https://archive.ics.uci.edu/dataset/381/beijing+pm2+5+data
- **News** Popularity: https://archive.ics.uci.edu/dataset/332/online+news+popularity

Because Kaggle requires a login, the **Stroke** dataset must be downloaded manually.

To process a manually downloaded Kaggle dataset:
```
cd data
mkdir [dataname]
# Move the downloaded Kaggle dataset .csv file into the newly-created folder
cd ..
python process_dataset.py [dataname]
```
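The manual staging step above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `stage_kaggle_csv` and its arguments are hypothetical, and it copies the file rather than moving it so the original download stays in place.

```python
import shutil
from pathlib import Path


def stage_kaggle_csv(csv_path, dataname, data_root="data"):
    """Copy a manually downloaded CSV into <data_root>/<dataname>/.

    Hypothetical helper: it only mirrors the mkdir-and-move steps from the
    shell snippet above and does not call process_dataset.py.
    """
    src = Path(csv_path)
    if not src.is_file():
        raise FileNotFoundError(f"CSV file not found: {src}")

    # Equivalent of `cd data && mkdir [dataname]`
    target_dir = Path(data_root) / dataname
    target_dir.mkdir(parents=True, exist_ok=True)

    # Place the CSV inside the newly created folder, keeping its file name
    dest = target_dir / src.name
    shutil.copy2(src, dest)
    return dest
```

After staging the file, run `python process_dataset.py [dataname]` from the repository root as described above.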

## Training, Sampling and Evaluation


Available methods:
- `expotab` - one-step generation
- `tab_geoflow` - standard flow matching
- `tab_geodiff` - DDPM-based generation

All methods use the ExpoTab encoder and decoder.


### Parameters:
- `--k` (default=1, type=float)
- `--alpha` (default=0.5, type=float)
- `--p` (default=0.5, type=float)
- `--time_sampler` (default='uniform', type=str)
- `--sigma` (default=0, type=float)
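For reference, the flags and defaults listed above could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the function name `build_parser` is hypothetical, the `choices` lists are inferred from the methods and modes mentioned in this README, and the actual definitions in `main.py` may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(description="ExpoTab (illustrative sketch)")

    # Flags used in the training and sampling commands.
    # The choices below are assumptions taken from the README text.
    parser.add_argument("--dataname", type=str)
    parser.add_argument("--method", type=str,
                        choices=["expotab", "tab_geoflow", "tab_geodiff"])
    parser.add_argument("--mode", type=str, choices=["train", "sample"])

    # Parameters with the documented defaults and types.
    parser.add_argument("--k", type=float, default=1.0)
    parser.add_argument("--alpha", type=float, default=0.5)
    parser.add_argument("--p", type=float, default=0.5)
    parser.add_argument("--time_sampler", type=str, default="uniform")
    parser.add_argument("--sigma", type=float, default=0.0)
    return parser
```

Any flag that is not passed on the command line falls back to the default shown in the list.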


### Training:
```bash
# General form
python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_BASELINE_METHODS] --mode train

# Example
python main.py --dataname adult --method expotab --mode train
```

### Sampling:
```bash
# General form
python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_BASELINE_METHODS] --mode sample

# Example
python main.py --dataname adult --method expotab --mode sample
```

### Evaluation:

We evaluate the models with the following benchmarks: Machine Learning Efficiency (AUC, RMSE, F1 scores), Density Estimation (pair-wise column correlation and column density estimation), Quality ($\alpha$-precision, $\beta$-recall), Detection (Classifier Two-Sample Test), DCR (Distance to Closest Record), and Privacy Preservation (Membership Inference Attacks).

For ```eval_privacy.py```, an extra step is required to set up the package. Inside the ```syntheval``` pip installation in the original ```tabdpo``` environment (i.e. ```python3.10/site-packages/syntheval/presets```), create a new ```mia.json``` file with the following content:

```
{
    "mia"  : {"num_eval_iter": 5}
}
```
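If you prefer not to create the file by hand, the preset can be written from Python. The sketch below is an assumption-laden convenience, not part of the repository: `find_presets_dir` and `write_mia_preset` are hypothetical helper names, and the lookup assumes that `syntheval` is installed in the active environment with a `presets` folder inside the package, as described above.

```python
import importlib.util
import json
from pathlib import Path


def find_presets_dir():
    """Return <syntheval package>/presets, or None if syntheval is not installed."""
    spec = importlib.util.find_spec("syntheval")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "presets"


def write_mia_preset(presets_dir):
    """Write mia.json with the content shown above into presets_dir."""
    presets_dir = Path(presets_dir)
    if not presets_dir.is_dir():
        raise NotADirectoryError(f"Presets folder not found: {presets_dir}")

    preset_path = presets_dir / "mia.json"
    content = {"mia": {"num_eval_iter": 5}}
    preset_path.write_text(json.dumps(content, indent=4) + "\n")
    return preset_path
```

A typical call would be `write_mia_preset(find_presets_dir())`, after checking that `find_presets_dir()` did not return `None`.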

Then evaluate as follows:

```
python eval/eval_mle.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
python eval/eval_density.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
python eval/eval_quality.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
python eval/eval_detection.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
python eval/eval_privacy.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
python eval/eval_dcr.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
```

For example:

```
python eval/eval_mle.py --dataname adult --model expotab --path synthetic/adult/expotab.csv
```
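To run every evaluation script in one go, the commands above can be assembled in a loop. This is a minimal sketch: `build_eval_commands` and `run_all_evals` are hypothetical helpers, and they assume the scripts are invoked from the repository root with exactly the `--dataname`, `--model`, and `--path` flags shown above.

```python
import subprocess
import sys

# Evaluation scripts listed in this README, in the same order.
EVAL_SCRIPTS = [
    "eval_mle.py",
    "eval_density.py",
    "eval_quality.py",
    "eval_detection.py",
    "eval_privacy.py",
    "eval_dcr.py",
]


def build_eval_commands(dataname, model, path):
    """Return one argument list per evaluation script."""
    return [
        [sys.executable, f"eval/{script}",
         "--dataname", dataname, "--model", model, "--path", path]
        for script in EVAL_SCRIPTS
    ]


def run_all_evals(dataname, model, path):
    """Run each evaluation in sequence and stop at the first failure."""
    for cmd in build_eval_commands(dataname, model, path):
        subprocess.run(cmd, check=True)
```

For instance, `run_all_evals("adult", "expotab", "synthetic/adult/expotab.csv")` would run all six evaluations for the example above.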
