# TabGenDDPM

PyTorch implementation of TabGenDDPM:

*Diffusion Models for Tabular Data Imputation and Synthetic Data Generation*

https://openreview.net/pdf?id=wiYV0KDAE6

## DATA SETUP

The current version of the repository includes the ETL steps and the best TabGenDDPM parameters for two datasets:

* **Churn**: https://www.kaggle.com/datasets/shrutimechlearn/churn-modelling 
* **California Housing**: https://www.kaggle.com/datasets/camnugent/california-housing-prices?select=housing.csv

The data are organized as follows:

```
./data
│
└───CALIFORNIA_HOUSING
│   │   data.csv
│   
└───CHURN
    │   processed.csv
```
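As a quick sanity check before running anything, the layout above can be verified with a few lines of Python. This is a minimal sketch and not part of the repository: the helper name `missing_data_files` is hypothetical, and it only assumes that the data root is `./data` and that the two files sit where the tree shows them.

```python
from pathlib import Path

# Expected files relative to the data root, taken from the tree above.
EXPECTED_FILES = [
    Path("CALIFORNIA_HOUSING") / "data.csv",
    Path("CHURN") / "processed.csv",
]


def missing_data_files(data_root="./data"):
    """Return the expected dataset files that are absent under data_root.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    root = Path(data_root)
    return [rel.as_posix() for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

Run it from the repository root so that `./data` resolves to the folder shown above.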


## EXECUTION SETUP

Config files used for Churn and California experiments:

```
./conf/tabular_diffusion
│   diffusion_california.json
│   diffusion_churn.json
```

Before running any experiment, the following fields have to be modified:
* **current_model_name** (line 2): the name used to save the model checkpoints
* **checkpoint_path** (line 7): the path where the checkpoints are saved
* **device** (line 26): "cuda" or "cpu"
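The three fields can be edited by hand, or set programmatically as in the sketch below. It is an illustration only: the nesting of the keys inside the JSON files is not documented here, so the hypothetical helper `set_field` searches the whole structure for each key name, and the values passed in the usage example are placeholders.

```python
import json
from pathlib import Path


def set_field(node, key, value):
    """Set every occurrence of `key` in a nested JSON structure; return the count."""
    count = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                count += 1
            else:
                count += set_field(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            count += set_field(item, key, value)
    return count


def update_config(path, model_name, checkpoint_path, device):
    """Rewrite the three fields listed above in one config file.

    Hypothetical helper; assumes each key name appears somewhere in the JSON.
    """
    if device not in ("cuda", "cpu"):
        raise ValueError("device must be 'cuda' or 'cpu'")
    path = Path(path)
    config = json.loads(path.read_text())
    updates = {
        "current_model_name": model_name,
        "checkpoint_path": checkpoint_path,
        "device": device,
    }
    for key, value in updates.items():
        if set_field(config, key, value) == 0:
            raise KeyError(f"{key!r} not found in {path}")
    # Note: re-serializing changes the file's formatting and line numbers.
    path.write_text(json.dumps(config, indent=4))
    return config
```

For example, `update_config("./conf/tabular_diffusion/diffusion_churn.json", "my_churn_model", "./checkpoints", "cpu")` would set all three fields, where the model name and checkpoint directory are placeholder values.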

## TRAINING TabGenDDPM

```console
> python main.py -d <dataset_name> -td
```

```console
> python main.py -d churn -td
```

## ML UTILITY TEST

```console
> python main.py -d <dataset_name> -sv
```

```console
> python main.py -d churn -sv
```
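The training and ML utility commands can also be chained from Python. The sketch below only wraps the two command lines documented in this README; the function names are hypothetical, and it assumes the script is launched from the repository root, where `main.py` lives.

```python
import subprocess
import sys

# Flags documented in this README: -td (training) and -sv (ML utility test).
KNOWN_FLAGS = ("-td", "-sv")


def build_command(dataset, flag):
    """Build the argument list for one documented main.py invocation."""
    if flag not in KNOWN_FLAGS:
        raise ValueError(f"unsupported flag: {flag}")
    return [sys.executable, "main.py", "-d", dataset, flag]


def train_then_evaluate(dataset):
    """Run training, then the ML utility test, stopping at the first failure."""
    for flag in KNOWN_FLAGS:
        subprocess.run(build_command(dataset, flag), check=True)
```

Calling `train_then_evaluate("churn")` is then equivalent to running the two `churn` commands shown above one after the other.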

## TO-DO LIST (next version)
* How to work with a custom dataset
* How to customize the TabGenDDPM hyperparameters
* API to generate and save a fully simulated dataset using a trained TabGenDDPM

## CITATION

```bibtex
@inproceedings{
anonymous2023diffusion,
title={Diffusion Models for Tabular Data Imputation and Synthetic Data Generation},
author={Anonymous},
booktitle={Submitted to The Twelfth International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=wiYV0KDAE6},
note={under review}
}
```