Repository for reproducing experiments from the paper "XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost".



# Usage
We suggest creating a separate virtual/conda environment for each generator, as this is the most reliable way to avoid dependency conflicts.

1. First create an environment:

```bash
conda create --prefix env python=3.10 -y
conda activate ./env
```

2. Now install the packages required for the evaluation metrics and for your generative model of choice:

```bash
pip install --no-cache-dir -r requirements/evaluation/eval.txt
pip install --no-cache-dir -r requirements/generators/{GENERATOR}.txt
```


3. Now run the benchmark:

```bash
python3 synthesis.py --generator {GENERATOR_INSTANCE} --dataset {DATASET}
```

Results will be written to the results/ directory in CSV format.

4. Finally, deactivate and remove the environment. To run the benchmark for another generator, go back to step 1 and create a fresh virtual/conda environment.

```bash
conda deactivate
rm -rf env
```
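Steps 1 to 4 can be wrapped in a small helper when several generators need to be benchmarked in turn. The following is only a sketch, not part of the repository: the function names `run` and `benchmark_generator` and the `DRY_RUN` switch are hypothetical, and it assumes a Linux/macOS conda layout where the environment's interpreter is at env/bin/python, so that no activation is needed inside a script.

```bash
#!/usr/bin/env bash
# Sketch: one fresh conda environment per generator, following steps 1-4.
set -euo pipefail

# With DRY_RUN=1 the commands are only printed, not executed (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: benchmark_generator GENERATOR GENERATOR_INSTANCE DATASET
benchmark_generator() {
  local generator="$1" instance="$2" dataset="$3"
  # Step 1: create the environment.
  run conda create --prefix env python=3.10 -y
  # Step 2: install with the environment's own interpreter
  # (assumes it lives at env/bin/python, so "conda activate" is not required).
  run ./env/bin/python -m pip install --no-cache-dir -r requirements/evaluation/eval.txt
  run ./env/bin/python -m pip install --no-cache-dir -r "requirements/generators/${generator}.txt"
  # Step 3: run the benchmark.
  run ./env/bin/python synthesis.py --generator "$instance" --dataset "$dataset"
  # Step 4: remove the environment before the next generator.
  run rm -rf env
}
```

For example, `benchmark_generator tabsyn tabsyn small` followed by `benchmark_generator xgenboost xgenboost_ar small` would run two generators back to back, each in its own environment. Setting `DRY_RUN=1` first prints the commands so they can be checked before anything is installed.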


# Parameters

## Generators


For package installation, set {GENERATOR} to one of the following generators:
- smote
- arf
- tabddpm
- tabsyn
- ctgan
- tvae
- unmaskingtrees
- forestdiffusion
- xgenboost


A single generator may have several instances, so to run the benchmark you choose an instance {GENERATOR_INSTANCE} from the [config files](/configs/model). Some examples are:
- xgenboost_diffusion_vddim
- xgenboost_diffusion_xddpm
- xgenboost_ar
- tabsyn
- tabddpm

Similarly, to run the ablations, change the settings in the config files and run the benchmark as usual through synthesis.py.



## Datasets

To run the benchmark on all datasets of the Small Benchmark or of the Big Benchmark, set {DATASET} to, respectively:
- small
- big

Otherwise, to run the benchmark on a single dataset, set {DATASET} to the name of a dataset found in the dataset [config files](/configs/data/), for example:
- adult
- acsincome
- iris
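
Combining the two parameters, several instances of the same generator can be run on one dataset from a single environment. The helper below is a hedged sketch: `run_instances` and the `PYTHON` variable are hypothetical names, and it assumes that all listed instances are covered by the requirements file already installed in the active environment.

```bash
#!/usr/bin/env bash
# Sketch: run several generator instances on one dataset.
set -euo pipefail

# Interpreter to use; override PYTHON to point at a specific environment.
PYTHON="${PYTHON:-python3}"

# Usage: run_instances DATASET INSTANCE [INSTANCE ...]
run_instances() {
  local dataset="$1" instance
  shift
  for instance in "$@"; do
    # Same invocation as step 3 of the usage section, once per instance.
    "$PYTHON" synthesis.py --generator "$instance" --dataset "$dataset"
  done
}
```

For example, `run_instances adult xgenboost_diffusion_vddim xgenboost_diffusion_xddpm xgenboost_ar` would run the three XGenBoost instances listed above on the adult dataset.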


### Load datasets
For the Small Benchmark, we provide a script to download all datasets from the web and build the configuration file:
```bash
python3 small_data_downloader.py
```

For the Big Benchmark, datasets can be downloaded manually using the links from the article. All of them are also already included in the repository.


# Formatting Results

Run the following script to print LaTeX tables as found in the article:

```bash
python3 format_results.py --filepath {RESULT_PATH} --table_type {TABLE_TYPE}
```

Here, {RESULT_PATH} is the path where the results are stored after running the benchmarks, typically results/small or results/big.

{TABLE_TYPE} is the type of table you wish to print:
- rank: the average rank scores over all datasets and metrics. Main results in the article.
- metric: per-metric average scores. Raw results in the article.
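
To keep the tables for later use, the printed output can be redirected to files. This sketch assumes that format_results.py writes only the LaTeX table to standard output and that results are stored under results/small and results/big; the function name `export_tables`, the `PYTHON` variable, and the output file names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: save every benchmark/table-type combination to its own .tex file.
set -euo pipefail

# Interpreter to use; override PYTHON to point at a specific environment.
PYTHON="${PYTHON:-python3}"

# Usage: export_tables OUT_DIR
# Writes OUT_DIR/small_rank.tex, OUT_DIR/small_metric.tex, and so on.
export_tables() {
  local out_dir="$1" benchmark table_type
  mkdir -p "$out_dir"
  for benchmark in small big; do
    for table_type in rank metric; do
      "$PYTHON" format_results.py --filepath "results/${benchmark}" --table_type "$table_type" \
        > "${out_dir}/${benchmark}_${table_type}.tex"
    done
  done
}
```

Calling `export_tables tables` would then produce four files in a tables/ directory, one per combination.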