# CTTVAE: Generating Synthetic Tabular Data with Transformer-based VAEs

This is the official code for our paper "Latent Space Structuring for Conditional Tabular Data Generation on Imbalanced Datasets" ([paper]()).

## Folder Structure
- `src/`: Contains the main source code for the project.
- `conf/`: Configuration files, including `datagen.yaml`.

## Versions Used
- Python = 3.11.11
- Conda = 23.3.1
- Mamba = 1.4.2
- CUDA = 12.4

## Installation

You need `conda` installed beforehand; `mamba` is recommended but optional.

### 1. Create the environment
```bash
mamba env create -f environment.yml
```
Or use `conda` if `mamba` isn't available:
```bash
conda env create -f environment.yml
```

### 2. Activate the environment
```bash
conda activate syngen
```

### 3. Update the environment
To update the environment with the latest dependencies:
```bash
make update
```

### 4. Lock the environment
To generate a reproducible lockfile:
```bash
make lockfile
```

## Running Experiments

The default seed value is 42. To use a different value, change the `seed` parameter in `datagen.yaml`; leave it empty to run without a fixed seed.
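As an illustration of what the `seed` setting controls, here is a minimal sketch. It assumes that an empty `seed` value reaches the code as `None`; the helper name `apply_seed` is hypothetical and not part of this repository, and only Python's built-in RNG is seeded to keep the example self-contained.

```python
import random
from typing import Optional


def apply_seed(seed: Optional[int]) -> None:
    """Seed Python's built-in RNG, or leave it unseeded when seed is None."""
    if seed is None:
        # Empty `seed` in the config: keep nondeterministic behavior.
        return
    random.seed(seed)
    # A full training pipeline would typically also seed NumPy and PyTorch here.


apply_seed(42)
first = [random.random() for _ in range(3)]
apply_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # the same seed reproduces the same sequence
```

With a fixed seed, two runs draw identical random numbers, which is what makes results comparable across runs.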

### 1. Using a local MLflow tracking server
Below is an opinionated, step-by-step recipe for running MLflow entirely on your workstation so that every experiment is logged, searchable, and reproducible.

#### 1. Enable MLflow tracking in `datagen.yaml`
To enable experiment tracking with MLflow, set the `mlflow` parameter to `True` in your `datagen.yaml` configuration file:
```yaml
mlflow: True
```

#### 2. Start a dedicated tracking server
Run this once in a separate terminal:
```bash
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./artifacts \
  --host 0.0.0.0 --port 5000
```
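Before launching an experiment, it can help to confirm that the tracking server is reachable. The sketch below uses only the standard library and assumes the server exposes a `/health` endpoint on the port chosen above; the function name `tracking_server_is_up` is hypothetical and not part of this repository or of MLflow.

```python
import urllib.error
import urllib.request


def tracking_server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on <base_url>/health answers with status 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    # Port 5000 matches the `mlflow server` command above.
    print(tracking_server_is_up("http://127.0.0.1:5000"))
```

If the check returns `False`, make sure the server terminal is still running and that the port is not already in use.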

### 2. Run one method with one dataset
To train and evaluate one model on one dataset:
1. Edit the `datagen.yaml` file located in the `src/syngen/conf` folder as instructed in that file.
2. Ensure the `run_training` and `run_eval` parameters are set to `True`.
3. The default hyperparameters are tuned for the default dataset, `vehicle_insurance_claim`. The commented-out values are for `default_cc`.
4. Run the following command from the directory that contains the `src` folder:
```bash
python -m src.syngen.main
```

### 3. Run multiple methods and datasets
To run benchmarks across multiple models and datasets with Hydra, use the following command:
```bash
python -m src.syngen.main --multirun datagen_method=cttvae,cttvae_tbs dataset=vehicle_insurance_claim,default_cc launcher=slurm_gpu
```
- Replace the values of `datagen_method`, `dataset`, and `launcher` with the ones you need.
- The `launcher` parameter is optional and can be configured for GPU execution (e.g., Slurm). Its default value is `null`.
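Hydra's `--multirun` launches one job per combination of the comma-separated values, so the command above produces four runs. The following sketch only enumerates those combinations to show what to expect; it does not call Hydra, and the job ordering shown is an assumption.

```python
from itertools import product

# Sweep values taken from the --multirun command above.
sweep = {
    "datagen_method": ["cttvae", "cttvae_tbs"],
    "dataset": ["vehicle_insurance_claim", "default_cc"],
}

# One job per element of the Cartesian product of all swept parameters.
jobs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]

for index, job in enumerate(jobs):
    print(index, job)
```

Adding a third method or dataset multiplies the number of jobs accordingly, which is worth checking before submitting to a shared cluster.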

### 4. Results
Results, models, and data are stored in directories specified in the `datagen.yaml` file. Below are the different paths to define:

- **Raw Data**: Define `paths.raw_data_dir` for the path where your raw data is stored.
- **Clean Data**: Define `paths.clean_data_dir` for the path where clean data (data without duplicates or missing values) is stored.
- **Processed Data**: Define `paths.processed_data_dir` for the path where train and test subsets are stored.
- **Synthetic Data**: Define `paths.synth_data_dir` for the path where synthetic data generated by the methods is stored.
- **Trained Models and Metrics**: Define `paths.datagen_methods_dir` for the path where trained models and training metrics (e.g., loss plots and values) are stored.
- **MLE Models**: Define `paths.MLE_models_dir` for the path where trained models for the Machine Learning Efficacy (MLE) score are stored.
- **Evaluation Results**: Define `paths.results_dir` for the path where evaluation results (e.g., metrics, logs) are stored.
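The sketch below shows one way to prepare these directories before a first run. The key names come from the list above, but the folder names derived from them and the helper `create_dirs` are hypothetical; adapt them to the values you set in `datagen.yaml`.

```python
import tempfile
from pathlib import Path

# Keys under `paths` as listed above.
PATH_KEYS = [
    "raw_data_dir",
    "clean_data_dir",
    "processed_data_dir",
    "synth_data_dir",
    "datagen_methods_dir",
    "MLE_models_dir",
    "results_dir",
]


def create_dirs(root: Path) -> dict:
    """Create one folder per key under root and return the key -> path mapping."""
    paths = {}
    for key in PATH_KEYS:
        # Hypothetical layout: folder name is the key without the "_dir" suffix.
        folder = root / key.removesuffix("_dir")
        folder.mkdir(parents=True, exist_ok=True)
        paths[key] = folder
    return paths


# Demonstrated in a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    created = create_dirs(Path(tmp))
    all_exist = all(path.is_dir() for path in created.values())

print(all_exist)
```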


## Dataset Licenses

This project includes the following publicly available datasets:

- [**Default of Credit Card Clients**](https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients): Available from the UCI Machine Learning Repository. Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
- [**Vehicle Insurance Claim**](https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection): Available from Kaggle. Licensed under a CC0: Public Domain license.

These datasets are included solely for research and educational purposes. Please refer to the original sources for full licensing terms.



## Authors and Acknowledgment

Anonymous for now.

## License

This project is licensed under [LICENSE_NAME]. Please see the `LICENSE` file for more details.

## Project Status

Actively developed and tested. Paper submitted to AAAI 2026.

## Citing the Paper
If you use this code in your research, please cite our paper: