[![PyPI version](https://badge.fury.io/py/disjoint-generation.svg)](https://badge.fury.io/py/disjoint-generation)
<!-- [![Doctests](https://github.com/notna07/disjoint-synthetic-data-generation/actions/workflows/doctests.yml/badge.svg)](https://github.com/notna07/disjoint-synthetic-data-generation/actions/workflows/doctests.yml) -->

# Disjoint Generative Models 

Disjoint Generative Models (DGMs) is a framework for generating synthetic data by distributing the generation of different attributes to different generative models. DGMs unlock mixed-model generation, allowing the user to choose the "correct tool for the correct job", and confer increased privacy because no single model has access to all the data.

The library provides a simple API for generating synthetic data using a variety of generative models and joining strategies. It supports several generative model backends, namely [SynthCity](https://github.com/vanderschaarlab/synthcity), [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer), [TabDiff](https://github.com/MinkaiXu/TabDiff), and [Synthpop](https://www.synthpop.org.uk/get-started.html), and additional backends can be added in the adapters module. Similarly, several strategies are available for joining the generated data, and more can be added in the joining strategies module.
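
The idea of disjoint generation with random joining can be illustrated with a minimal, standard-library-only sketch. This is not the library's API: `split_columns`, `toy_generator`, and `random_join` are hypothetical names, the "generator" is a simple bootstrap resampler standing in for a real generative model, and the records are made-up toy data.

```python
import random


def split_columns(rows, partitions):
    """Project every row onto each partition's attributes (one table per partition)."""
    return [[{col: row[col] for col in cols} for row in rows] for cols in partitions]


def toy_generator(partition_rows, n, rng):
    """Stand-in for a generative model: resample the partition's rows with replacement."""
    return [dict(rng.choice(partition_rows)) for _ in range(n)]


def random_join(parts, rng):
    """Shuffle each generated partition independently, then concatenate column-wise."""
    shuffled = []
    for part in parts:
        part = list(part)
        rng.shuffle(part)
        shuffled.append(part)
    joined = []
    for pieces in zip(*shuffled):
        record = {}
        for piece in pieces:
            record.update(piece)
        joined.append(record)
    return joined


rng = random.Random(0)

# Made-up toy records, for illustration only.
real = [
    {"age": 34, "income": 52000, "city": "A"},
    {"age": 51, "income": 61000, "city": "B"},
    {"age": 27, "income": 39000, "city": "C"},
]

# Each partition is handled by its own (toy) model, which never sees the other columns.
partitions = [["age", "income"], ["city"]]
generated = [toy_generator(part, n=5, rng=rng) for part in split_columns(real, partitions)]
synthetic = random_join(generated, rng)

for record in synthetic:
    print(record)
```

In this sketch each toy model only ever sees its own columns, which is the property the framework relies on; a real setup would swap `toy_generator` for one of the supported backends and could replace the random join with another joining strategy.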

## Installation
We recommend cloning and working with the repository directly because the dependencies can be challenging to resolve, but the library is also available on PyPI.
To install it from PyPI, run the following command:

 ```bash
 pip install disjoint-generation
 ```

One of the generative model backends, Synthpop, requires a working R installation on the system. The library calls R by running an `Rscript` command through `subprocess`, so make sure that the `Rscript` command works in your terminal.
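
A quick way to confirm that `Rscript` is reachable from Python is sketched below. The helper `rscript_available` is a hypothetical name and is not part of the library; it only checks that the executable is on `PATH` and that `Rscript --version` exits cleanly.

```python
import shutil
import subprocess


def rscript_available(command="Rscript"):
    """Return True if `command` is on PATH and `command --version` exits with code 0."""
    if shutil.which(command) is None:
        return False
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if rscript_available():
    print("Rscript found; the synthpop backend should be able to call R.")
else:
    print("Rscript not found; install R or add it to PATH before using synthpop.")
```

This does not verify that the `synthpop` R package itself is installed, only that R can be invoked.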

## Tutorial and Codebooks
 
Below are the codebooks that can be used to replicate the results shown in the paper.

| Link | Description | Fig. refs. |
| --- | --- | --- |
| [Tutorial](00_tutorial.ipynb) | A simple tutorial on how to use the library | NA |
| [Codebook 1](01_same_model_partitions.ipynb) | Introductory experiments, random joining, increasing number of partitions | Fig. 3, 10 |
| [Codebook 2](02_validated_joins.ipynb) | High-dimensional dataset example with validation, correlated partitions study | Fig. 4, 5, 6, 13 |
| [Codebook 3](03_specified_splits.ipynb) | Mixed-model generation and combinatorics | Fig. 7, 8, Tab. 2, 3 |
| [Codebook 4](04_joining_validator.ipynb) | Study of the joining validator model, optimisation and calibration | Fig. 11, 12, 14 |

Additional examples of how to use the library can be found in the documentation in the source code folder.

## Requirements
The codebase requires Python ~3.10 (we use version 3.10.11) and the following packages:
- numpy ~= 1.26
- pandas ~= 2.2.3
- scipy ~= 1.12
- scikit-learn ~= 1.5
- synthcity >= 0.2.11
- DataSynthesizer ~= 0.1.13
- pyod >= 2.0

Because Synthcity sometimes causes compatibility issues, we also provide an [environment.txt](tests/environment.txt) file containing a `pip freeze` of a working installation. Download it and install it with `pip install -r environment.txt`; it should work on Python ~3.10.11.
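
When debugging a dependency conflict, it can help to see which of the packages listed above are installed and at what version. The following is a small sketch using only the standard library; `installed_versions` is a hypothetical helper, and it reports versions without checking them against the specifiers above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the requirements list above.
PACKAGES = [
    "numpy",
    "pandas",
    "scipy",
    "scikit-learn",
    "synthcity",
    "DataSynthesizer",
    "pyod",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```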

Additionally, the Synthpop generative model is accessed through R (we used version 4.1.2) and requires the following R packages:
- synthpop ~= 1.8.0

## Citing
If you use our library in your work, you can reference us by citing our paper:
```bibtex
@misc{Lautrup2025,
      title={Disjoint Generative Models}, 
      author={Anton Danholt Lautrup and Muhammad Rajabinasab and Tobias Hyrup and Arthur Zimek and Peter Schneider-Kamp},
      year={2025},
      eprint={2507.19700},
      archivePrefix={arXiv},
}
```
