# Codebase for Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments

To generate patient variables using the adapted ADS-GAN model, run `python3 main_adsgan.py`.

To generate patient outcomes for the synthetic data, run `analysis.py` with the `TRAIN_OUTCOME_ESTIMATOR` switch turned on.

### Code explanation

(1) data_loader.py
- Load original data

(2) preprocess.py
- Preprocess original data

(3) adsgan.py
- Generate synthetic data using the original data

(4) compute_wd.py
- Compute Wasserstein distance between original data and synthetic data

(5) compute_identifiability.py
- Compute the identifiability of the original data from the synthetic data

(6) main_adsgan.py
- Run the adapted ADS-GAN and report its performance in terms of Wasserstein distance, identifiability, and distribution.

(7) OutcomeEstimator.py
- An estimator that learns a mapping from patient variables to outcomes, determines the causal effects during learning, and generates patient outcomes using the learned mapping and causal effects.

(8) eval_models.py
- Evaluate causal models using synthetic data

(9) utils.py
- Helper functions
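
The two evaluation metrics named above can be illustrated with a minimal, standard-library-only sketch. It is not the code in `compute_wd.py` or `compute_identifiability.py`: the function names are hypothetical, the Wasserstein distance is the simple one-dimensional empirical version for equal-size samples, and identifiability is assumed here to mean the fraction of original records whose nearest synthetic record is closer than their nearest other original record, measured with unweighted Euclidean distance.

```python
import math


def wasserstein_1d(real, synth):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples."""
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    # For equal-size 1-D samples, the optimal transport plan pairs sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(real), sorted(synth))) / len(real)


def identifiability(original, synthetic):
    """Fraction of original records that sit closer to some synthetic record
    than to any other original record (assumed definition, unweighted)."""
    hits = 0
    for i, x in enumerate(original):
        nearest_orig = min(
            math.dist(x, o) for j, o in enumerate(original) if j != i
        )
        nearest_synth = min(math.dist(x, s) for s in synthetic)
        if nearest_synth < nearest_orig:
            hits += 1
    return hits / len(original)


# Toy records with two features each (made-up values for illustration only).
original = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
leaky = [(0.01, 0.0), (1.0, 0.01), (0.0, 0.99)]  # near-copies of the originals
safe = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]      # far from every original

print("identifiability (leaky):", identifiability(original, leaky))
print("identifiability (safe): ", identifiability(original, safe))
print("1-D Wasserstein:", wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

A lower identifiability score means fewer original records can be singled out from the synthetic data, while a lower Wasserstein distance means the synthetic distribution is closer to the original one.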

### Command inputs:

-   iterations: Number of experiment iterations
-   lamda: Hyper-parameter that controls the trade-off between the identifiability and the quality of the synthetic data
-   h_dim: Number of hidden state dimensions
-   z_dim: Number of random state dimensions
-   mb_size: Number of mini-batch samples

Note that hyper-parameters should be optimized for different datasets.
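
As a rough illustration of how these inputs could be parsed, here is a minimal `argparse` sketch. It is an assumption about the command-line interface, not a copy of `main_adsgan.py`: the `build_parser` helper is hypothetical, and the defaults are placeholders taken from the example command in this README rather than tuned values.

```python
import argparse


def build_parser():
    """Build a parser for the five inputs listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Adapted ADS-GAN (illustrative CLI sketch)")
    # Defaults are placeholders copied from the README's example command.
    parser.add_argument("--iterations", type=int, default=10000,
                        help="number of experiment iterations")
    parser.add_argument("--lamda", type=float, default=0.1,
                        help="identifiability vs. quality trade-off")
    parser.add_argument("--h_dim", type=int, default=30,
                        help="number of hidden state dimensions")
    parser.add_argument("--z_dim", type=int, default=10,
                        help="number of random state dimensions")
    parser.add_argument("--mb_size", type=int, default=128,
                        help="number of mini-batch samples")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--lamda", "0.5", "--mb_size", "64"])
print(vars(args))
```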

### Example command

```shell
$ python3 main_adsgan.py --iterations 10000 --lamda 0.1 --h_dim 30 \
    --z_dim 10 --mb_size 128
```

## Motivation
Model validation is important for any research. In data science, a large body of work focuses on testing and validating machine learning models. In causal inference, however, model validation is difficult because modern causal inference is based on the potential outcome framework, in which the ground truth of causal effects is unknown. Researchers in this field need data that is as realistic as possible and comes with ground truth in order to evaluate and validate their models.

## Existing work
* Real datasets: IHDP, LIBIDD, etc.
* Toy datasets with a few variables and a simple data generation process.
* Work on generating electronic health records (EHRs).
* Work on generating patient outcomes (learning a mapping from real data and using it to generate outcomes).

## Our aim
We aim to create a synthetic dataset for a specific disease based on an insurance company's patient claim data (tabular data), with the following three goals: (1) make the dataset as realistic as possible; (2) generate treatment outcomes so that the causal effects are known; and (3) carry no information about any individual patient, so that the data can be made publicly available in line with the insurance company's data privacy policy.

## Our approach
* STEP 1: Generate patient variables using a GAN.
* STEP 2: Fit a neural network to predict outcomes.
* STEP 3: Calculate outcomes for the generated covariates using the trained network.
* STEP 4: Use the synthetic data to test some established causal inference algorithms (propensity score matching, doubly robust estimators, etc.).
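
The reason generated outcomes give a usable ground truth can be shown with a toy sketch of the steps above. Everything in it is a stand-in and an assumption: random Gaussian covariates replace the GAN output, a fixed linear function with hypothetical per-treatment effects (`TRUE_EFFECT`) replaces the trained network, and treatments are assigned at random so that a plain difference in means can play the role of the estimator under test.

```python
import random

# Hypothetical ground-truth shift of each treatment arm relative to arm 0.
TRUE_EFFECT = {0: 0.0, 1: 1.5, 2: -0.7}


def outcome_model(x, t):
    """Stand-in for the trained outcome network (STEP 2): a fixed linear
    baseline in the covariates plus a known treatment effect."""
    return 2.0 * x[0] - 1.0 * x[1] + TRUE_EFFECT[t]


def simulate(n, rng):
    """Return n rows of (covariates, treatment, outcome)."""
    rows = []
    for _ in range(n):
        x = (rng.gauss(0, 1), rng.gauss(0, 1))       # stand-in for GAN output (STEP 1)
        t = rng.choice([0, 1, 2])                    # randomly assigned treatment
        y = outcome_model(x, t) + rng.gauss(0, 0.5)  # generated outcome (STEP 3)
        rows.append((x, t, y))
    return rows


def diff_in_means(rows, treated, control=0):
    """Naive effect estimate: mean outcome of one arm minus the control arm."""
    def mean_outcome(arm):
        ys = [y for _, t, y in rows if t == arm]
        return sum(ys) / len(ys)
    return mean_outcome(treated) - mean_outcome(control)


rows = simulate(5000, random.Random(0))
for arm in (1, 2):
    truth = TRUE_EFFECT[arm] - TRUE_EFFECT[0]
    estimate = diff_in_means(rows, arm)              # scored against truth (STEP 4)
    print(f"arm {arm}: true effect {truth:+.2f}, estimated {estimate:+.2f}")
```

Because treatment is randomized in this sketch, the difference in means is enough. When treatment assignment depends on the covariates, as in observational claim data, the methods named in STEP 4 are the ones that would be evaluated against the known effects.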