# Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing

This project develops and evaluates generative models for Electronic Health Records (EHRs), with a focus on multi-table time series data. This README describes the dataset structure, data preprocessing, model training, and the evaluation pipeline.


## Setup Instructions

1. **Create and activate a conda environment**:
   ```bash
   conda create -n ehrsyn python=3.9
   conda activate ehrsyn
   ```
2. **Install required packages**:
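   Install the dependencies with pip. The PyTorch command targets the CUDA 11.8 wheels (`cu118`); change the index URL if your CUDA version differs.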
   ```bash
   pip install sacred==0.8.5
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   pip install tqdm
   pip install pandas
   pip install scikit-learn
   pip install transformers
   pip install wandb
   pip install einops
   pip install dython
   ```
    
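To confirm that the environment is complete before moving on, a small check along these lines can help. It is only a sketch: the mapping from pip names to import names is written out by hand (note that `scikit-learn` is imported as `sklearn`), and the helper `missing_packages` is a hypothetical name, not part of this repository.

```python
import importlib.util

# pip distribution name -> top-level import name (hand-written for this sketch)
REQUIRED = {
    "sacred": "sacred",
    "torch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "tqdm": "tqdm",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "transformers": "transformers",
    "wandb": "wandb",
    "einops": "einops",
    "dython": "dython",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be located."""
    return sorted(
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    )


# An empty list means every dependency listed above can be imported.
print("Missing packages:", missing_packages())
```

Run it inside the activated `ehrsyn` environment; any name it prints can be installed with the matching `pip install` command above.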
## Data Preprocessing

To preprocess the EHR dataset, follow these steps:

1. **Set up the preprocessing pipeline**:
   - Refer to the `README.md` in the [integrated-ehr-pipeline repository](https://anonymous.4open.science/r/integrated-ehr-pipeline-D826) for initial setup instructions.
   
2. **Run the preprocessing script**:
   ```bash
   python ehrysn/datamodules/preprocess.py
   ```

## Dataset Structure

The dataset is organized into the following files:

- `${ehr}_hi_input.py`: Main input data file.
- `${ehr}_hi_type.py`: Type information for the input data.
- `${ehr}_hi_dpe.py`: Digit embedding information for the input data.
- `${ehr}_hi_num_time.py`: Numerical time data (e.g., `[140, 140, 720]`).
- `${ehr}_hi_time.py`: Time data arrays (e.g., `[[1,4], [1,4], [7,2]]`).
- `${ehr}_hi_input_reduced.py`: Vocabulary-reduced version of the input data (uses `${ehr}_word2id.pkl`).
- `${ehr}_split.csv`: File containing train/validation/test splits.
- `${ehr}_word2id.pkl`: Vocabulary mapping file for the reduced vocabulary version.

**Note**: `${ehr}_hi_input_reduced.py` stores ids from the reduced vocabulary, so always use `${ehr}_word2id.pkl` to map between these ids and the original vocabulary.
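As a rough illustration of how a vocabulary mapping of this kind can be used, the sketch below pickles a toy mapping, loads it back, and converts reduced ids to original ids. Everything in it is an assumption made for illustration: the toy values, the temporary file name `toy_word2id.pkl`, the helper names, and the idea that the mapping is a plain `dict` from original token id to reduced id. The actual layout of `${ehr}_word2id.pkl` may differ.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical toy vocabulary: original token id -> reduced id.
# The real mapping lives in `${ehr}_word2id.pkl`; its layout is assumed here.
word2id = {0: 0, 101: 1, 2054: 2, 7592: 3}


def save_vocab(mapping, path):
    """Write a mapping to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(mapping, f)


def load_vocab(path):
    """Read a pickled mapping back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def invert(mapping):
    """Build the reduced id -> original id lookup, rejecting collisions."""
    inverse = {v: k for k, v in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping is not one-to-one")
    return inverse


def decode(reduced_ids, id2word):
    """Map a sequence of reduced ids back to original ids."""
    return [id2word[i] for i in reduced_ids]


# Round-trip the toy mapping through a temporary pickle file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_word2id.pkl"
    save_vocab(word2id, path)
    loaded = load_vocab(path)

id2word = invert(loaded)
reduced_event = [1, 2, 3, 0]
original_event = decode(reduced_event, id2word)
print(original_event)
```

The one-to-one check in `invert` is a defensive choice: if two original ids shared a reduced id, decoding would silently pick one of them.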

## Training VQ-VAE

Training runs in two stages: first the quantized autoencoder that compresses individual events, then the autoregressive model that captures temporal structure across events.

1. Train the RQ-VAE for event compression:  
   ```bash
   bash run/train_VQVAE_indep.sh
   ```

2. Train the autoregressive (AR) model for inter-event temporal modeling:
   ```bash
   bash run/train_AR.sh
   ```
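If you prefer to launch both stages from a single entry point, a thin wrapper such as the following could work. It is a sketch, not part of the repository: `run_stages` is a hypothetical helper, and it assumes that the scripts are run from the repository root and that the second stage should not start if the first one fails.

```python
import subprocess

# Script paths as listed in the steps above, in the order they are run.
STAGES = ["run/train_VQVAE_indep.sh", "run/train_AR.sh"]


def run_stages(scripts=STAGES, runner=subprocess.run):
    """Run each script with bash, stopping at the first failure.

    `runner` is injectable so the ordering logic can be exercised
    without launching real training jobs.
    """
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later stages are skipped after a failure.
        runner(["bash", script], check=True)
        completed.append(script)
    return completed


# From the repository root, start training with:
# run_stages()
```

Passing `check=True` means a failed first stage raises an exception instead of letting the autoregressive stage start without a trained compression model.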