# Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness

This repository implements [Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness](https://openreview.net/forum?id=AjZCiKuWQ9n) and contains pre-release code for review purposes only. The method is called "_CoMP_" in the paper, but its name has changed over time, so it appears as "_CVaMP_", `ccvae` or `ContrastiveCVAE` in the source. The model definition can be found in `ccvae/pl/ccvae.py`.

## Requirements

### System requirements

The following environment was used for the experiments, based on the `nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04` docker image from [dockerhub](https://hub.docker.com/layers/nvidia/cuda/11.3.0-cudnn8-devel-ubuntu20.04/images/sha256-18beedacea7525d333a8c9e7419d42f26e6e84f8461fbf421da6c40243700d3a?context=explore):
* Ubuntu 20.04
* CUDA 11.0.3, cudnn 8
* Python 3.8

GPU and CPU Dockerfiles are included in the `docker` directory. To build the GPU image run:
```sh
docker build . -f docker/Dockerfile-gpu -t comp:latest
```

### Python environment
Running in a virtualenv or container is recommended. To install requirements:

```sh
pip install -U pip wheel
pip install -r requirements.txt
```


## Datasets

Datasets need to be downloaded from the following sources. The `<data-dir>` should be different for each dataset to keep their files separate.

### Tumour / Cell Line Dataset
There are four files to download and prepare, as follows:
1. **Metadata file**: Download `Celligner_info.csv` from [here](https://figshare.com/articles/dataset/Celligner_data/11965269). Save the file as `<data-dir>/Celligner_info.csv`.
2. **HGNC gene names**: Download `hgnc_complete_set_7.24.2018.txt` from [here](https://figshare.com/articles/dataset/Celligner_data/11965269). Save the file as `<data-dir>/hgnc_complete_set_7.24.2018.txt`.
3. **Tumour gene expression data**: Download the TPM expression values from [here](https://treehousegenomics.soe.ucsc.edu/public-data/previous-compendia.html#tumor_v10_polyA). Rename this file to `<data-dir>/TCGA_mat.tsv`.
4. **Cell line gene expression data**: Download `CCLE_expression_full.csv`, part of DepMap Public 19Q4, from [here](https://figshare.com/articles/dataset/DepMap_19Q4_Public/11384241). Rename this file to `<data-dir>/CCLE.csv`.
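
Before training, it can help to confirm that the four files are in place under the names given above. The following is a minimal sketch and is not part of the repository; the helper name `missing_celligner_files` is hypothetical and used only for illustration.

```python
from pathlib import Path

# File names as listed in the download steps above.
EXPECTED_FILES = (
    "Celligner_info.csv",
    "hgnc_complete_set_7.24.2018.txt",
    "TCGA_mat.tsv",
    "CCLE.csv",
)


def missing_celligner_files(data_dir):
    """Return the expected file names that are absent from data_dir.

    Hypothetical helper: it only checks that the files exist,
    not that their contents are valid.
    """
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_celligner_files("<data-dir>")` returns an empty list when all four files are present.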

### Single cell PBMCs

Data was downloaded from the [theislab/trVAE_reproducibility](https://github.com/theislab/trVAE_reproducibility) repository. Download `kang_count.h5ad` from the Google Drive link in the _Getting Started_ section of the README to `<data-dir>/kang_count.h5ad`.

### UCI Adult Income

Data was downloaded from the UCI Machine Learning [Repository](https://archive.ics.uci.edu/ml/datasets/adult). Download all `adult.{data,names,test}` files from the [data directory](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/) to a new `<data-dir>` and run the following pre-processing script on the directory.

```sh
python3 ./process_uci_income.py --data-dir <data-dir>
```

## Training and evaluation

Evaluation metrics are computed at the end of the training loop for the best 3 checkpoints (by validation loss), as well as the last checkpoint, and saved in `<output-dir>/metrics.csv`.
In all the commands below, `<data-dir>` should be the path to the directory containing the files listed in the previous section, and `<output-dir>` should be the directory in which to save the outputs. It will be created if it does not exist.
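
Once a run has finished, the saved metrics can be inspected with the standard library. This is a minimal sketch that assumes `metrics.csv` starts with a header row; no column names are assumed, and the function name `load_metrics` is hypothetical.

```python
import csv
from pathlib import Path


def load_metrics(output_dir):
    """Read <output-dir>/metrics.csv into a list of dicts, one per row.

    Keys come from the file's header row (assumed to be present);
    all values are returned as strings.
    """
    with open(Path(output_dir) / "metrics.csv", newline="") as handle:
        return list(csv.DictReader(handle))
```

For example, `for row in load_metrics("<output-dir>"): print(row)` prints one dictionary per evaluated checkpoint.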

To run the code using only the CPU, omit the `--use-cuda` flag in the arguments to the run script.
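
Because the run commands are long, it may be convenient to assemble them programmatically, for instance when switching between GPU and CPU runs. The sketch below is an illustration only: `build_command` is a hypothetical helper, and it assumes each keyword option maps to a `run.py` flag by replacing underscores with hyphens, as in the flags shown in this README.

```python
def build_command(data_dir, output_dir, use_cuda=True, **options):
    """Assemble an argument list for run.py (hypothetical helper).

    Keyword options become flags with underscores replaced by hyphens,
    e.g. latent_dim=16 becomes "--latent-dim 16". The --use-cuda flag
    is appended only when use_cuda is True.
    """
    cmd = ["python3", "run.py",
           "--data-dir", str(data_dir),
           "--output-dir", str(output_dir)]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    if use_cuda:
        cmd.append("--use-cuda")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing `use_cuda=False` produces a CPU-only command.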

### Tumour / Cell Line Dataset

#### CoMP
```sh
python3 run.py --data-dir <data-dir> --output-dir <output-dir> --dataset celligner --model contrastive_cvae --likelihood gaussian --hidden-dim 512 --latent-dim 16 --num-layers 3 --use-batchnorm 1 --batch-size 5500 --num-epochs 4000 --learning-rate 0.0001 --penalty-scale 0.5 --kl-beta 1e-07 --learn-sigma fix_all --seed 80244971 --batching-mode uniform --top-var-number 8000 --use-cuda
```

### Single cell PBMCs

#### CoMP
```sh
python3 run.py --data-dir <data-dir> --output-dir <output-dir> --dataset kang-trvae --model contrastive_cvae --likelihood gaussian --hidden-dim 512 --latent-dim 40 --num-layers 3 --use-batchnorm 1 --batch-size 512 --num-epochs 10000 --learning-rate 1e-06 --penalty-scale 1.0 --kl-beta 1e-07 --learn-sigma fix_all --bandwidth 0.1 --seed 196117 --batching-mode uniform --use-cuda
```

### UCI Adult Income

#### CoMP

```sh
python3 run.py --data-dir <data-dir> --output-dir <output-dir> --dataset uci-income --forward-use-groups 0 --model contrastive_cvae --likelihood gaussian --hidden-dim 64 --latent-dim 16 --num-layers 2 --use-batchnorm 1 --batch-size 4096 --num-epochs 10000 --learning-rate 0.0001 --kl-beta 1.0 --penalty-scale 0.5 --learn-sigma fix_all --seed 116983357 --use-cuda
```

## Results

Our model achieves the following performance on each dataset:

### Tumour / Cell Line Dataset

| Model | silhouette | kbet | mean-silhouette | |
|-------|------------|------|-----------------|---|
| VAE | 0.658 | 0.974 | 0.803 | 0.581 |
| CVAE | 0.554 | 0.931 | 0.684 | 0.571 |
| VFAE | 0.168 | 0.258 | 0.198 | 0.188 |
| trVAE | 0.096 | 0.163 | 0.138 | 0.123 |
| Celligner | 0.082 | 0.525 | 0.568 | 0.226 |
| CoMP (ours) | 0.023 | 0.160 | 0.094 | 0.101 |

### UCI Adult Income

| Model | Gender Acc. | Income Acc. | silhouette | kbet |
|-------|-------------|-------------|------------|------|
| Original data | 0.796 | 0.849 | 0.067 | 0.786 |
| VAE | 0.764 | 0.812 | 0.054 | 0.748 |
| CVAE | 0.778 | 0.819 | 0.054 | 0.724 |
| VFAE | 0.789 | 0.805 | 0.046 | 0.571 |
| trVAE | 0.698 | 0.808 | 0.066 | 0.731 |
| CoMP (ours) | 0.679 | 0.805 | 0.011 | 0.451 |

