<h1 align="center">
  Hierarchical Multimodal Variational Autoencoder (HMVAE)
</h1>


We implemented several multimodal variational autoencoders (VAEs). These models can generate data and are also useful for classification. They differ primarily in how they organize the latent space. In the figure below, x denotes the modalities, and g and z denote latent variables:

<div align="center">
<img src="img/gen_overview.png" width="600" align="center" alt="Inference and generative models.">
</div>

<div align="center">
(a) <a href="https://arxiv.org/abs/1802.05335">MVAE</a>,
<a href="https://arxiv.org/abs/1911.03393">MMVAE</a>;
(b)
[<a href="https://openaccess.thecvf.com/content_ECCV_2018/papers/Xun_Huang_Multimodal_Unsupervised_Image-to-image_ECCV_2018_paper.pdf">1</a>,
<a href="https://arxiv.org/abs/1805.11264">2</a>,
<a href="https://arxiv.org/abs/2002.06661">3</a>,
<a href="https://arxiv.org/abs/2006.08242">4</a>,
<a href="https://mds.inf.ethz.ch/fileadmin/user_upload/gcpr_daunhawer_camera_ready.pdf">5</a>,
<a href="https://arxiv.org/abs/2012.13024">6</a>];
(c) proposed
</div>

<br>
We consider multiple modalities such as image and text describing birds:
<br>
<br>

<div align="center">
<img src="img/data.png" width="600" align="center" alt="Exemplary data and Venn diagram">
</div>

<br>
We argue that modality-specific variations can depend on shared structure. For example, consider the seabird poses (sitting and flying) above, which are only found in the images but depend on the bird species.

<br>
<br>

The proposed HMVAE operationalizes this idea.


## Requirements

1) Set up the environment:

```setup
conda create --name hvae python=3.9
conda activate hvae
pip install -r requirements.txt
```

2) Prepare datasets

<details>
<summary>CUB dataset (click me)
</summary>

- Download the data from the [official dataset website](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). For example, on Ubuntu, run `pip install gdown` and then `gdown https://drive.google.com/u/0/uc?id=1hbzc_P1FuxMkcabkgn9ZKinBwW683j45`.
- Download the sentence features from the [StackGAN GitHub repository](https://github.com/hanzhanggit/StackGAN). For example, on Ubuntu, run `pip install gdown` and then `gdown https://drive.google.com/u/0/uc?id=0B3y_msrWZaXLT1BZdVdycDY5TEE`.
- [Download ResNet features](http://datasets.d2.mpi-inf.mpg.de/xian/xlsa17.zip) extracted by [Xian et al.](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/zero-shot-learning/zero-shot-learning-the-good-the-bad-and-the-ugly).
</details>


<details><summary>Oxford Flower dataset (click me)</summary>

- Download the [images](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz) from the [official dataset website](https://www.robots.ox.ac.uk/~vgg/data/flowers/) and specify their directory in [config_machine_specific.yaml](hyperparams/config_machine_specific.yaml).
- Run `python data/flowers/preprocess_images.py` to preprocess the images to 3x64x64 resolution.
- Run `python data/flowers/extract_features.py` to extract the ResNet features.
- Download the sentence features (see the above instructions for the CUB dataset).
</details>


3) Specify directories in [config_machine_specific.yaml](hyperparams/config_machine_specific.yaml).

4) Install LaTeX, which is required by Matplotlib:
```setup
sudo apt-get install texlive-latex-extra texlive-fonts-recommended dvipng cm-super
```
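
Before starting a long run, it can help to confirm that the LaTeX tools are actually on your `PATH`. The snippet below is a minimal sketch and not part of this repository: the helper name `latex_available` is hypothetical, and the list of executables (`latex`, `dvipng`) is an assumption based on the packages installed above.

```python
import shutil


def latex_available(required=("latex", "dvipng")):
    """Return True if every executable in `required` is found on PATH.

    The default tuple is an assumption about what Matplotlib's LaTeX
    text rendering needs; adjust it to your setup.
    """
    return all(shutil.which(tool) is not None for tool in required)


if __name__ == "__main__":
    if latex_available():
        print("LaTeX toolchain found.")
    else:
        print("LaTeX toolchain missing; install it before plotting.")
```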

## Training
Run the command below to train a model.

```bash
python run.py --model multimodal_vae_moe --dataset flowers
```

<details><summary>Options (click me)</summary>

[--model]
- **'multimodal_vae_moe'**: VAE for two modalities using a mixture of experts posterior q(g|x_{1:M}) (MMVAE or HMVAE)
- **'multimodal_vae_poe'**: VAE for two modalities using a product of experts posterior q(g|x_{1:M}) (MVAE)

[--dataset]
- **'cub_ft'**: CUB dataset with image feature vectors R^2048 and caption feature vectors R^1024
- **'flowers_ft'**: Oxford Flower dataset with image feature vectors R^1024 and caption feature vectors R^1024
- **'flowers'**: Oxford Flower dataset with images R^{3x64x64} and caption feature vectors R^1024
- **'synthetic_data'**: synthetic dataset, where x_1 in R^2 and x_2 are labels (not used in paper)

[--gpu]
</details>
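
To illustrate the difference between the two `--model` options, the sketch below combines two one-dimensional Gaussian experts q(g|x_1) and q(g|x_2) once as a product and once as a uniform mixture. It is a simplified illustration under stated assumptions, not the code used in this repository: the function names and example values are hypothetical, the product omits the prior expert that MVAE includes, and the mixture uses equal weights.

```python
import random


def product_of_experts(mus, variances):
    """Combine 1-D Gaussian experts by multiplying their densities.

    The renormalized product of Gaussians is again Gaussian: its precision
    is the sum of the expert precisions, and its mean is the
    precision-weighted average of the expert means.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total_precision
    return mu, 1.0 / total_precision


def sample_mixture_of_experts(mus, variances, rng):
    """Draw one sample from a uniform mixture of 1-D Gaussian experts."""
    m = rng.randrange(len(mus))  # pick one expert with equal probability
    return rng.gauss(mus[m], variances[m] ** 0.5)


# Hypothetical unimodal posteriors q(g|x_1) = N(0, 1) and q(g|x_2) = N(2, 1).
mus, variances = [0.0, 2.0], [1.0, 1.0]

poe_mu, poe_var = product_of_experts(mus, variances)
moe_sample = sample_mixture_of_experts(mus, variances, random.Random(0))

print(f"PoE posterior: mean={poe_mu}, variance={poe_var}")
print(f"One MoE sample: {moe_sample}")
```

The product yields a single, sharper Gaussian that sits between the experts, whereas the mixture keeps each expert's mode and samples from one of them at a time.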

<br/>

You can adjust the model hyperparameters in [hyperparams/](hyperparams/) (this package has a dedicated README). All models support any number of hierarchical levels, from 1 to N. A two-level multimodal hierarchical VAE trains in less than five hours on a single GPU on the datasets CUB (with image feature vectors) and Oxford Flower (both images and image feature vectors).

Note that we provide further implementations in dedicated packages:
- [Multimodal disentanglement VAEs](disentanglement_vae/)

## Evaluation
To evaluate a trained model, run the command below, replacing `run_id` with the ID of your run.
The code then generates experimental artifacts as specified in [config_general.yaml](/hyperparams/config_general.yaml).

```bash
python eval.py --id run_id --split test
```


## Miscellaneous

We provide the code "as-is". We will do our best to answer and resolve reported issues, but we cannot guarantee it.


<details>
<summary>Directory structure that is generated by code (click me)
</summary>

The directory structure is automatically generated:

```
├── expt
│   └── model
│       └── dataset
│           └── subdir                      # e.g., create directory for runs in paper
│               └── exp_name                # e.g., hierarchical_moe/
│                   └── trial               # e.g., without_regularization/
│                       └── run_path        # run_id/
│                           └── split       # e.g., train/ or val/
│                               └── artifact_dir  # e.g., epoch_750/
```
</details>
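
If you want to locate artifacts programmatically, the layout above can be mirrored with `pathlib`. This is a minimal sketch, not a function from this repository: the helper `artifact_dir`, its parameter names, and the example values (including the `paper` subdirectory) are hypothetical, and it assumes that the `model` and `dataset` directories are named after the `--model` and `--dataset` options.

```python
from pathlib import Path


def artifact_dir(root, model, dataset, subdir, exp_name, trial, run_id, split, artifact):
    """Build the path to an artifact directory following the tree above."""
    return (
        Path(root) / "expt" / model / dataset / subdir
        / exp_name / trial / run_id / split / artifact
    )


# Hypothetical example values taken from the comments in the tree above.
path = artifact_dir(
    root=".",
    model="multimodal_vae_moe",
    dataset="flowers",
    subdir="paper",
    exp_name="hierarchical_moe",
    trial="without_regularization",
    run_id="run_id",
    split="val",
    artifact="epoch_750",
)
print(path.as_posix())
```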

<details>
<summary>References (click me)
</summary>

The code is inspired by the following excellent repositories:
- [BIVA](https://github.com/vlievin/biva-pytorch)
- [Ladder VAE Reimplementation](https://github.com/addtt/ladder-vae-pytorch)
- [MMVAE](https://github.com/iffsid/mmvae)
- [VDVAE](https://github.com/openai/vdvae)
</details>




### Citation
We will include the BibTeX entry here once the paper is published.
