# Select the Key, Then Generate the Rest: Improving Multi-Modal Learning with Limited Data Budget

## Table of Contents

- [Setup Environment](#setup-environment)
- [Usage](#usage)

## Setup Environment
Set up the environment by running the following commands:
```bash
conda env create -f environment.yml
```

## Usage

1. Download the dataset. We use the aligned CMU-MOSI and CMU-MOSEI [data](https://drive.google.com/drive/folders/1BBadVSptOe4h8TWchkhWZRLJw8YG_aEi). For the language modality, word features are the 768-dimensional hidden states of a pre-trained BERT model. For the visual modality, each video frame is encoded with Facet to indicate the presence of 35 facial action units. For the acoustic modality, COVAREP is used to extract 74-dimensional features.
   
2. Run training and evaluation via:
```bash
bash run.sh
```
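
Before training, it can help to confirm that a loaded sample matches the feature sizes listed in step 1. The following is a minimal sketch, not part of this repository: the modality keys (`text`, `vision`, `audio`), the per-sample layout (one feature vector per time step), and the function name `check_aligned_sample` are all assumptions made for illustration. It also assumes that "aligned" means the three modalities share the same sequence length.

```python
# Feature sizes stated in step 1; the dictionary keys are hypothetical names.
EXPECTED_DIMS = {"text": 768, "vision": 35, "audio": 74}


def check_aligned_sample(sample):
    """Validate one sample given as {modality: sequence of per-step feature vectors}.

    Returns the shared sequence length, or raises ValueError on a mismatch.
    """
    seq_lens = set()
    for name, dim in EXPECTED_DIMS.items():
        steps = sample[name]
        for t, step in enumerate(steps):
            if len(step) != dim:
                raise ValueError(
                    f"{name}: step {t} has {len(step)} features, expected {dim}"
                )
        seq_lens.add(len(steps))
    # Assumption: aligned data has the same number of steps in every modality.
    if len(seq_lens) != 1:
        raise ValueError(f"modalities are not aligned: lengths {sorted(seq_lens)}")
    return seq_lens.pop()


# Synthetic stand-in for one utterance with 20 aligned time steps.
demo = {
    name: [[0.0] * dim for _ in range(20)]
    for name, dim in EXPECTED_DIMS.items()
}
seq_len = check_aligned_sample(demo)
print(seq_len)
```

The check relies only on `len()` and iteration, so it should also accept 2-D NumPy arrays of shape `(seq_len, dim)` if the downloaded files are stored that way.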
