# Flex-M3: Learning Flexible Large Multimodal Models with Arbitrary Modality Combinations

# Setup

## Install Dependencies

```bash
conda create -n flex python=3.8
conda activate flex
pip install -e .
```


## Download Models

# Dataset Preparation & Feature Extraction
We preprocess the data following [CREMA](https://github.com/Yui010206/CREMA). Preprocessed features for some datasets are available below:

| Dataset | Multimodal Features |
| :----    |    :----  | 
| SQA3D | [Video Frames](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym), [Depth Map](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym), [Surface Normals](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym) |
| MUSIC-AVQA | [Video Frames](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym), [Optical Flow](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym), [Depth Map](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym), [Surface Normals](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym) |
| NExT-QA | [Video Frames](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym), [Depth Map](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym), [Optical Flow](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym), [Surface Normals](https://drive.google.com/drive/folders/15b_IdwsbLU9iZPxR1Is3pObl-Le8v1ym) |

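After downloading, it can help to confirm that every modality folder is in place before launching a run. The helper below is only a sketch: the root `data/nextqa` and the folder names `frames`, `depth`, `flow`, and `norm` are hypothetical placeholders, not a layout this repository is known to require. Adjust them to match the CREMA preprocessing output and the paths used in your configs.

```bash
#!/usr/bin/env bash
# Report which modality folders are missing or empty under a dataset root.
# Usage: check_modalities <root> <subdir>...
# Returns 0 if every folder exists and is non-empty, 1 otherwise.
check_modalities() {
  local root="$1"; shift
  local missing=0
  local sub
  for sub in "$@"; do
    if [ -d "$root/$sub" ] && [ -n "$(ls -A "$root/$sub" 2>/dev/null)" ]; then
      echo "ok: $root/$sub"
    else
      echo "missing or empty: $root/$sub"
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical layout (assumption); replace with your own paths.
check_modalities data/nextqa frames depth flow norm \
  || echo "Some features still need to be downloaded or extracted."
```

The function only checks that each folder exists and contains at least one entry; it does not validate file counts or formats.
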
# Training and Inference
We provide example training and inference scripts for NExT-QA:

## 1) Training

```bash
sh run_scripts/finetune.sh
```

## 2) Inference

```bash
sh run_scripts/inference.sh
```
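
The two scripts above can also be chained so that inference starts only if training exits successfully, with the output of each step kept in a log file. This is a minimal sketch, not part of the repository: `run_step` and the `logs/` directory are names introduced here for illustration, and the sketch assumes `run_scripts/inference.sh` already points at the checkpoint you want to evaluate.

```bash
#!/usr/bin/env bash
# Run a command, capture stdout and stderr in a log file, then print the log
# and return the command's exit status.
# Usage: run_step <log_file> <command> [args...]
run_step() {
  local log="$1"; shift
  local status=0
  "$@" > "$log" 2>&1 || status=$?
  cat "$log"
  return "$status"
}

# Only run when both scripts are present (i.e. from the repository root).
if [ -f run_scripts/finetune.sh ] && [ -f run_scripts/inference.sh ]; then
  mkdir -p logs  # hypothetical log directory (assumption)
  run_step logs/finetune.log sh run_scripts/finetune.sh &&
    run_step logs/inference.log sh run_scripts/inference.sh
fi
```

Output is printed only after each step finishes; to watch a long training run live, follow the log from another terminal with `tail -f logs/finetune.log`.
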
