## Multi-Modal Self-Supervision from Generalized Data Transformations 

This repo contains the implementation of GDT (Generalized Data Transformations), which learns representations from multi-modal data in a self-supervised way. 

![Teaser Image](misc_files/GDT_splash.png)


## Highlights

**(1) Formulate and generalize any pretext task in a noise-contrastive estimation (NCE) objective.** 

Using this formulation, we test various pretext tasks previously unexplored and achieve SOTA downstream performance. 

**(2) Enforcing distinctiveness, rather than invariance, to time shift and time reversal leads to better representations**

We test learning distinctiveness and invariance to different learning signals, such as time reversal and time shift, and find that distinctiveness is consistently better, contrary to results found in the image domain.  
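
To make the two highlights concrete, here is a minimal sketch of an InfoNCE-style contrastive loss in plain Python. It is an illustration only, not the code used in this repo: the function names, the temperature value, and the toy embeddings are all assumptions. The idea it shows is that being *invariant* to a transformation amounts to treating the transformed sample as a positive, while being *distinctive* amounts to treating it as a negative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def nce_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor embedding (illustrative only)."""
    a = normalize(anchor)
    pos = [math.exp(dot(a, normalize(p)) / temperature) for p in positives]
    neg = [math.exp(dot(a, normalize(n)) / temperature) for n in negatives]
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))


# Toy embeddings (made up for illustration).
video = [1.0, 0.2, 0.0]            # video embedding of a clip
audio = [0.9, 0.3, 0.1]            # audio of the same clip
audio_shifted = [0.6, 0.7, 0.1]    # audio of a time-shifted clip
audio_other = [-0.5, 0.1, 0.9]     # audio from a different video

# Invariance to time shift: the shifted sample counts as a positive.
loss_invariant = nce_loss(video, [audio, audio_shifted], [audio_other])

# Distinctiveness to time shift: the shifted sample counts as a negative.
loss_distinctive = nce_loss(video, [audio], [audio_shifted, audio_other])
```

Swapping which list a transformed sample goes into is the only change between the two variants, which is what lets many pretext tasks share one objective.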

## Installation

This repo was tested with Ubuntu 16.04.5 LTS, Python 3.7.5, PyTorch 1.3.1, Torchvision 0.4.1, and CUDA 10.0. 

### Step 1

- Clone this repo to your local machine

### Step 2

- Install required packages using `conda env create -f environment.yml`

### Step 3

- Activate the conda environment using `conda activate gdt`, then move (or symlink) all datasets to `./data/`, e.g. via `ln -s path/to/datax ./data/datax`.

### Step 4

- See below for how to pretrain GDT or benchmark pretrained models

## GDT pretraining
```
sbatch pretraining_scripts/cross_modal_sample_dist.sh
```
Before submitting the job, replace the `XXX` placeholders in the SLURM script:
- SBATCH directives
- `SAV_FOLDER`
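
As a quick sanity check before submitting, a small helper like the one below can list any lines that still contain the placeholder. This is a hypothetical convenience sketch, not part of the repo; the example script text is made up for illustration.

```python
def find_placeholders(script_text, marker="XXX"):
    """Return (line_number, line) pairs that still contain the placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(script_text.splitlines(), start=1)
        if marker in line
    ]


# Made-up script contents, only to demonstrate the helper.
example_script = """#!/bin/bash
#SBATCH --partition=XXX
SAV_FOLDER=XXX
echo "training command goes here"
"""

remaining = find_placeholders(example_script)
for number, line in remaining:
    print(f"line {number}: {line}")
```

To check the real script, pass in the contents of `pretraining_scripts/cross_modal_sample_dist.sh` instead of `example_script`.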

## Pretrained Models

Pretrained weights can be downloaded from [Dropbox](https://www.dropbox.com/sh/gga2ctjzyg12soc/AAAUiO45hZIyJQM2kh7CMRESa?dl=0).  
Unzip the archive and place the **model_weights** folder in the root directory of the repo.  

## Benchmarking

**Video Action Recognition Benchmarking**. 

To evaluate weights on video action recognition, run the following:
```
python3 benchmark_video_recognition.py --dataset {ucf101, hmdb51} --fold {1,2,3} --weights-path {WEIGHTS_PATH}
```

*HMDB-51*
|                 | 1 | 2 | 3 | 3-fold | 
| -------------   | - | - | -  |  - | 
| GDT (Kinetics)  | 60.81 | 60.53 | 58.89 | 60.0 | 
| GDT (VGGSound)  | 61.60 | 63.59 | 61.00 | 62.1 |
| GDT (Audioset)  | 67.77 | 64.92 | 64.99 | 65.9 | 
| GDT (IG65M)     | 73.94 | 72.76 | 71.62 | 72.8 |

*UCF-101*
|                 | 1 | 2 | 3 | 3-fold | 
| -------------   | - | - | -  |  - | 
| GDT (Kinetics)  | 87.86 | 89.30 | 90.61 | 89.3 | 
| GDT (VGGSound)  | 90.06 | 90.23 | 90.58 | 90.3 |
| GDT (Audioset)  | 92.24 | 93.18 | 91.95 | 92.5 | 
| GDT (IG65M)     | 94.67 | 94.89 | 95.92 | 95.2 |
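
The per-fold runs above can be scripted. The sketch below builds one command per fold using the flags shown in this section and averages the per-fold accuracies. It assumes the 3-fold column is the arithmetic mean of the three folds rounded to one decimal, which matches most rows of the tables. The weights path is a placeholder.

```python
from statistics import mean


def fold_commands(dataset, weights_path, folds=(1, 2, 3)):
    """Build one benchmark command (as an argv list) per fold."""
    return [
        ["python3", "benchmark_video_recognition.py",
         "--dataset", dataset,
         "--fold", str(fold),
         "--weights-path", weights_path]
        for fold in folds
    ]


def three_fold_average(accuracies):
    """Mean accuracy over folds, rounded to one decimal place."""
    return round(mean(accuracies), 1)


# "path/to/weights" is a placeholder, not a real file in this repo.
commands = fold_commands("hmdb51", "path/to/weights")

# Per-fold numbers from the GDT (IG65M) row of the HMDB-51 table.
average = three_fold_average([73.94, 72.76, 71.62])
```

Each entry of `commands` could be launched with `subprocess.run(cmd, check=True)`.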

**Video Retrieval Benchmarking**.

To evaluate weights on video retrieval, run the following:
```
python3 benchmark_video_retrieval.py --dataset {ucf101, hmdb51} --fold 1 --weights-path {WEIGHTS_PATH}
```

*HMDB-51*
|                 | 1 | 5 | 10 | 20 | 50 | 
| -------------   | - | - | -  |  - | -  | 
| GDT (Kinetics)  | 25.4 | 51.4 | 63.9 | 75.0 | 87.8 |
| GDT (VGGSound)  | 28.4 | 55.1 | 67.2 | 79.3 | 91.1 | 
| GDT (Audioset)  | 30.6 | 58.0 | 69.8 | 79.9 | 91.0 |
| GDT (IG65M)     | 36.1 | 61.1 | 70.8 | 79.7 | 92.1 |

*UCF-101*
|                 | 1 | 5 | 10 | 20 | 50 | 
| -------------   | - | - | -  |  - | -  | 
| GDT (Kinetics)  | 57.4 | 73.4 | 80.8 | 88.1 | 92.9 |
| GDT (VGGSound)  | 63.4 | 79.6 | 85.0 | 90.1 | 95.2 |
| GDT (Audioset)  | 65.9 | 82.6 | 88.2 | 92.2 | 96.6 |
| GDT (IG65M)     | 75.7 | 87.2 | 90.7 | 93.5 | 96.6 |


**Video Few-Shot Benchmarking**. 

To evaluate weights on video few-shot recognition, run the following:
```
python3 benchmark_few_shot.py --dataset {ucf101, hmdb51} --fold 1 --weights-path {WEIGHTS_PATH}
```

*HMDB-51*
|                 | 1 | 5 | 20 |
| -------------   | - | - | -  | 
| GDT (Kinetics)  | 13.4 | 15.6 | 20.8 |

*UCF-101*
|                 | 1 | 5 | 20 |
| -------------   | - | - | -  | 
| GDT (Kinetics)  | 26.3 | 42.4 | 49.4 |


**Audio Benchmarking**. 

To evaluate weights on audio-feature linear-layer finetuning, run the following:
```
python3 linearprobe_audio.py --dataset {dcase2014, esc50} --fold {1,2,3,4,5} --weights-path {WEIGHTS_PATH}
```
*ESC-50/DCASE2014*
|                 | 1 | 2 | 3 | 4 | 5 | 5-fold | 
| -------------   | - | - | -  |  - | - | - |
| GDT (ESC50)  | 88.82 | 89.63 | 86.79 | 90.60 | 87.34 | 88.6 |
| GDT (DCASE2014)  | 98 | na | na  | na | na | 98 |


To evaluate weights on audio-feature full-finetuning, run the following:
```
python3 finetune_audio.py --weights-path {WEIGHTS_PATH}
```
*VGG-Sound*
|                 | mAP | AUC | d-prime |
| -------------   | - | - | -  | 
| GDT (VGGSound)  | 54.8 | 97.5 | 2.77 |
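
For reference, d-prime is commonly derived from AUC as d' = sqrt(2) * Φ⁻¹(AUC), where Φ⁻¹ is the inverse CDF of the standard normal distribution. The sketch below assumes that this standard definition is the one used here (this README does not state it). Under that assumption it reproduces the d-prime value in the table from the reported AUC.

```python
import math
from statistics import NormalDist


def d_prime_from_auc(auc):
    """Convert AUC (as a fraction in (0, 1)) to d-prime: sqrt(2) * inverse normal CDF."""
    return math.sqrt(2) * NormalDist().inv_cdf(auc)


# AUC of 97.5% from the VGG-Sound table above.
print(round(d_prime_from_auc(0.975), 2))  # 2.77
```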
