# Exploring High-Order Self-Similarity for Video Understanding
This is the official PyTorch implementation of the paper "Exploring High-Order Self-Similarity for Video Understanding".


## Installation
- PyTorch == 1.12.1
- TorchVision == 0.13.1
- cudatoolkit == 11.3.1
- fvcore == 0.1.6
- timm == 0.4.12
- RandAugment
- einops
- pprint
- dotmap
- yaml
- wandb


## Data Preparation
For efficient training, we pre-extract frames from videos to speed up data loading. For detailed instructions on video pre-processing, see the [MMAction2](https://github.com/open-mmlab/mmaction2/tree/main/tools/data) repository. Each dataset uses plain-text annotation files in which every line has the format `<filename> <#frames> <class index>`, with fields separated by whitespace, as shown below:
```text
zumba/yGdQwxP5koA_000083_000093 300 399
playing_paintball/DOL1_JLWeoo_000321_000331 300 240
```
You can find all annotation files for the datasets used in our experiments within the [lists](lists) directory, organized by dataset name.
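
To illustrate the annotation format, here is a minimal sketch of a parser for a single line. It is not the data loader used in this repository; the names `VideoRecord` and `parse_annotation_line` are hypothetical and chosen only for this example. It assumes that the filename itself contains no whitespace.

```python
from typing import NamedTuple


class VideoRecord(NamedTuple):
    """One annotation line (hypothetical container, not part of this repo)."""
    path: str        # the <filename> field
    num_frames: int  # the <#frames> field
    label: int       # the <class index> field


def parse_annotation_line(line: str) -> VideoRecord:
    """Parse a line of the form '<filename> <#frames> <class index>'."""
    fields = line.split()  # split on any run of whitespace
    if len(fields) != 3:
        raise ValueError(
            f"expected 3 whitespace-separated fields, got {len(fields)}: {line!r}"
        )
    path, num_frames, label = fields
    return VideoRecord(path, int(num_frames), int(label))


record = parse_annotation_line("zumba/yGdQwxP5koA_000083_000093 300 399")
print(record.path, record.num_frames, record.label)
```

A check like this can be run over a whole annotation file before training to catch malformed lines early.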



## Train
- After completing Data Preparation, set the data root paths in the config file of the model you want to run. For example, to run MOSS-L on Something-Something V2, edit the [config](configs/sthv2/sthv2_train_clip_moss-l.yaml) and change:
```yaml
data:
    train_root: "<STHV2_ROOT>"
    val_root: "<STHV2_ROOT>"
```
to point to the folder where your extracted frames are located.
- Next, run the training script. For example, to train MOSS-L on Something-Something V2, use the following command, replacing `EXPR_NAME_HERE` with a name for your experiment:
```sh
sh scripts/run_train_vision.sh 0 configs/sthv2/sthv2_train_clip_moss-l.yaml EXPR_NAME_HERE
```

## Test
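- Run the following command to test the model. The flags `--test_crops 3 --test_clips 2` correspond to the 3 spatial crops and 2 temporal clips of the 16x3x2 input setting listed for Something-Something V2 in the Logs table.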
```sh
sh scripts/run_test_vision.sh 0 configs/sthv2/sthv2_train_clip_moss-l.yaml exps/EXPR_NAME_HERE/model_best.pt --test_crops 3 --test_clips 2
```

## Logs

- We provide training and testing logs for our MOSS-L models. Checkpoints will be made publicly available after the paper is accepted.

- *Input = #input frames x #spatial crops x #temporal clips*

| Model | Dataset | Input | Top-1 Acc. (%) | Logs |
|:------------:|:-------------------:|:------------------:|:-----------------:|:-----------------:|
| MOSS-L | Kinetics-400 | 16x3x4 | 87.7 | [log](logs/moss_l_16x224_k400.txt) |
| MOSS-L | Something-Something V1 | 16x3x2 | 64.8 | [log](logs/moss_l_16x224_sthv1.txt) |
| MOSS-L | Something-Something V2 | 16x3x2 | 74.4 | [log](logs/moss_l_16x224_sthv2.txt) |
| MOSS-L | Diving-48 | 32x3x2 | 92.7 | [log](logs/moss_l_32x224_diving48.txt) |
| MOSS-L | FineGym (99 classes) | 32x3x2 | 94.7 | [log](logs/moss_l_32x224_finegym99.txt) |
| MOSS-L | FineGym (288 classes) | 32x3x2 | 71.1 | [log](logs/moss_l_32x224_finegym288.txt) |
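
To make the Input column concrete, the following sketch decodes a string such as `16x3x2` into its three factors and derives the number of views evaluated per video. The helper `parse_input_spec` is hypothetical and not part of this repository. It assumes the common multi-view testing convention in which each spatial crop of each temporal clip is one view.

```python
def parse_input_spec(spec: str) -> dict:
    """Decode '<#frames>x<#spatial crops>x<#temporal clips>' into its parts."""
    frames, crops, clips = (int(v) for v in spec.lower().split("x"))
    views = crops * clips  # assumed: one view per (crop, clip) pair
    return {
        "frames": frames,                # frames sampled per clip
        "crops": crops,                  # spatial crops per clip
        "clips": clips,                  # temporal clips per video
        "views": views,                  # views evaluated per video
        "total_frames": frames * views,  # frames processed per video
    }


print(parse_input_spec("16x3x2"))
```

For `16x3x2` this gives 6 views and 96 frames per video; for the Kinetics-400 setting `16x3x4` it gives 12 views and 192 frames.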