# DENOISER: Rethinking Robustness for Open-Vocabulary Action Recognition

## Prerequisites
This code requires the same libraries as ActionCLIP, plus the following:
```
editdistance
nlpaug
pyspellchecker
neuspell
```
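
To confirm the extra packages are installed before running anything, a small check like the following may help. It is a minimal sketch, not part of this repository: the helper name `missing_packages` is hypothetical, and the mapping assumes the usual import names (in particular, `pyspellchecker` is imported as `spellchecker`).

```python
from importlib.util import find_spec

# pip name -> import name for the extra packages listed above.
# Assumption: pyspellchecker is imported as `spellchecker`.
EXTRA_MODULES = {
    "editdistance": "editdistance",
    "nlpaug": "nlpaug",
    "pyspellchecker": "spellchecker",
    "neuspell": "neuspell",
}


def missing_packages(modules=EXTRA_MODULES):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in modules.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only locates the modules without importing them, so it stays fast even for heavy packages.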
## Dataset preparation
We prepare our dataset the same way as ActionCLIP does.

In general:

1. Extract frames from videos into separate folders, one for each video
2. Write `path_to_folder`, `#num_frames`, and `#classid` for each video on one line of a list file, separated by spaces:
```
path_to_folder #num_frames #classid
path_to_folder #num_frames #classid
...
```
3. Update the path to this file in the config:
```
data:
  val_list: # path_to_this_file
```
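
Steps 1 and 2 above can be sketched in Python as follows. This is an illustrative helper rather than a script shipped with this repository: the function name `build_list_file`, the `frame_root/<class_name>/<video_name>/` layout, and the `class_ids` mapping are all assumptions, so adapt them to however your frames are stored.

```python
import os


def build_list_file(frame_root, class_ids, out_path):
    """Write one 'path_to_folder num_frames classid' line per video folder.

    Assumptions (not taken from the repository):
    - frame_root/<class_name>/<video_name>/ holds the extracted frames
    - class_ids maps each class name to its integer class id
    """
    lines = []
    for class_name in sorted(class_ids):
        class_dir = os.path.join(frame_root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip classes with no extracted videos
        for video in sorted(os.listdir(class_dir)):
            video_dir = os.path.join(class_dir, video)
            if not os.path.isdir(video_dir):
                continue  # ignore stray files next to the video folders
            # Count every regular file in the folder as one frame.
            num_frames = sum(
                1
                for name in os.listdir(video_dir)
                if os.path.isfile(os.path.join(video_dir, name))
            )
            lines.append(f"{video_dir} {num_frames} {class_ids[class_name]}")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + ("\n" if lines else ""))
    return lines
```

Because the fields are separated by spaces, folder paths that contain spaces would break the format, so keep them space-free.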

## Pretrained Models
We use ActionCLIP [K400-pretrained models](https://github.com/sallymmx/ActionCLIP?tab=readme-ov-file#kinetics-400) for zero-shot inference on K700, HMDB51 and UCF101.

Download them and set their path in the config. For example, in `./configs/ucf101/ucf_zero_shot.yaml`:
```
pretrain: # path to pretrained model
```
## Getting started
### Cache visual features
```
python cache_features.py
```
Visual features and their class ids will be cached in:
```
features
├── class_id_list
└── image_features_list
```
### Test robustness of ActionCLIP
```
bash test_vanilla.sh
```
### Test DENOISER
```
bash test_DENOISER.sh
```


## Acknowledgments
Our code is based on [ActionCLIP](https://github.com/sallymmx/ActionCLIP).