# CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs

This repository contains the code for our paper submission "CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs" (anonymous submission).



## Data Preparation

Before running the pipeline, please prepare the following datasets:

1. **LAION-400M**: Used for pre-training the generator
   - Accessed through [webdataset](https://github.com/webdataset/webdataset)
   - Install: `pip install webdataset`
   - The dataset is read in TAR format from the directory passed as `--image_input_dir` in the pre-training step

2. **LLaVA-Instruct-150K**: Used for vision-language instruction tuning
   - Automatically downloaded via HuggingFace

3. **Video-MME Dataset**: Used for video finetuning
   - Automatically downloaded via HuggingFace

4. **MMBench-Video**: Used for evaluation
   - Automatically downloaded via HuggingFace

5. **MME Dataset**: Used for evaluation
   - Automatically downloaded via HuggingFace
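
Before launching pre-training, it can help to confirm that the LAION-400M TAR shards from item 1 are readable. The following is a minimal sketch using only the Python standard library, not part of the released code. It relies on the WebDataset convention that files sharing a base name (for example `000001.jpg` and `000001.txt`) form one sample; the function names `group_samples` and `summarize_shards` are hypothetical.

```python
import tarfile
from collections import defaultdict
from pathlib import Path


def group_samples(tar_path):
    """Group the regular files of one TAR shard by sample key.

    The key is the member path up to the first dot of the file name;
    everything after that dot is treated as the extension.
    """
    samples = defaultdict(list)
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            directory, _, filename = member.name.rpartition("/")
            stem, _, extension = filename.partition(".")
            key = f"{directory}/{stem}" if directory else stem
            samples[key].append(extension)
    return {key: sorted(exts) for key, exts in samples.items()}


def summarize_shards(shard_dir):
    """Return {shard file name: number of samples} for every .tar in shard_dir."""
    return {
        path.name: len(group_samples(path))
        for path in sorted(Path(shard_dir).glob("*.tar"))
    }
```

Calling `summarize_shards("/path/to/laion400m/tars")` lists each shard with its sample count; an empty result suggests that the directory does not contain `.tar` files at its top level.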

## Training Pipeline

Our method follows a three-stage training process:

### 1. Pre-training on LAION-400M

The first stage pre-trains the generator network on LAION-400M:

```bash
torchrun --nproc_per_node=8 pretrain.py --image_input_dir /path/to/laion400m/tars --save_path ./weights/laion/
```

### 2. Visual Finetuning

The second stage finetunes the pre-trained generator on visual instruction data, starting from the stage 1 checkpoint:

```bash
torchrun --nproc_per_node=8 visual_finetuning.py --generator_checkpoint ./weights/laion/checkpoint_latest.pth --save_path ./weights/visual/
```

### 3. Video Finetuning

The final stage finetunes the generator on video data, starting from the stage 2 checkpoint:

```bash
torchrun --nproc_per_node=8 video_finetuning.py --generator_checkpoint ./weights/visual/checkpoint_latest.pth --save_path ./weights/video/
```
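
Each stage reads the `checkpoint_latest.pth` written by the previous one, so the three commands can be chained from a single script. The sketch below is one possible way to do that and is not part of the released code: `build_stage_commands` and `run_pipeline` are hypothetical helpers, and the script names and flags are simply those shown in the commands above.

```python
import shlex
import subprocess

CHECKPOINT_NAME = "checkpoint_latest.pth"  # file name used in the commands above


def build_stage_commands(laion_tar_dir, weights_root="./weights", nproc_per_node=8):
    """Return the three training commands as argument lists, in order."""
    root = str(weights_root).rstrip("/")
    save = {stage: f"{root}/{stage}/" for stage in ("laion", "visual", "video")}
    launcher = ["torchrun", f"--nproc_per_node={nproc_per_node}"]
    return [
        launcher + ["pretrain.py",
                    "--image_input_dir", str(laion_tar_dir),
                    "--save_path", save["laion"]],
        launcher + ["visual_finetuning.py",
                    "--generator_checkpoint", save["laion"] + CHECKPOINT_NAME,
                    "--save_path", save["visual"]],
        launcher + ["video_finetuning.py",
                    "--generator_checkpoint", save["visual"] + CHECKPOINT_NAME,
                    "--save_path", save["video"]],
    ]


def run_pipeline(commands):
    """Run the stages in order; check=True stops at the first failing stage."""
    for command in commands:
        print("Running:", shlex.join(command))
        subprocess.run(command, check=True)
```

For example, `run_pipeline(build_stage_commands("/path/to/laion400m/tars"))` runs the same three commands listed above. Stopping at the first failure avoids starting a later stage from a missing or stale checkpoint.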

## Generating Adversarial Videos

After completing all three training stages, generate adversarial videos with the final (video-finetuned) checkpoint:

```bash
python generate.py --generator_checkpoint ./weights/video/checkpoint_latest.pth --video_input_dir /path/to/mmbench-video/video --video_output_dir /path/to/mmbench-video-adv/video
```
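
Before moving on to evaluation, it may be worth checking that every input video has an adversarial counterpart. The following sketch assumes that `generate.py` writes each output under the same relative file name as its input; that behavior is an assumption, not something documented here. The extension list and the function names are likewise illustrative.

```python
from pathlib import Path

# Assumption: the benchmark videos use one of these container formats.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mkv", ".mov", ".webm"}


def list_videos(directory):
    """Return the relative POSIX paths of all video files under directory."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS
    }


def missing_adversarial_videos(video_input_dir, video_output_dir):
    """List input videos with no same-named file in the output directory."""
    return sorted(list_videos(video_input_dir) - list_videos(video_output_dir))
```

An empty list from `missing_adversarial_videos("/path/to/mmbench-video/video", "/path/to/mmbench-video-adv/video")` means that, under the naming assumption above, every clean video was processed.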

## Evaluation

We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for evaluation. To evaluate our adversarial videos:

1. Clone the VLMEvalKit repository:
   ```bash
   git clone https://github.com/open-compass/VLMEvalKit.git
   cd VLMEvalKit
   ```

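2. Replace the clean MMBench-Video videos with the adversarial videos generated above (the `--video_output_dir` from the generation step), in the directory where VLMEvalKit stores the dataset.

3. Configure your API key in the VLMEvalKit evaluation config file.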

4. Run the evaluation script:
   ```bash
   python eval.py --model YOUR_MODEL_NAME --task mmbench_video
   ```
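
Step 2 overwrites the clean benchmark videos, so keeping a backup makes it easy to return to the clean baseline later. The sketch below shows one way to do the swap with the standard library. The helpers `swap_in_adversarial` and `restore_clean` are hypothetical, and the location of the clean video directory depends on your VLMEvalKit setup.

```python
import shutil
from pathlib import Path


def swap_in_adversarial(clean_dir, adversarial_dir, backup_dir):
    """Move the clean videos to backup_dir, then copy the adversarial ones in."""
    clean, adversarial, backup = Path(clean_dir), Path(adversarial_dir), Path(backup_dir)
    if not clean.is_dir():
        raise NotADirectoryError(f"clean video directory not found: {clean}")
    if not adversarial.is_dir():
        raise NotADirectoryError(f"adversarial video directory not found: {adversarial}")
    if backup.exists():
        # Refuse to overwrite an earlier backup of the clean videos.
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.move(str(clean), str(backup))
    shutil.copytree(adversarial, clean)
    return backup


def restore_clean(clean_dir, backup_dir):
    """Undo swap_in_adversarial by putting the backed-up clean videos back."""
    clean, backup = Path(clean_dir), Path(backup_dir)
    if not backup.is_dir():
        raise NotADirectoryError(f"backup directory not found: {backup}")
    if clean.exists():
        shutil.rmtree(clean)
    shutil.move(str(backup), str(clean))
```

Run `swap_in_adversarial` before the evaluation script and `restore_clean` afterwards to evaluate clean and adversarial videos with the same VLMEvalKit checkout.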

## Requirements

- PyTorch >= 2.0
- CUDA 11.8 or higher
- webdataset
- decord (for video processing)
- transformers
- Other dependencies listed in `requirements.txt` (install with `pip install -r requirements.txt`)
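
A quick way to confirm that the packages above are importable before starting a long training run is sketched below. It uses only the standard library and is not part of the released code. The module names in `REQUIRED_MODULES` are the usual import names for the listed packages, and the helper names are hypothetical.

```python
import importlib.metadata
import importlib.util

# Import names for the packages listed above (PyTorch is imported as "torch").
REQUIRED_MODULES = ["torch", "webdataset", "decord", "transformers"]


def find_missing(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        return None
```

`find_missing()` returns an empty list when all four packages are available, and `installed_version("torch")` can be compared by eye against the PyTorch >= 2.0 requirement.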
