# VEBench

This repository provides scripts for each stage of the VEBench pipeline:

1. Inference scripts for video editing models
2. Scripts for processing human annotation results and calculating inter-annotator agreement
3. Scripts for calculating MLLM-generated scores and their correlation with human annotations

The overall directory structure is as follows:

```
├── annotations
│   ├── data_worker_1.csv
│   ├── data_worker_2.csv
│   ├── data_worker_3.csv
│   └── data_worker_4.csv
├── inference_MLLMs
│   ├── LLaVA-OneVision.py
│   └── VideoLLaMA-2.py
├── inference_Video_Edting_Models
│   ├── FateZero.py
│   ├── pix2video.py
│   ├── RAVE.py
│   ├── README.md
│   ├── template_bash.py
│   ├── Text2Video-Zero.py
│   ├── TokenFlow.py
│   ├── Tune-A-Video.py
│   ├── vid2vid-zero.py
│   └── VidToMe.py
├── inter-annotator-agreement.py
├── labeled_full.csv
├── MLLM_outputs
│   ├── Gemini-pro
│   │   ├── 000000.json
│   │   ├── 000001.json
│   │   ...
│   ├── LLaVA-OneVision-7B
│   │   ├── 000000.json
│   │   ├── 000001.json
│   │   ...
│   └── VideoLLaMA2
│       ├── 000000.json
│       ├── 000001.json
│       ...
├── mllm_result_correlation.py
└── README.md
```

## Video Editing Model Inference Scripts

Navigate to the `inference_Video_Edting_Models` directory and follow the instructions in `inference_Video_Edting_Models/README.md` to configure the environment for each video editing model. Then run the Python script for the model you want to evaluate: it generates a shell script, and running that shell script performs the inference.
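The generate-then-run pattern can be sketched as follows. This is only an illustration, not the code in this repository: `build_inference_script`, the `run_inference.py` entry point, the `--video` and `--prompt` flags, and the example video path are all hypothetical placeholders, and each model's actual command line is defined by its own script.

```python
import shlex


def build_inference_script(jobs, entry_point="run_inference.py"):
    """Return the text of a bash script with one command per (video, prompt) job.

    `entry_point`, `--video` and `--prompt` are hypothetical placeholders,
    not the real interface of any model in this repository.
    """
    lines = ["#!/usr/bin/env bash", "set -e"]
    for video_path, prompt in jobs:
        # shlex.quote keeps prompts containing spaces or quotes shell-safe.
        lines.append(
            f"python {shlex.quote(entry_point)} "
            f"--video {shlex.quote(video_path)} "
            f"--prompt {shlex.quote(prompt)}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job list; write the result to a .sh file and run it with bash.
    print(build_inference_script([("videos/000000.mp4", "a cat wearing a hat")]))
```

Writing all commands to one script first makes it easy to inspect or edit the batch before launching a long inference run.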

## Human Annotation Results and Inter-Annotator Agreement Calculation Script

The human annotation results for 1,280 inference samples are stored in the `annotations` directory, one file per annotator: `data_worker_{i}.csv` holds the annotations of annotator `i` (personal information has been removed).

You can calculate the inter-annotator agreement for the four annotators based on our proposed criteria by running the `inter-annotator-agreement.py` script:

```
python inter-annotator-agreement.py

# output:
----------------------------------------
Metrics for Textual Faithfulness:
Averaged Kendall's τc: 0.6378 ± 0.0749
Averaged Spearman’s ρ: 0.7058 ± 0.0833
Krippendorff’s α: 0.6960
----------------------------------------
Metrics for Frame Consistency:
Averaged Kendall's τc: 0.6510 ± 0.0207
Averaged Spearman’s ρ: 0.7298 ± 0.0225
Krippendorff’s α: 0.6687
----------------------------------------
Metrics for Video Fidelity:
Averaged Kendall's τc: 0.6125 ± 0.0297
Averaged Spearman’s ρ: 0.6938 ± 0.0299
Krippendorff’s α: 0.6625
```
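For readers who want to see how such averaged pairwise figures can be obtained, here is a minimal pure-Python sketch. It is not the code of `inter-annotator-agreement.py`: the function names are illustrative, the use of the population standard deviation for the `±` term is an assumption, and Krippendorff's α is omitted.

```python
from itertools import combinations
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks.

    Undefined (division by zero) if either annotator gave a constant rating.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def kendall_tau_c(x, y):
    """Stuart's tau-c: 2(P - Q) / (n^2 (m - 1) / m), m = min(#distinct x, #distinct y)."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    m = min(len(set(x)), len(set(y)))
    return 2 * (concordant - discordant) / (n ** 2 * (m - 1) / m)


def pairwise_average(ratings, metric):
    """Mean and population std of `metric` over all annotator pairs (assumed convention)."""
    scores = [metric(a, b) for a, b in combinations(ratings, 2)]
    return mean(scores), pstdev(scores)
```

With four annotators, `pairwise_average` evaluates the metric on six annotator pairs. Here `ratings` would be a list of four equally long score lists for one criterion, aligned by sample.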

## MLLM Score Calculation and Correlation with Human Annotations

The `inference_MLLMs` directory contains the inference scripts that prompt MLLMs to score video editing results automatically according to the criteria guidelines. We provide examples for two representative MLLMs, LLaVA-OneVision-7B and VideoLLaMA2; inference scripts for other MLLMs follow a similar structure.

The `MLLM_outputs` directory contains the inference results of three MLLMs (Gemini-pro, LLaVA-OneVision-7B, and VideoLLaMA2), stored in JSON format. Each MLLM has generated 1,280 scoring results for the video editing task.

You can calculate the correlation between MLLM-generated scores and human annotations, as well as the statistics for unmatched samples, by running the `mllm_result_correlation.py` script:

```
python mllm_result_correlation.py

# output:
----------------------------------------
VideoLLaMA2:
Unmatched Textual Faithfulness Count: 11
Unmatched Frame Consistency Count: 209
Unmatched Video Fidelity Count: 339
0.42 & 0.45 & 0.36
0.22 & 0.21 & 0.16
0.11 & 0.11 & 0.09
----------------------------------------
Gemini-pro:
Unmatched Textual Faithfulness Count: 187
Unmatched Frame Consistency Count: 360
Unmatched Video Fidelity Count: 181
0.37 & 0.34 & 0.27
0.25 & 0.27 & 0.21
0.26 & 0.29 & 0.23
----------------------------------------
LLaVA-OneVision-7B:
Unmatched Textual Faithfulness Count: 0
Unmatched Frame Consistency Count: 0
Unmatched Video Fidelity Count: 0
0.49 & 0.48 & 0.39
0.17 & 0.18 & 0.14
0.07 & 0.07 & 0.06

```
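The sketch below illustrates one plausible way to pair MLLM scores with human annotations while counting unmatched samples. It rests on assumptions that this README does not confirm: that "unmatched" means no score could be parsed from the MLLM response, and that a response contains text such as `Textual Faithfulness: 4`. The function names and the response format are hypothetical, and the actual JSON schema in `MLLM_outputs` may differ.

```python
import re


def extract_score(response, criterion):
    """Return the integer score following `criterion:` in the response, or None.

    Assumes a hypothetical response format such as "Textual Faithfulness: 4".
    """
    pattern = re.escape(criterion) + r"\s*:\s*(\d+)"
    match = re.search(pattern, response, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def pair_with_human(responses, human_scores, criterion):
    """Keep samples with a parsable MLLM score and count the rest as unmatched."""
    mllm, human, unmatched = [], [], 0
    for response, human_score in zip(responses, human_scores):
        score = extract_score(response, criterion)
        if score is None:
            unmatched += 1  # assumed meaning of "unmatched": no score found
        else:
            mllm.append(score)
            human.append(human_score)
    return mllm, human, unmatched
```

The two returned lists stay aligned by sample, so they can be passed directly to a rank-correlation function, while the unmatched count shows how many samples were excluded from that correlation.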

