# Supplementary Material: How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

<p align="center">
    <img src="https://i.imgur.com/waxVImv.png" alt="Image">
</p>

> **How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs** <br>


[![Anonymous Dataset](https://img.shields.io/badge/Dataset-Access-<COLOR>)](https://drive.google.com/drive/folders/1t2-DnLhJpchzKgW-2jWmVIu75wzp9L5W?usp=sharing)


## Getting started with CVRR-ES

### Downloading and Setting Up CVRR-ES Dataset

Set up the CVRR-ES dataset by following the steps below.

1) Download the CVRR-ES dataset [using this link (zipped)](https://drive.google.com/drive/folders/1t2-DnLhJpchzKgW-2jWmVIu75wzp9L5W?usp=sharing). The CVRR-ES benchmark consists of 2400 open-ended question-answer (QA) pairs spanning 214 unique videos and covering 11 diverse evaluation dimensions.
After unzipping, the CVRR-ES dataset structure looks like the following:

```
CVRR-ES/
|–– interpretation_of_visual_context/
|   |–– annotations_interpretation_of_visual_context.json
|   |–– captions_interpretation_of_visual_context.json
|   |–– 163.mp4
|   |–– ... # remaining videos
|–– partial_actions/
|   |–– annotations_partial_actions.json
|   |–– captions_partial_actions.json
|   |–– 121.mp4
|   |–– ... # remaining videos
|–– unusual_and_physically_anomalous_activities/
|   |–– annotations_unusual_and_physically_anomalous_activities.json
|   |–– captions_unusual_and_physically_anomalous_activities.json
|   |–– 101.mp4
|   |–– ... # remaining videos
... # remaining video-evaluation dimension folders
```

Here, each folder corresponds to a single video evaluation dimension and contains annotations (QA pairs and captions) alongside videos. 
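
As a quick sanity check after unzipping, the layout above can be walked with a short script. This is a minimal sketch, not part of the released code: the function name `summarize_dimensions` is hypothetical, and because the JSON schema of the annotation files is not described here, the script only counts top-level entries rather than interpreting them.

```python
import json
from pathlib import Path


def summarize_dimensions(root):
    """Return {dimension_folder: (num_mp4_videos, num_top_level_annotation_entries)}.

    Assumes the layout shown above: one folder per evaluation dimension,
    each holding .mp4 videos and one annotations_*.json file.
    """
    summary = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # skip stray files next to the dimension folders
        num_videos = len(list(folder.glob("*.mp4")))
        # Match by prefix rather than exact name to tolerate naming differences.
        ann_path = next(folder.glob("annotations_*.json"), None)
        num_entries = None
        if ann_path is not None:
            with ann_path.open(encoding="utf-8") as f:
                data = json.load(f)
            # Schema is an unknown here: only count top-level items.
            if isinstance(data, (list, dict)):
                num_entries = len(data)
        summary[folder.name] = (num_videos, num_entries)
    return summary
```

Calling `summarize_dimensions("CVRR-ES")` should list every dimension folder; folders with few or no videos are likely the ones still waiting for SSv2 videos.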

Note that videos taken from the [Something-Something V2 Dataset](https://developer.qualcomm.com/software/ai-datasets/something-something) (SSv2) are not included in the zipped folder due to copyright policies. To complete the dataset:

2) Download the SSv2 dataset from the [official website](https://developer.qualcomm.com/software/ai-datasets/something-something) (it is publicly available). You will be prompted to register by creating an account.

3) Identify the SSv2 videos used in CVRR-ES by retrieving the videos whose IDs are listed in [this CSV file](assets/ssv2_videos.csv).
4) Rename the videos following the mapping in that file and add them to their respective evaluation dimension folders in the unzipped CVRR-ES folder.
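
Steps 3 and 4 can be scripted along the following lines. This is only a sketch under stated assumptions: the column names `ssv2_id`, `cvrr_name` and `dimension` are hypothetical placeholders (check the actual header of `assets/ssv2_videos.csv` and pass the real names), and the function `place_ssv2_videos` is not part of the repository. Videos are copied without transcoding, so the source file extension is kept.

```python
import csv
import shutil
from pathlib import Path


def place_ssv2_videos(mapping_csv, ssv2_dir, cvrr_root,
                      id_col="ssv2_id", name_col="cvrr_name", dim_col="dimension"):
    """Copy SSv2 videos into CVRR-ES dimension folders under their new names.

    The default column names are hypothetical; adjust them to the real CSV header.
    Returns the list of SSv2 ids for which no source video was found.
    """
    missing = []
    with open(mapping_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            video_id = row[id_col].strip()
            # Look for any file named <id>.<ext> in the SSv2 download folder.
            matches = sorted(Path(ssv2_dir).glob(f"{video_id}.*"))
            if not matches:
                missing.append(video_id)
                continue
            src = matches[0]
            dest_dir = Path(cvrr_root) / row[dim_col].strip()
            dest_dir.mkdir(parents=True, exist_ok=True)
            # Use the new name from the mapping, keep the source extension.
            new_stem = Path(row[name_col].strip()).stem
            shutil.copy2(src, dest_dir / f"{new_stem}{src.suffix}")
    return missing
```

A non-empty return value from `place_ssv2_videos(...)` means some listed IDs were not found in the SSv2 download and the dataset is still incomplete.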

### Evaluating Video-LMMs on CVRR-Evaluation Suite
To evaluate Video-LMMs on the CVRR-ES benchmark, follow these steps:

#### 0) Installation
Follow the instructions in [INSTALL.md](assets/INSTALL.md) to install packages and model weights required to run the sample Video-LMM codes for evaluation. 

#### 1) Generating Predictions for CVRR-ES dataset from Video-LMMs

For each QA pair, we generate answers from Video-LMMs in an autoregressive manner. Predictions are generated using either standard prompting (i.e., question only) or our Dual-Step Contextual Prompting (DSCP) technique. Follow [PREDICTIONS.md](assets/PREDICTIONS.md) for sample code for generating answers using TimeChat, Video-LLaVA, GPT4-Vision and Gemini-Vision-Pro.

#### 2) Comparing the Predicted Answers with Ground-Truth Answers using LLM-Assisted evaluation
Once the answer predictions are generated in step 1, we use an LLM as a judge to quantify the correctness of each Video-LMM prediction for every question in the CVRR-Evaluation Suite. Please follow the instructions in [LLM_SCORING.md](assets/LLM_SCORING.md) for using LLM-assisted evaluation.
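
Once the judge has labeled each prediction, the verdicts can be aggregated per evaluation dimension. The sketch below is illustrative only: the output format of the scoring scripts is not described here, so it assumes each judged question has already been reduced to a `(dimension, is_correct)` pair, and the helper name `accuracy_by_dimension` is hypothetical.

```python
from collections import defaultdict


def accuracy_by_dimension(verdicts):
    """Compute per-dimension accuracy from (dimension, is_correct) pairs.

    The pair format is an assumption; adapt the loop to the actual judge output.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for dimension, is_correct in verdicts:
        totals[dimension] += 1
        correct[dimension] += bool(is_correct)
    # Fraction of questions judged correct in each dimension.
    return {dimension: correct[dimension] / totals[dimension] for dimension in totals}
```

Averaging the returned values weights every dimension equally, whereas pooling all verdicts weights every question equally; which of the two is reported should be stated alongside the numbers.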
<hr />

## Additional Assets for the LLM-based QA Generation Process

#### Generating LLM-based question-answer pairs from video-caption pairs for CVRR-ES

The first version of the CVRR-ES dataset is already finalized. However, for additional reference, we are providing code snippets alongside LLM prompts that we used to generate the initial set of QA pairs.

Please refer to [QA_GENERATION.md](assets/QA_GENERATION.md) for instructions and sample code on generating question-answer pairs for CVRR-ES videos using an LLM.

<hr />

## License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The videos in CVRR-ES dataset are collected from public academic benchmarks (refer to [main paper](https://arxiv.org/pdf/2405.03690) for more details) and from YouTube and are for academic research use only. 
By using CVRR-ES, you agree not to use the dataset to cause harm or unfair discrimination. Please note that the data in this dataset may be subject to other agreements. Video copyrights belong to the original dataset providers, video creators, or platforms.
