# VidHal: Benchmarking Hallucinations in Vision LLMs

VidHal is a benchmark designed to evaluate and analyze video-based hallucinations in Vision-Language Models (VLMs). It features a diverse set of videos covering five key temporal aspects: _Action, Attribute, Object, Event Order_, and _Direction_. To facilitate fine-grained evaluation of video hallucinations, we introduce a novel task of **caption ordering** alongside multiple-choice question answering. 

## Getting Started
### Dataset Download
The annotations and pre-defined randomized option orders for VidHal are located under the `vidhal` folder. The benchmark dataset videos will be released soon.

### Environment Setup
The libraries and tools required to run our evaluation code are listed in `requirements.txt`. Install them with `pip install -r requirements.txt`, along with any additional dependencies your own models need.

## Model Evaluation 
We provide code for inference and evaluation on the VidHal benchmark, which can be adapted to your model's requirements. Our evaluation pipeline consists of two steps: first, generating model predictions on the VidHal benchmark for a specified evaluation task, and second, comparing the predictions to the ground-truth answers.

### Inference 
The source code for generating model predictions on VidHal instances is located in `pipelines/inference`. The skeleton code, including the prompts used in our paper for all evaluation tasks and the interfaces for running inference, is already implemented. To perform inference on VidHal with your model of choice using our pipeline, you can simply override the two methods shown below in `pipelines/inference/base.py`:
```
class VidHalInferencePipeline:
...
    def format_prompt(
        self, 
        main_prompt, 
        options_prompt, 
        system_prompt=None, 
        *args, **kwargs):
        """
        NOTE: Implement this according to your model requirements

        Expected return type:
            prompts (tuple): Consisting of (main_prompt, system_prompt). If only one prompt is used, system prompt can be left optionally empty
        """
        raise NotImplementedError

    def generate_response(
        self, 
        model, 
        video, 
        main_prompt, system_prompt=None,
        generation_config={},
        *args, **kwargs):
        """
        NOTE: Implement this according to your model requirements

        Expected return type:
            response (str) : Response generated by the model.
        """
        raise NotImplementedError
...
```
These two methods specify the prompt format for your model and the response generation logic, respectively. Alternatively, you can write custom inference code by subclassing `VidHalInferencePipeline` and its task-specific derivatives and implementing the two methods in those subclasses. If you take this route, add your custom inference pipelines to `pipelines/inference/__init__.py` so that our driver scripts can load them. Here's an example:
```
def get_inference_pipeline(name, task) -> VidHalInferencePipeline:
    return {
        ...
        "my_model": {
            "mcqa": MyMCQAInferencePipeline,
            "naive_ordering": MyNaiveOrderingInferencePipeline,
            "relative_ordering": MyRelativeOrderingInferencePipeline
        },
        ...
    }[name][task]
```
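For illustration, a subclass of the kind registered above might look like the sketch below. The base class here is only a minimal stand-in so that the snippet runs by itself; in the repository you would import `VidHalInferencePipeline` from `pipelines/inference/base.py` instead. `EchoModel`, its `generate` method, and the prompt layout are hypothetical placeholders and not part of VidHal.

```python
class VidHalInferencePipeline:
    """Stand-in for the base class in pipelines/inference/base.py (assumption)."""

    def format_prompt(self, main_prompt, options_prompt, system_prompt=None, *args, **kwargs):
        raise NotImplementedError

    def generate_response(self, model, video, main_prompt, system_prompt=None,
                          generation_config=None, *args, **kwargs):
        raise NotImplementedError


class EchoModel:
    """Hypothetical model: returns the letter of the first option found in the prompt."""

    def generate(self, video, prompt, system_prompt=None, **generation_config):
        for line in prompt.splitlines():
            if line[:2] in ("A.", "B.", "C."):
                return line[0]
        return ""


class MyMCQAInferencePipeline(VidHalInferencePipeline):
    def format_prompt(self, main_prompt, options_prompt, system_prompt=None, *args, **kwargs):
        # Join the question and its options into a single user prompt;
        # fall back to an empty system prompt if none is supplied.
        user_prompt = f"{main_prompt}\n{options_prompt}"
        return user_prompt, system_prompt or ""

    def generate_response(self, model, video, main_prompt, system_prompt=None,
                          generation_config=None, *args, **kwargs):
        # Create the config per call to avoid sharing a mutable default.
        generation_config = generation_config or {}
        response = model.generate(video, main_prompt, system_prompt, **generation_config)
        return str(response).strip()


pipeline = MyMCQAInferencePipeline()
prompt, system = pipeline.format_prompt(
    "Which caption matches the video?", "A. ...\nB. ...\nC. ..."
)
answer = pipeline.generate_response(
    EchoModel(), video=None, main_prompt=prompt, system_prompt=system
)
```

A real implementation would replace `EchoModel.generate` with your model's own generation call and apply whatever chat template it expects inside `format_prompt`.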
Our inference pipeline code automatically loads the specified models for their corresponding inference tasks. To add your model to the model repository, place your model files in the `models` directory and update the `load_model` function in `models/__init__.py` accordingly.

Finally, model responses can be generated by running `inference.py` with the required arguments. An example run is provided below: 
```
python inference.py \
    --model <my_model> \
    --task <task> \
    --annotations_path <annotations_path> \
    --videos_path <videos_path> \
    --save_path <save_path> 
```
where `<task>` specifies the evaluation task to be run and is selected from: `mcqa`, `naive_ordering`, or `relative_ordering`.

Command-line scripts for running `inference.py` with the desired arguments are also provided in the `scripts/inference` directory. `scripts/<task>/run_inference_random.sh` presents an example for generating random predictions, which can be referenced to create your own driver script. We additionally provide several examples for running inference with selected VLLMs under `scripts/inference`.
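If you prefer a Python driver over shell scripts, the sketch below assembles one `inference.py` command per task using only the flags documented above. The helper name `build_inference_command` and all paths are hypothetical placeholders, and the snippet only prints the commands rather than running them.

```python
import shlex
import sys

TASKS = ("mcqa", "naive_ordering", "relative_ordering")


def build_inference_command(model, task, annotations_path, videos_path, save_path):
    """Build the argument list for one inference.py run using the documented flags."""
    if task not in TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {TASKS}")
    return [
        sys.executable, "inference.py",
        "--model", model,
        "--task", task,
        "--annotations_path", annotations_path,
        "--videos_path", videos_path,
        "--save_path", save_path,
    ]


# Placeholder paths (hypothetical); substitute your own locations.
commands = [
    build_inference_command(
        "my_model", task,
        "path/to/annotations", "path/to/videos", f"path/to/predictions_{task}",
    )
    for task in TASKS
]
for command in commands:
    # To execute instead of printing: subprocess.run(command, check=True)
    print(shlex.join(command))
```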

### Evaluation
Once predictions for the selected evaluation task are generated, you can evaluate these responses by running `evaluate.py`. An example run is shown below:
```
python evaluate.py \
    --task <task> \
    --annotations_path <annotations_path> \
    --predictions_path <path_to_my_model_predictions> # This should be the same as <save_path> in inference
```
As with the inference stage, command-line scripts for running `evaluate.py` are provided in the `scripts/evaluation` directory.
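To make the comparison step more concrete, the sketch below shows the kind of check involved for the `mcqa` task: the fraction of predictions that match the ground-truth option. The dictionary layout (video ID mapped to an option letter) and the function name `mcqa_accuracy` are assumptions for illustration and may differ from the actual annotation and prediction files. `evaluate.py` remains the reference implementation, and the ordering tasks are not covered here.

```python
def mcqa_accuracy(predictions, answers):
    """Fraction of annotated videos whose predicted option matches the ground truth.

    Videos missing from `predictions` count as incorrect.
    """
    if not answers:
        raise ValueError("No ground-truth answers provided")
    correct = sum(
        1
        for video_id, answer in answers.items()
        if predictions.get(video_id, "").strip().upper() == answer.strip().upper()
    )
    return correct / len(answers)


# Toy data with hypothetical IDs and labels
answers = {"video_1": "A", "video_2": "C", "video_3": "B", "video_4": "A"}
predictions = {"video_1": "A", "video_2": "b", "video_3": "B"}
accuracy = mcqa_accuracy(predictions, answers)  # 2 of the 4 videos are answered correctly
```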
