# MovieChat SpookyBench Evaluator

A command-line utility for evaluating MovieChat on the SpookyBench temporal pattern recognition dataset.

## Overview

This script evaluates the MovieChat model on temporal pattern recognition tasks from the SpookyBench dataset. MovieChat is a video understanding model that processes videos and answers questions about their content.

## Prerequisites

- Python 3.10+
- PyTorch
- MovieChat dependencies (installed as per MovieChat repository instructions)
- OpenCV for video processing
- Pandas for dataset handling

## Usage

The script supports evaluating videos from the SpookyBench dataset with different configuration options:

```bash
python run_moviechat.py \
  --dataset /path/to/spookybench/directory \
  --csv /path/to/metadata.csv \
  --categories words \
  --use_cot \
  --sample_size 10 \
  --output ./results
```

## Full Command-Line Arguments

```
usage: run_moviechat.py [-h] --dataset DATASET --csv CSV
                      [--categories {words,images,videos} [{words,images,videos} ...]]
                      [--use_cot] [--sample_size SAMPLE_SIZE]
                      [--device DEVICE] [--n_frames N_FRAMES]
                      [--image_size IMAGE_SIZE] [--prompt PROMPT]
                      [--temp_dir TEMP_DIR] [--output OUTPUT]

Run MovieChat on SpookyBench videos

optional arguments:
  -h, --help            show this help message and exit
  --dataset DATASET     Path to SpookyBench dataset directory or a single video file
  --csv CSV             Path to SpookyBench metadata CSV
  --categories {words,images,videos} [{words,images,videos} ...]
                        Categories to process (if not specified, all categories are used)
  --use_cot             Use chain-of-thought prompting instead of direct prompting
  --sample_size SAMPLE_SIZE
                        Number of videos to sample per category (if not specified, all videos are used)
  --device DEVICE       Device to use for inference (default: cuda:0)
  --n_frames N_FRAMES   Number of frames to process (default: 8)
  --image_size IMAGE_SIZE
                        Size of input frames (default: 224)
  --prompt PROMPT       Custom prompt to use for all videos (overrides category-specific prompts)
  --temp_dir TEMP_DIR   Directory to store temporary fragments (default: ./temp_fragments)
  --output OUTPUT       Output directory for results (default: ./results)
```
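For reference, the help text above corresponds roughly to the following `argparse` setup. This is a minimal sketch reconstructed from the help output, not the script's actual source: the function name `build_parser` is hypothetical, and the argument types are assumed from the documented defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (reconstructed sketch)."""
    parser = argparse.ArgumentParser(
        prog="run_moviechat.py",
        description="Run MovieChat on SpookyBench videos",
    )
    # The usage line shows these two without brackets, so they are required.
    parser.add_argument("--dataset", required=True,
                        help="Path to SpookyBench dataset directory or a single video file")
    parser.add_argument("--csv", required=True,
                        help="Path to SpookyBench metadata CSV")
    # None means "use all categories".
    parser.add_argument("--categories", nargs="+",
                        choices=["words", "images", "videos"], default=None)
    parser.add_argument("--use_cot", action="store_true")
    # None means "use all videos"; the int types below are assumptions.
    parser.add_argument("--sample_size", type=int, default=None)
    parser.add_argument("--device", default="cuda:0")
    parser.add_argument("--n_frames", type=int, default=8)
    parser.add_argument("--image_size", type=int, default=224)
    parser.add_argument("--prompt", default=None)
    parser.add_argument("--temp_dir", default="./temp_fragments")
    parser.add_argument("--output", default="./results")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--dataset", "/path/to/spookybench", "--csv", "/path/to/metadata.csv",
     "--categories", "words", "--n_frames", "12"]
)
print(args.categories, args.n_frames, args.device, args.use_cot)
```

Options that are not passed fall back to the defaults listed in the help text, so only `--dataset` and `--csv` are strictly needed.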

## Examples

### Process Videos from a Specific Category

```bash
CUDA_VISIBLE_DEVICES=0 python run_moviechat.py \
  --dataset /path/to/spookybench \
  --csv /path/to/metadata.csv \
  --categories words \
  --use_cot \
  --sample_size 5 \
  --output ./results
```
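The `--categories` and `--sample_size` options together select a subset of the metadata. The following is a minimal sketch of that selection, assuming the metadata CSV has a `category` column; the column names, the `sample_per_category` helper, and the inline CSV are all hypothetical and only illustrate the idea.

```python
import csv
import io
import random
from collections import defaultdict


def sample_per_category(rows, sample_size=None, categories=None, seed=0):
    """Keep rows from the requested categories, sampling at most sample_size per category."""
    by_category = defaultdict(list)
    for row in rows:
        if categories is None or row["category"] in categories:
            by_category[row["category"]].append(row)

    rng = random.Random(seed)  # fixed seed so repeated runs pick the same videos
    sampled = []
    for category in sorted(by_category):
        group = by_category[category]
        if sample_size is not None and sample_size < len(group):
            group = rng.sample(group, sample_size)
        sampled.extend(group)
    return sampled


# Made-up metadata standing in for the real CSV file.
metadata = io.StringIO(
    "video,category\n"
    "w1.mp4,words\n"
    "w2.mp4,words\n"
    "w3.mp4,words\n"
    "i1.mp4,images\n"
)
rows = list(csv.DictReader(metadata))
picked = sample_per_category(rows, sample_size=2, categories=["words"])
print([row["video"] for row in picked])
```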

### Process Videos with Custom Parameters

```bash
CUDA_VISIBLE_DEVICES=0 python run_moviechat.py \
  --dataset /path/to/spookybench \
  --csv /path/to/metadata.csv \
  --categories words \
  --n_frames 12 \
  --image_size 256 \
  --output ./custom_results
```

### Use a Custom Prompt for All Videos

```bash
CUDA_VISIBLE_DEVICES=0 python run_moviechat.py \
  --dataset /path/to/spookybench \
  --csv /path/to/metadata.csv \
  --prompt "What is the hidden message in this temporal pattern? Answer with just the message." \
  --output ./custom_prompt_results
```

## SpookyBench Categories

The SpookyBench dataset contains videos in three categories, each with its own kind of temporal pattern:

1. **Words**: Text encoded through temporal patterns
2. **Images**: Common objects encoded through temporal patterns
3. **Videos**: Movement encoded through temporal patterns

The script selects a category-specific prompt for each video. Pass `--use_cot` to use chain-of-thought prompting instead of direct prompting, which may improve performance.
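The prompt selection described here, together with the `--prompt` override, can be sketched as follows. The prompt texts, the `COT_SUFFIX` string, and the `select_prompt` helper are hypothetical placeholders, not the prompts the script actually uses.

```python
from typing import Optional

# Hypothetical direct prompts, one per SpookyBench category.
DIRECT_PROMPTS = {
    "words": "What word is encoded in this video? Answer with just the word.",
    "images": "What object is encoded in this video? Answer with just the object.",
    "videos": "What movement is encoded in this video? Answer briefly.",
}

# Hypothetical chain-of-thought instruction appended when --use_cot is set.
COT_SUFFIX = (
    " Think step by step about how the pattern changes across frames,"
    " then state your final answer."
)


def select_prompt(category: str, use_cot: bool = False,
                  custom_prompt: Optional[str] = None) -> str:
    """Return the prompt for one video; a custom prompt overrides everything else."""
    if custom_prompt is not None:
        return custom_prompt
    prompt = DIRECT_PROMPTS[category]
    return prompt + COT_SUFFIX if use_cot else prompt


print(select_prompt("words"))
print(select_prompt("images", use_cot=True))
print(select_prompt("videos", custom_prompt="Describe the hidden pattern."))
```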

## Memory Management

MovieChat requires substantial GPU memory. The script includes memory management features:
- Automatic cleanup of temporary files
- Garbage collection between video processing
- Delay between videos to allow for memory recovery
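The three steps above can be sketched as a single cleanup routine called after each video. This is an illustrative sketch, not the script's code: the `cleanup_between_videos` name and the delay value are assumptions, and the `torch.cuda.empty_cache()` call is an extra step that is not listed above. PyTorch is imported optionally so the sketch also runs without it.

```python
import gc
import shutil
import tempfile
import time
from pathlib import Path

try:
    import torch  # optional here; the real script depends on PyTorch
except ImportError:
    torch = None


def cleanup_between_videos(temp_dir: str, delay_seconds: float = 2.0) -> None:
    """Remove temporary fragments, release memory, then pause briefly."""
    # 1. Delete temporary fragments and recreate an empty directory.
    shutil.rmtree(temp_dir, ignore_errors=True)
    Path(temp_dir).mkdir(parents=True, exist_ok=True)
    # 2. Collect unreachable Python objects.
    gc.collect()
    # Extra step (assumption): return cached, unused GPU memory to the driver.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    # 3. Wait before the next video (hypothetical default delay).
    time.sleep(delay_seconds)


# Demo on a throwaway directory containing one fake fragment.
demo_dir = Path(tempfile.mkdtemp()) / "temp_fragments"
demo_dir.mkdir()
(demo_dir / "fragment_0.mp4").write_bytes(b"")
cleanup_between_videos(str(demo_dir), delay_seconds=0.0)
remaining = list(demo_dir.iterdir())
print(remaining)
```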

## Analyzing Results

After running the model on the SpookyBench dataset, you can analyze the results using the provided `analyze_results.py` script. This script compares model predictions against ground truth data and generates accuracy metrics and visualizations.

### Usage

```bash
python analyze_results.py \
  --results /path/to/results.json \
  --csv /path/to/metadata.csv \
  --output ./analysis_output
```

### Command-Line Arguments

```
usage: analyze_results.py [-h] --results RESULTS --csv CSV [--output OUTPUT]

Analyze MovieChat SpookyBench results

optional arguments:
  -h, --help         show this help message and exit
  --results RESULTS  Path to JSON results file generated by run_moviechat.py
  --csv CSV          Path to SpookyBench metadata CSV
  --output OUTPUT    Directory to save analysis results (default: ./analysis)
```

### Example Workflow

1. First, run the model on the SpookyBench dataset:

```bash
CUDA_VISIBLE_DEVICES=0 python run_moviechat.py \
  --dataset /path/to/spookybench \
  --csv /path/to/metadata.csv \
  --categories words \
  --use_cot \
  --output ./results
```

2. Then analyze the results:

```bash
python analyze_results.py \
  --results ./results/moviechat_cot_[timestamp].json \
  --csv /path/to/metadata.csv \
  --output ./analysis_results
```
We also ran a human-in-the-loop evaluation because the dataset is small. For small datasets, we recommend also reviewing the model's responses manually.

### Output

The analysis script generates:

1. A JSON file (`analysis.json`) containing detailed results for each video and category
2. A visualization (`accuracy.png`) showing accuracy metrics for each category and overall performance

This allows you to assess how well the model performs on different categories of temporal patterns in the SpookyBench dataset.
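As an illustration of the per-category and overall accuracy metrics, here is a minimal sketch. It assumes each result record has `category`, `prediction`, and `ground_truth` fields and counts a prediction as correct when it contains the ground truth, ignoring case. The field names, the matching rule, and the sample records are all assumptions; `analyze_results.py` may score responses differently.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy per category plus an 'overall' entry."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        # Assumed rule: case-insensitive substring match against the ground truth.
        truth = record["ground_truth"].strip().lower()
        if truth in record["prediction"].strip().lower():
            correct[category] += 1

    summary = {category: correct[category] / total[category] for category in total}
    summary["overall"] = sum(correct.values()) / sum(total.values()) if total else 0.0
    return summary


# Made-up records for illustration only; not real model output.
records = [
    {"category": "words", "prediction": "The word is HELLO.", "ground_truth": "hello"},
    {"category": "words", "prediction": "I only see noise.", "ground_truth": "world"},
    {"category": "images", "prediction": "A cat", "ground_truth": "cat"},
]
summary = accuracy_by_category(records)
print(summary)
```

A substring rule like this is lenient, so it pairs well with the manual review recommended above.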
