# MovieChat-1K

## Task Description

This repository contains an evaluation dataset for assessing the long-video understanding of video models. It provides human-written question-answer pairs for each video in two modes: global and breakpoint. Model responses are judged by GPT-3.5, and the results are reported as an accuracy and a score.

- GPT-3.5 Evaluation: Answers are rated with the prompts designed by Video-ChatGPT, using `gpt-3.5-turbo-0125` as the judge.
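
As a rough illustration of how the judge's ratings could be turned into the accuracy and score reported here, the sketch below assumes each judge reply is a Python-dict string with a yes/no `pred` field and an integer `score` field, which is the reply shape requested by Video-ChatGPT-style prompts. The function names `parse_judgement` and `aggregate` are hypothetical and are not taken from this repository.

```python
import ast


def parse_judgement(reply: str) -> dict:
    """Parse one judge reply such as "{'pred': 'yes', 'score': 4}".

    Assumption: the judge answers with a Python-dict literal that has
    a 'pred' key (yes/no) and a 'score' key (integer rating).
    """
    verdict = ast.literal_eval(reply.strip())
    return {"pred": str(verdict["pred"]).lower(), "score": int(verdict["score"])}


def aggregate(judgements: list) -> dict:
    """Reduce per-question judgements to an accuracy (%) and a mean score."""
    total = len(judgements)
    if total == 0:
        raise ValueError("no judgements to aggregate")
    correct = sum(1 for j in judgements if j["pred"] == "yes")
    mean_score = sum(j["score"] for j in judgements) / total
    return {"acc": 100.0 * correct / total, "score": mean_score}


replies = ["{'pred': 'yes', 'score': 4}", "{'pred': 'no', 'score': 2}"]
summary = aggregate([parse_judgement(r) for r in replies])
print(summary)
```

Replies that do not parse as a dict literal raise an exception in this sketch; a real pipeline would likely retry or skip them.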

## Groups & Tasks

### Tasks

- `moviechat_global`: Given a video and a question, generate an answer using information from the entire video.
- `moviechat_breakpoint`: Given a video, a specific timestamp, and a question, generate an answer using the video segments that occur before the specified timestamp.
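
The difference between the two tasks comes down to which part of the video the model may see. A minimal sketch of the breakpoint restriction follows, assuming the video is addressed by frame index at a constant frame rate and the timestamp is given in seconds; the helper `frames_before_breakpoint` is hypothetical, and the actual task code may select or sample frames differently.

```python
def frames_before_breakpoint(num_frames: int, fps: float, breakpoint_s: float) -> list:
    """Return indices of frames whose timestamp is strictly before the breakpoint.

    Assumption: frame i is shown at time i / fps seconds.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [i for i in range(num_frames) if i / fps < breakpoint_s]


# Global mode would use every frame; breakpoint mode keeps only the prefix.
all_frames = list(range(10))
visible = frames_before_breakpoint(num_frames=10, fps=2.0, breakpoint_s=2.5)
print(all_frames, visible)
```

Excluding the frame that falls exactly on the timestamp is a choice made for this sketch, not something the task description specifies.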
  
## Model Performance Comparison

| **Model**                  | **Global Acc** | **Global Score** |
|----------------------------|----------------|------------------|
| MovieChat(VideoLLaMA)      | 62.3           | 3.23             |
| MovieChat+(VideoLLaMA)     | 71.2           | 3.51             |
| MovieChat(LLaVA-OneVision) | 79.00          | 4.20             |



## Citation

```bibtex
@inproceedings{song2024moviechat,
  title={Moviechat: From dense token to sparse memory for long video understanding},
  author={Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Chi, Haozhe and Guo, Xun and Ye, Tian and Zhang, Yanting and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18221--18232},
  year={2024}
}
```