FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Published: 18 Sept 2025, Last Modified: 30 Oct 2025NeurIPS 2025 Datasets and Benchmarks Track posterEveryoneRevisionsBibTeXCC BY-NC-SA 4.0
Keywords: video understanding, motion understanding, multimodal large language model, video description
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive video content understanding capabilities but struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, which comprises 1,776 videos from both ego-centric and third-person perspectives and enables assessment through both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we employ the GPT-assisted evaluation and develop a novel cost-efficient LLM-free assessment method, where the latter can enhance benchmarking interpretability and accessibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset of 17,152 videos with fine-grained motion annotations. Finetuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on motion-related tasks across TVBench, MotionBench and our FAVOR-Bench. Our assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools for the community to develop more powerful video understanding models.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/zl2048/FAVOR
Code URL: https://github.com/FAVOR-Bench/FAVOR-Bench
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 1118
Loading