Keywords: Video Understanding, Multimodal LLMs, Fine-grained Motion
Abstract: Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average out or ignore subtle visual cues. Furthermore, while visual prompting has shown promise on static images, its application to the temporal complexities of video, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether MLLMs' inherent capabilities can be unlocked to boost motion perception, and whether distinct visual signatures can be designed to decouple object and camera motion cues. In this study, we introduce $\mathtt{MotionSight}$, a novel zero-shot method that pioneers object-centric visual spotlighting and motion blur as visual prompts to effectively improve fine-grained motion understanding without any training. To convert this into valuable data assets, we curated $\mathtt{MotionVid-QA}$, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, $\Theta(40\mathrm{K})$ video clips, and $\Theta(87\mathrm{K})$ QAs. Experiments show that $\mathtt{MotionSight}$ achieves state-of-the-art performance among open-source models and is competitive with commercial models. Using $\mathtt{MotionVid-QA}$, we fine-tuned $\mathtt{MotionChat}$ from Qwen2.5VL-7B; it attains 48.3\% overall accuracy on FAVOR-Bench, comparable to Qwen2.5VL-72B's 48.1\%. In summary, we present a novel zero-shot method and a large-scale, high-quality dataset specifically for fine-grained motion understanding. All code and annotations will be publicly available.
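As a rough illustration of the two visual prompts named in the abstract (an object-centric spotlight and a motion-blur cue), the minimal sketch below dims everything outside a given bounding box and averages a short window of frames so that moving regions appear blurred. This is not the paper's implementation; the function names, parameters, and dimming factor are assumptions introduced here for illustration only.

```python
# Hypothetical sketch of the two visual prompts described in the abstract:
# an object-centric "spotlight" (dim the background around a bounding box)
# and a motion-blur cue (average neighboring frames). Not the authors' code.
import numpy as np


def spotlight(frame: np.ndarray, box: tuple, dim: float = 0.35) -> np.ndarray:
    """Dim all pixels outside `box` = (x1, y1, x2, y2) to highlight the object."""
    out = frame.astype(np.float32) * dim
    x1, y1, x2, y2 = box
    out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]  # keep the object region at full brightness
    return out.astype(np.uint8)


def motion_blur(frames: list) -> np.ndarray:
    """Average a short window of frames so moving regions appear smeared."""
    return np.mean(np.stack(frames).astype(np.float32), axis=0).astype(np.uint8)


# Example: build a spotlighted key frame plus a blur cue to feed an MLLM
# alongside the text query (bounding box here is arbitrary dummy data).
video = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8) for _ in range(4)]
prompted_frame = spotlight(video[0], box=(80, 60, 200, 180))
blur_cue = motion_blur(video)
```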
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 617