Keywords: Video Foundation Model, Compositional Reasoning, Video Scene Graph
TL;DR: To enhance compositional reasoning efficiently, we propose SGCR-Vid, a method designed to effectively leverage video scene graph datasets.
Abstract: Research in Video-Language Models has focused on developing Video Foundation Models (ViFMs) that achieve strong zero-shot performance by scaling video-text pair datasets. Meanwhile, the compositional reasoning abilities of ViFMs have gained increasing attention, leading to a critical question: *Does scaling video-text pairs consistently enhance compositional reasoning?* Based on our finding that simply increasing the dataset size does not necessarily improve compositional reasoning, we explore whether it can instead be enhanced with a small, high-quality dataset rather than dataset scaling. To this end, we focus on video scene graph (VidSG) datasets, which provide rich, structured relational information, and propose SGCR-Vid, a method designed to effectively leverage this information. To evaluate the effectiveness of SGCR-Vid, we apply it to two state-of-the-art ViFMs, demonstrating significant performance improvements on compositional reasoning benchmarks while using less than 0.5% of the pretraining data scale. Our results show that compositional reasoning can be effectively enhanced using an extremely small-scale dataset.
Submission Number: 43