Keywords: Video Foundation Model, Compositional Reasoning, Video Scene Graph
TL;DR: To enhance compositional reasoning efficiently, we propose SGCR-Vid, a method designed to effectively leverage video scene graph datasets.
Abstract: Research in Video-Language Models has focused on developing Video Foundation Models (ViFMs) that achieve strong zero-shot performance by scaling video-text pair datasets. Meanwhile, the compositional reasoning abilities of ViFMs have gained increasing attention, leading to a critical question: *Does scaling video-text pairs consistently enhance compositional reasoning?* Based on our finding that simply increasing the dataset size does not necessarily improve compositional reasoning, we explore whether it can instead be enhanced with a small, high-quality dataset rather than dataset scaling. To this end, we focus on video scene graph (VidSG) datasets, which provide rich, structured relational information, and propose SGCR-Vid, a method designed to effectively leverage this information. To evaluate the effectiveness of SGCR-Vid, we apply it to two state-of-the-art ViFMs, demonstrating significant performance improvements on compositional reasoning benchmarks while using less than 0.5% of the pretraining data scale. Our results show that compositional reasoning can be effectively enhanced using an extremely small-scale dataset.
Submission Number: 43