VideoMB: Steering Representations towards Motion Balanced Caption Generation in Vision-Language Models

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Vision Language Models, Vision Understanding, Motion Perception
TL;DR: We introduce VideoMB, a framework for transforming video language models from being appearance-biased to motion-balanced, by enhancing moving object representation through cross-attention mechanisms and dual-objective fine-tuning.
Abstract: Large Vision-Language Models (VLMs) must balance spatio-temporal understanding against computational efficiency. In standard architectures, this trade-off is largely determined by the balance between visual feature compression and spatial fidelity. We show that this limitation produces a heavy bias towards appearance: models often fail to discern or caption moving objects in a video. To address this, we introduce VideoMB, a novel framework that manipulates feature embeddings to reshape the model's perceptual priorities. VideoMB incorporates cross-attention layers that establish temporal understanding by modeling information flow between consecutive frames. We propose a fine-tuning paradigm that jointly optimizes caption generation with a global matching objective, constraining learned visual representations to exhibit maximal similarity with the embeddings at the positions of moving objects. Our approach is computationally efficient and can be seamlessly integrated into existing models. Extensive experiments demonstrate that VideoMB significantly improves motion-based captioning accuracy, particularly in challenging scenarios involving small or low-resolution moving objects, while maintaining competitive performance on appearance-focused tasks. These findings offer a generalizable solution for steering attention towards desired visual elements, providing fine-grained control over perceptual focus in video understanding tasks.
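The abstract names two mechanisms: cross-attention between consecutive frames, and a dual objective combining caption generation with a matching term that pulls visual representations toward moving-object embeddings. A minimal PyTorch sketch of both ideas is below; the class and function names, the residual connection, the cosine-similarity matching term, and the weighting factor `alpha` are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFrameAttention(nn.Module):
    """Illustrative sketch: tokens of each frame attend to the previous
    frame's tokens, modeling temporal information flow between frames."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, tokens, dim)
        out = [frames[:, 0]]  # first frame has no predecessor
        for t in range(1, frames.shape[1]):
            # query = current frame, key/value = previous frame
            attended, _ = self.attn(frames[:, t], frames[:, t - 1], frames[:, t - 1])
            out.append(frames[:, t] + attended)  # residual connection (assumed)
        return torch.stack(out, dim=1)

def motion_balanced_loss(caption_loss: torch.Tensor,
                         visual_emb: torch.Tensor,
                         motion_emb: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical joint objective: caption loss plus a global matching
    term encouraging similarity between pooled visual features and the
    embeddings at moving-object positions."""
    sim = F.cosine_similarity(visual_emb.mean(dim=1), motion_emb.mean(dim=1), dim=-1)
    matching_loss = (1.0 - sim).mean()
    return caption_loss + alpha * matching_loss
```

In this reading, `motion_emb` would be gathered from the token positions of detected moving objects, so the matching term steers the pooled representation toward motion-relevant content while the caption loss preserves appearance grounding.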
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10764