Controlling Video Generation with Vision Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Controllable Video Generation
Abstract: Controlling video generation models typically requires finetuning on video datasets with explicit control labels. However, collecting such datasets is costly, and the control modality in the data inherently restricts the controllability of the trained models. In contrast, vision language models (VLMs) can readily generalize to new tasks through their pretrained knowledge and in-context learning. Motivated by this capability, we introduce Ask-A-Video, a test-time training paradigm that formulates controllable video generation as visual question answering (VQA): a video generator produces video frames, a frozen VLM answers control-related questions, and the VQA loss is backpropagated directly to the video generator. By leveraging the generalization ability of VLMs, Ask-A-Video enables efficient and flexible control of any off-the-shelf video generator without any video data. Empirically, our method improves controllability for both text-to-video and image-to-video models across different families and scales. Compared to adding constraints via prompt extension, Ask-A-Video yields stronger prompt following and more physically plausible dynamics. It also enables fine-grained spatial and motion control through visual prompting. In addition, since our method distills controllability into the model weights, it allows the learned control to be reused for new prompts without additional cost.
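The test-time training loop described in the abstract can be illustrated with a minimal sketch. The `ToyVideoGenerator` and `ToyFrozenVLM` modules below are hypothetical stand-ins (not the submission's actual architectures), and the cross-entropy VQA loss over candidate answers is an assumption; the sketch only shows the gradient flow: frames from a trainable generator are scored by a frozen VLM, and the VQA loss updates the generator's weights.

```python
# Illustrative sketch of the Ask-A-Video paradigm under stated assumptions.
import torch
import torch.nn as nn

class ToyVideoGenerator(nn.Module):
    """Stand-in for an off-the-shelf video generator (trainable at test time)."""
    def __init__(self, latent_dim=64, frames=8, h=32, w=32):
        super().__init__()
        self.net = nn.Linear(latent_dim, frames * 3 * h * w)
        self.shape = (frames, 3, h, w)

    def forward(self, z):
        # Returns a batch of videos shaped (B, T, C, H, W).
        return self.net(z).view(-1, *self.shape)

class ToyFrozenVLM(nn.Module):
    """Stand-in for a frozen VLM that answers a control question about the video."""
    def __init__(self, num_answers=2, frames=8, h=32, w=32):
        super().__init__()
        self.head = nn.Linear(frames * 3 * h * w, num_answers)

    def forward(self, video):
        # Logits over candidate answers to the control question.
        return self.head(video.flatten(1))

generator = ToyVideoGenerator()
vlm = ToyFrozenVLM()
for p in vlm.parameters():          # the VLM stays frozen
    p.requires_grad_(False)

optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)
target_answer = torch.tensor([1])   # index of the desired answer (e.g., "yes")

# Test-time training: the VQA loss on generated frames is backpropagated
# through the frozen VLM into the generator's weights.
for step in range(100):
    z = torch.randn(1, 64)
    video = generator(z)
    logits = vlm(video)
    loss = nn.functional.cross_entropy(logits, target_answer)
    optimizer.zero_grad()
    loss.backward()                 # gradients flow through the frozen VLM
    optimizer.step()
```

Because the update lands in the generator's weights rather than in the sampling procedure, the learned control can be reused for new prompts afterwards, which matches the distillation property the abstract highlights.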
Primary Area: generative models
Submission Number: 11679