GenCue: Generation-Oriented Video Captions for High-Fidelity Text-to-Video

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Video Large Language Models, Video Detailed Captioning, Text-to-Video Generation
Abstract: The performance of text-to-video (T2V) models depends critically on the quality of their training captions. However, captions produced by current multimodal large language models (MLLMs) often lack fine-grained visual grounding, temporal coverage, and cinematic expressiveness, limiting trained models’ ability to follow instructions accurately and to reconstruct visual details and dynamics. We present GenCue, a generation-oriented video captioning framework designed to close this gap and enable high-fidelity T2V training. We first introduce the GenCue-SFT-1M and GenCue-RL-8k data suite: the former is a large-scale, schema-guided corpus built with the aid of specialized expert models, while the latter is a carefully curated, high-quality subset that provides precise supervision signals for post-training. Building on this data foundation, we propose a reinforcement learning paradigm with a checklist-based reward that explicitly evaluates key generation dimensions. We further introduce Reference-Augmented GRPO and a prefix-sharing rollout strategy, which together enable effective and efficient long-context optimization. Experiments on T2V captioning and reconstruction benchmarks demonstrate that GenCue significantly outperforms prior approaches, yielding substantial improvements in object coverage, temporal coherence, and cinematic quality.
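The RL recipe described in the abstract (a checklist-based reward scored per generation dimension, combined with GRPO-style group-normalized advantages over a group of rollouts that share a common prompt prefix) can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the checklist dimensions, the `judge_item` heuristic, and all function names here are assumptions for illustration; in the actual framework the per-item judging would presumably be done by an MLLM or a reference-augmented comparison rather than a toy heuristic.

```python
# Minimal sketch: checklist-based reward + GRPO-style group advantages.
# All names (CHECKLIST, judge_item, checklist_reward) are hypothetical.
from statistics import mean, pstdev

# Illustrative checklist dimensions, taken from the abstract's evaluation
# axes: object coverage, temporal coherence, cinematic quality.
CHECKLIST = ["object_coverage", "temporal_coherence", "cinematic_quality"]

def judge_item(caption: str, dim: str) -> float:
    """Placeholder judge scoring one checklist dimension in [0, 1].
    A real system would use an MLLM judge or reference-based checks."""
    cue = {"object_coverage": "object",
           "temporal_coherence": "then",
           "cinematic_quality": "camera"}[dim]
    return min(1.0, caption.lower().count(cue) * 0.5)

def checklist_reward(caption: str) -> float:
    """Aggregate per-dimension checks into a scalar reward."""
    return mean(judge_item(caption, d) for d in CHECKLIST)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: group-normalize rewards as (r - mean)/std."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

if __name__ == "__main__":
    # A group of caption rollouts for one video prompt. With prefix
    # sharing, these rollouts would reuse the KV cache of the shared
    # video/prompt prefix, amortizing long-context rollout cost.
    rollouts = [
        "A camera pans over objects on a table, then zooms in.",
        "Some things are on a table.",
        "The camera tracks an object, then another object appears.",
    ]
    rewards = [checklist_reward(c) for c in rollouts]
    print(list(zip(rewards, grpo_advantages(rewards))))
```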
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11216