CCC: Enhancing Video Generation via Structured MLLM Feedback

Published: 10 Jun 2025, Last Modified: 11 Jul 2025. Venue: PUT at ICML 2025 Poster. License: CC BY 4.0
Keywords: Video Evaluation, Video Generation
TL;DR: Prompt Agent for Text-to-Video Generation
Abstract: Video generation from natural-language prompts has made impressive strides, but current systems frequently produce outputs that are misaligned with their input descriptions, dropping critical details and hallucinating unintended content. Existing approaches to improving video quality typically rely either on heavyweight post-editing models, which may introduce new artifacts, or on costly fine-tuning of the generator backbone, which limits scalability and accessibility. While multimodal large language models (MLLMs) have demonstrated strong capabilities in diagnosing visual-text misalignment, their use has largely focused on image-level rather than video-level improvement. To address this gap, we introduce *Critique Coach Calibration* (*CCC*), a training-free, test-time prompt-adaptation framework that closes the loop between generation and evaluation. In each iteration, an off-the-shelf MLLM produces a structured critique of a generated video, highlighting misaligned semantics, subject drift, and missing objects, and then reformulates the input prompt based on its own feedback. By repeating this critique–coach cycle, *CCC* improves the generated video without modifying the generator or relying on external editing modules. Empirical results on diverse video scenarios demonstrate that our approach consistently enhances semantic alignment and visual quality.
Submission Number: 28
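
The critique–coach cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `ccc_loop`, `Critique`, `generate`, `critique`, and `coach` are hypothetical, as are the stopping rule (stop when no issues remain or after `max_iters` rounds) and the choice to critique each video against the original prompt. The `toy_*` functions are trivial stand-ins for a real text-to-video generator and an MLLM, included only so that the loop runs end to end.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Critique:
    """Structured feedback on one video (hypothetical schema)."""
    issues: List[str] = field(default_factory=list)


def ccc_loop(
    prompt: str,
    generate: Callable[[str], Any],            # assumed: prompt -> video
    critique: Callable[[str, Any], Critique],  # assumed: (original prompt, video) -> critique
    coach: Callable[[str, Critique], str],     # assumed: (current prompt, critique) -> new prompt
    max_iters: int = 3,                        # assumed iteration budget
) -> Tuple[str, List[Tuple[str, Critique]]]:
    """Refine the prompt at test time; the generator itself is never modified."""
    original = prompt
    history: List[Tuple[str, Critique]] = []
    for _ in range(max_iters):
        video = generate(prompt)
        report = critique(original, video)
        history.append((prompt, report))
        if not report.issues:  # assumed stopping rule: nothing left to fix
            break
        prompt = coach(prompt, report)
    return prompt, history


# Toy stand-ins (not real models): a "video" is just the set of prompt words.
REQUIRED = {"dog", "ball"}


def toy_generate(prompt: str) -> set:
    return set(prompt.lower().split())


def toy_critique(original: str, video: set) -> Critique:
    # Report required objects that are missing from the "video".
    return Critique([w for w in sorted(REQUIRED) if w not in video])


def toy_coach(prompt: str, report: Critique) -> str:
    # Append the missing objects to the prompt.
    return prompt + " " + " ".join(report.issues)
```

In practice, `generate` would wrap a frozen video generator, while `critique` and `coach` would both call the same off-the-shelf MLLM, matching the abstract's description of a model that reformulates the prompt based on its own feedback.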