Keywords: Video Generation, Multi-agent
Abstract: Recent text-to-video (T2V) diffusion models have made remarkable progress in
generating high-quality and diverse videos. However, they often struggle to align
with complex text prompts, particularly when multiple objects, attributes, or spatial
relations are specified. We introduce VideoRepair, the first self-correcting,
training-free, and model-agnostic video refinement framework that automatically
detects fine-grained text–video misalignments and performs targeted, localized
corrections. Our key insight is that even misaligned videos usually contain correctly
rendered regions that should be preserved rather than regenerated. Building on this
observation, VideoRepair implements a novel region-preserving refinement strategy
with three stages: (i) misalignment detection, where systematic MLLM-based evaluation
with automatically generated spatio-temporal questions identifies faithful
and misaligned regions; (ii) refinement planning, which identifies correctly generated
entities to preserve, segments their regions across frames, and constructs targeted prompts
for misaligned areas; and (iii) localized refinement, which selectively regenerates
problematic regions while preserving faithful content through joint optimization
of preserved and newly generated areas. This self-correcting, region-preserving
strategy converts evaluation signals into actionable guidance for refinement, enabling
efficient and interpretable corrections. On two challenging benchmarks,
EvalCrafter and T2V-CompBench, VideoRepair achieves substantial improvements
over recent baselines across diverse alignment metrics. Comprehensive
ablations further demonstrate the efficiency, robustness, and interpretability of our
framework.
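To make the three-stage loop concrete, here is a minimal Python sketch of the pipeline the abstract describes. It is an illustrative reading under stated assumptions, not the authors' implementation: every name and signature (T2VModel, MLLM, Segmenter, video_repair, the evaluate/track/inpaint calls, and the mask semantics) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Any, Protocol

Video = Any  # stand-in for a frame tensor / video array


@dataclass
class EvalReport:
    """MLLM answers aggregated into faithful vs. misaligned entity sets."""
    faithful_entities: list[str]
    misaligned_entities: list[str]

    def fully_aligned(self) -> bool:
        return not self.misaligned_entities


class T2VModel(Protocol):
    def generate(self, prompt: str) -> Video: ...
    def inpaint(self, video: Video, keep_mask: Any, prompt: str) -> Video: ...


class MLLM(Protocol):
    def evaluate(self, video: Video, questions: list[str]) -> EvalReport: ...


class Segmenter(Protocol):
    def track(self, video: Video, entities: list[str]) -> Any: ...


def generate_spatiotemporal_questions(prompt: str) -> list[str]:
    # Placeholder: the paper derives questions about objects, attributes,
    # and spatial relations automatically from the prompt.
    return [f"Does the video faithfully depict: {prompt}?"]


def build_refinement_prompt(prompt: str, misaligned: list[str]) -> str:
    # Placeholder: a targeted prompt restricted to the misaligned entities.
    return f"{prompt} (regenerate: {', '.join(misaligned)})"


def video_repair(prompt: str, t2v: T2VModel, mllm: MLLM,
                 segmenter: Segmenter, max_rounds: int = 3) -> Video:
    video = t2v.generate(prompt)
    for _ in range(max_rounds):
        # (i) Misalignment detection: MLLM answers spatio-temporal questions,
        # splitting the scene into faithful and misaligned entities.
        report = mllm.evaluate(video, generate_spatiotemporal_questions(prompt))
        if report.fully_aligned():
            break
        # (ii) Refinement planning: segment and track regions of the faithful
        # entities across frames so they can be preserved.
        keep_mask = segmenter.track(video, report.faithful_entities)
        # (iii) Localized refinement: regenerate everything outside keep_mask
        # under a targeted prompt (assumed inpainting semantics).
        video = t2v.inpaint(video, keep_mask=keep_mask,
                            prompt=build_refinement_prompt(
                                prompt, report.misaligned_entities))
    return video
```

Because the loop is training-free, the same `video_repair` driver could in principle wrap any T2V backbone, MLLM evaluator, and video segmenter, which is what the abstract's "model-agnostic" claim suggests.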
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13771