Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

ICLR 2026 Conference Submission 13771 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Video Generation, Multi-agent
Abstract: Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality and diverse videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We introduce VideoRepair, the first self-correcting, training-free, and model-agnostic video refinement framework that automatically detects fine-grained text–video misalignments and performs targeted, localized corrections. Our key insight is that even misaligned videos usually contain correctly rendered regions that should be preserved rather than regenerated. Building on this observation, VideoRepair performs region-preserving refinement in three stages: (i) misalignment detection, where systematic MLLM-based evaluation with automatically generated spatio-temporal questions identifies faithful and misaligned regions; (ii) refinement planning, which preserves correctly generated entities, segments their regions across frames, and constructs targeted prompts for the misaligned areas; and (iii) localized refinement, which selectively regenerates problematic regions while preserving faithful content through joint optimization of preserved and newly generated areas. This self-correcting, region-preserving strategy converts evaluation signals into actionable guidance for refinement, enabling efficient and interpretable corrections. On two challenging benchmarks, EvalCrafter and T2V-CompBench, VideoRepair achieves substantial improvements over recent baselines across diverse alignment metrics. Comprehensive ablations further demonstrate the efficiency, robustness, and interpretability of our framework.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13771
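
To make the three-stage loop in the abstract concrete, here is a minimal, runnable Python sketch of its control flow. All names in it (Region, detect_misalignment, plan_refinement, localized_refine, video_repair) are hypothetical stand-ins for the components the abstract describes, not the authors' actual API; the stage bodies are stubbed.

```python
# Hypothetical sketch of the VideoRepair self-correcting loop described in
# the abstract. Helper functions and the Region structure are illustrative
# stand-ins, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Region:
    entity: str            # entity mentioned in the prompt, e.g. "red ball"
    masks: list = None     # per-frame segmentation masks (stubbed out here)
    faithful: bool = True  # did the MLLM judge this region as aligned?

def detect_misalignment(video, prompt):
    """Stage (i): MLLM-based evaluation with automatically generated
    spatio-temporal questions; splits entities into faithful / misaligned.
    Stub: pretends one entity was rendered incorrectly."""
    return [Region("dog"), Region("red ball", faithful=False)]

def plan_refinement(regions, prompt):
    """Stage (ii): keep correctly generated entities, segment their regions
    across frames, and build a targeted prompt for the misaligned areas."""
    keep = [r for r in regions if r.faithful]
    fix = [r for r in regions if not r.faithful]
    targeted_prompt = ", ".join(r.entity for r in fix)
    return keep, fix, targeted_prompt

def localized_refine(video, keep, fix, targeted_prompt):
    """Stage (iii): regenerate only the misaligned regions while jointly
    preserving the kept content. Stub: returns the video unchanged."""
    return video

def video_repair(video, prompt, max_rounds=3):
    """Evaluate, plan, refine; stop when all regions pass or budget is spent."""
    for _ in range(max_rounds):
        regions = detect_misalignment(video, prompt)
        if all(r.faithful for r in regions):
            break  # every region passed the MLLM check
        keep, fix, targeted = plan_refinement(regions, prompt)
        video = localized_refine(video, keep, fix, targeted)
    return video

if __name__ == "__main__":
    video_repair(video="video_frames", prompt="a dog chasing a red ball")
```

Note the bounded max_rounds budget: because each round regenerates only misaligned regions while freezing faithful ones, repeated iterations refine locally rather than resampling the whole video.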