Abstract: Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting impactful match-cuts is a challenging and resource-intensive process that requires deliberate artistic planning throughout the production pipeline. In this work, we introduce MatchDiffusion, a training-free method that uses text-to-video diffusion models to automatically generate match-cuts. As such, MatchDiffusion is the first method for match-cut generation. Our method leverages an inherent property of diffusion models, whereby the early denoising steps determine the broad appearance of the scene, while the later steps add fine-grained details. Motivated by this property, MatchDiffusion first performs "Joint Diffusion": it initializes generation for two prompts from a shared noise sample and follows a shared denoising path for the first denoising steps. This process results in the two videos sharing structural and motion characteristics. After Joint Diffusion, we conduct "Disjoint Diffusion", allowing the videos' denoising paths to diverge and introduce their unique details. MatchDiffusion thus yields visually coherent videos that are amenable to match-cuts. We demonstrate the effectiveness of our method through user studies and quantitative metrics, showing its potential to democratize match-cut creation.
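To illustrate the two-phase sampling described above, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical single-step denoiser `denoise_step(latent, t, prompt)` standing in for one reverse-diffusion step of a text-to-video model, and it realizes the "shared denoising path" by averaging the two prompts' predictions during the joint phase; the actual mechanism used by MatchDiffusion may differ.

```python
import torch

def match_diffusion(denoise_step, prompt_a, prompt_b,
                    num_steps=50, joint_steps=20,
                    shape=(16, 4, 64, 64), device="cpu", generator=None):
    """Sketch of joint-then-disjoint diffusion sampling for two prompts.

    `denoise_step`, `joint_steps`, and the latent `shape` are illustrative
    placeholders, not part of any specific library API.
    """
    # Both prompts start from the same noise sample.
    shared_noise = torch.randn(shape, device=device, generator=generator)
    latent_a = shared_noise.clone()
    latent_b = shared_noise.clone()

    for t in range(num_steps):
        if t < joint_steps:
            # Joint Diffusion: follow a shared denoising path so the two videos
            # lock in the same coarse structure and motion. Here the shared
            # update is approximated by averaging the per-prompt predictions
            # (an assumption made for this sketch).
            pred_a = denoise_step(latent_a, t, prompt_a)
            pred_b = denoise_step(latent_b, t, prompt_b)
            shared = 0.5 * (pred_a + pred_b)
            latent_a, latent_b = shared.clone(), shared.clone()
        else:
            # Disjoint Diffusion: let each prompt's path diverge so each video
            # adds its own details on top of the shared structure.
            latent_a = denoise_step(latent_a, t, prompt_a)
            latent_b = denoise_step(latent_b, t, prompt_b)

    return latent_a, latent_b
```

The key design choice the sketch highlights is the single hyperparameter splitting the schedule: more joint steps yield stronger structural alignment between the two clips, while more disjoint steps give each prompt more freedom to express its own content.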