Think Before You Place: Chain-of-Thought Video Editing for Environment-Aware Custom Subject Integration
Keywords: Custom Subject Integration, Vision-Language Models, Direct Preference Optimization
Abstract: Contemporary video editing methods have achieved remarkable visual fidelity for custom subject integration, yet they fundamentally lack the capability to model causally realistic interactions between inserted objects and their environments. This limitation results in physically implausible editing outcomes that violate basic physical laws.
In this work, we present ThinkPlace, an end-to-end framework that addresses these challenges by leveraging Vision-Language Models (VLMs) as a reasoning engine to guide physically aware video editing without explicit physics simulation. Our approach introduces three key innovations. First, we develop a VLM-guided chain-of-thought reasoning pipeline that generates environment-aware guidance tokens and provides physically plausible editing regions for the downstream video diffusion model. Second, we introduce a Spatial Direct Preference Optimization post-training stage that also employs the VLM to enhance the visual naturalness of editing results.
Third, we leverage the VLM for post-evaluation, triggering corrective refinement cycles that progressively improve integration quality.
Extensive experiments demonstrate that ThinkPlace achieves more physically coherent custom subject integration than state-of-the-art solutions. Our work represents a significant step toward bridging the gap between visual quality and physical realism in video editing applications.
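For background, a minimal sketch of the standard Direct Preference Optimization objective (Rafailov et al., 2023), which the Spatial DPO post-training stage mentioned above presumably adapts to spatially grounded preference pairs; the notation below is generic and not taken from this submission:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Here \(y_w\) and \(y_l\) denote the preferred and dispreferred outputs for a given context \(x\) (in this setting, presumably edits judged more or less natural by the VLM), \(\pi_\theta\) is the model being fine-tuned, \(\pi_{\mathrm{ref}}\) is a frozen reference model, and \(\beta\) is a temperature controlling deviation from the reference.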
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2957