Keywords: Image editing; Diffusion model
Abstract: Rectified flow text-to-image models have shown remarkable progress. However, editing complex scenes that contain multiple objects remains challenging because of semantic entanglement and structural inconsistency. To address this, we propose a dual-domain framework that jointly refines the editing trajectory in the time domain and adapts it in the frequency domain. First, we design a Starting Point Optimization (SPO) strategy, which selects the editing starting point according to the structural complexity of each image. Second, we introduce a Trajectory Optimization (TO) strategy. In the time domain, it performs semantic-aware vector orthogonalization to suppress source bias while preserving target semantics. In the frequency domain, it adaptively re-weights high- and low-frequency residuals according to stage-specific spectral characteristics. Furthermore, we leverage the frequency-aware capabilities of MM-DiT to dynamically inject structural priors from the source image at different denoising steps. Our method allows users to add, replace, or modify multiple objects, making it highly efficient for editing complex scenes. Experiments show that our method significantly outperforms existing image editing methods and achieves higher user preference in human evaluations.
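The two Trajectory Optimization operations named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes that "vector orthogonalization" means removing from a target vector its projection onto a source vector, and that frequency re-weighting means scaling the low and high bands of a residual's 2D Fourier spectrum separately. The function names, the radial `cutoff`, and the weights `w_low` and `w_high` are all hypothetical.

```python
import numpy as np

def orthogonalize(v_target, v_source, eps=1e-8):
    """Remove from v_target its component along v_source (vector rejection).

    Assumption: this is one plausible reading of "vector orthogonalization";
    the abstract does not specify the exact operation.
    """
    t = v_target.ravel()
    s = v_source.ravel()
    coeff = np.dot(t, s) / (np.dot(s, s) + eps)
    return v_target - coeff * v_source

def reweight_frequencies(residual, w_low, w_high, cutoff=0.25):
    """Scale the low- and high-frequency parts of a 2D residual separately.

    Assumption: a hard radial cutoff (in cycles per sample) splits the bands;
    the weights would vary per denoising stage in a stage-aware scheme.
    """
    h, w = residual.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    weights = np.where(radius <= cutoff, w_low, w_high)
    spectrum = np.fft.fft2(residual)
    # The mask is symmetric in frequency, so the inverse transform of a real
    # input stays real up to rounding error; drop the tiny imaginary part.
    return np.real(np.fft.ifft2(spectrum * weights))
```

Under these assumptions, `orthogonalize(v_tgt, v_src)` returns a vector with (numerically) zero inner product with `v_src`, and `reweight_frequencies(r, 1.0, 1.0)` returns `r` unchanged, which gives a quick sanity check before trying stage-dependent weights.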
Primary Area: generative models
Submission Number: 12961