TL;DR: FlowDrag leverages 3D mesh deformation to guide stable, geometry-consistent drag-based image editing and introduces VFD-Bench, a benchmark with explicit ground-truth edits.
Abstract: Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from geometric inconsistency: by focusing exclusively on matching user-defined points, they neglect the broader geometry of the object, leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image and uses an energy function to guide mesh deformation based on the user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into the UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately edits match the intended transformations. To address this, we present the VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames drawn from consecutive frames of video datasets. FlowDrag outperforms existing drag-based editing methods on both VFD-Bench and DragBench.
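The core idea of energy-guided mesh deformation followed by 2D projection can be illustrated with a minimal sketch. This is not the authors' implementation: the energy below is a simplified stand-in with just two terms (a drag term pulling handle vertices to their targets and a rigidity term keeping edge vectors near their rest-pose values), minimized by plain gradient descent; all function names and parameters are hypothetical.

```python
import numpy as np

def deform_mesh(verts0, edges, handle_idx, targets, lam=1.0, lr=0.05, steps=500):
    """Toy energy-guided deformation (NOT the paper's energy function).

    Minimizes  E(V) = ||V[handle_idx] - targets||^2
                      + lam * sum_{(i,j) in edges} ||(V_i - V_j) - (V0_i - V0_j)||^2
    by gradient descent on the vertex positions V.
    """
    V = verts0.copy()
    for _ in range(steps):
        grad = np.zeros_like(V)
        # drag term: pull handle vertices toward the user-defined targets
        grad[handle_idx] += 2.0 * (V[handle_idx] - targets)
        # rigidity term: penalize deviation of each edge vector from its rest pose
        d = (V[edges[:, 0]] - V[edges[:, 1]]) - (verts0[edges[:, 0]] - verts0[edges[:, 1]])
        np.add.at(grad, edges[:, 0], 2.0 * lam * d)
        np.add.at(grad, edges[:, 1], -2.0 * lam * d)
        V -= lr * grad
    return V

def project_flow_2d(verts0, verts):
    # orthographic projection: keep only the x/y components of each displacement,
    # yielding a 2D flow that could then guide the denoising process
    return (verts - verts0)[:, :2]
```

In the actual method such a 2D flow field would be injected into the UNet denoising loop as guidance; here it is simply returned for inspection.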
Lay Summary: Editing images by simply dragging points—like moving a person’s arm or rotating an animal’s head—can often lead to unrealistic or distorted results. Current methods frequently focus only on the specific points users move, ignoring the object’s overall shape and structure. This leads to what we call the “geometric inconsistency problem,” where edits become unnatural and incoherent.
To solve this, we introduce FlowDrag, a method that incorporates structured 3D mesh deformation into drag-based editing. Specifically, FlowDrag first constructs a 3D mesh representation of the object, then uses carefully calculated mesh deformations to guide image edits. This ensures that object transformations maintain realism and geometric consistency, significantly reducing unnatural distortions and instability.
Additionally, we created a new benchmark dataset called VFD (VidFrameDrag) from real video datasets. VFD provides clearly defined ground-truth transformations between consecutive video frames, enabling more accurate and reliable evaluation and comparison of drag-based editing methods.
Primary Area: Applications->Computer Vision
Keywords: Drag-based editing, Image editing, Diffusion model
Submission Number: 7551