Keywords: 3D-aware visual editing, visual compositing, image editing, disentangled object control
TL;DR: BlenderFusion combines the accurate 3D geometric control of Blender with a generative compositor (adapted from Stable Diffusion) to enable precise geometry editing and versatile visual composition.
Abstract: We present BlenderFusion, a generative visual compositing framework that recomposes objects, camera, and background to synthesize new scenes. It follows a layering-editing-compositing pipeline that (i) segments and converts visual inputs into editable 3D entities (layering), (ii) edits them in Blender with 3D-grounded control (editing), and (iii) fuses them into a coherent scene using a generative compositor (compositing).
The generative compositor extends a pre-trained diffusion model to process the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, which enables flexible modifications such as background replacement; and (ii) simulated object jittering, which facilitates disentangled control over objects and the camera.
Extensive experiments on synthetic and real-world datasets show that BlenderFusion significantly outperforms prior methods in precise 3D-aware control and complex compositional scene editing. The framework also generalizes to unseen data and fine-grained editing operations beyond the training distribution.
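As a rough illustration of the layering-editing-compositing pipeline described in the abstract, the Python sketch below stubs out the three stages. The names (Entity3D, layering, editing, compositing) are hypothetical placeholders rather than the authors' code, and the segmentation, Blender, and diffusion steps are elided.

from dataclasses import dataclass

@dataclass
class Entity3D:
    # Editable 3D entity lifted from a segmented object (placeholder fields).
    name: str
    pose: tuple = (0.0, 0.0, 0.0)  # object translation in world coordinates

def layering(image) -> list:
    # Segment the input and lift each object into an editable 3D entity.
    # The paper uses segmentation and depth estimation; stubbed here.
    return [Entity3D(name="object_0")]

def editing(entities: list, edits: dict) -> list:
    # Apply 3D-grounded edits (here: object translations), as one would in Blender.
    for e in entities:
        dx, dy, dz = edits.get(e.name, (0.0, 0.0, 0.0))
        x, y, z = e.pose
        e.pose = (x + dx, y + dy, z + dz)
    return entities

def compositing(source_render, target_render):
    # Fuse the source and edited (target) renders into a coherent image with the
    # generative compositor (a fine-tuned diffusion model); identity stub here.
    return target_render

entities = layering(image=None)
edited = editing(entities, edits={"object_0": (0.5, 0.0, 0.0)})
result = compositing(source_render=None, target_render=edited)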
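The two training strategies (source masking and simulated object jittering) might be realized roughly as in the NumPy sketch below; the probabilities, jitter scale, and function names are illustrative assumptions, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

def source_masking(source_img, fg_mask, p=0.5):
    # (i) Source masking: with probability p, blank out the source background so the
    # compositor cannot copy it verbatim, which supports edits such as background
    # replacement at inference time.
    if rng.random() < p:
        return source_img * fg_mask[..., None]
    return source_img

def object_jittering(object_pose, p=0.5, scale=0.1):
    # (ii) Simulated object jittering: with probability p, perturb the object pose in
    # the source scene while keeping the camera fixed, so object and camera control
    # stay disentangled.
    if rng.random() < p:
        return object_pose + rng.normal(0.0, scale, size=object_pose.shape)
    return object_pose

# Toy usage on dummy data.
src = rng.random((64, 64, 3))
mask = (rng.random((64, 64)) > 0.5).astype(src.dtype)
src = source_masking(src, mask)
pose = object_jittering(np.zeros(3))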
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3529