Keywords: 3D-aware visual editing, visual compositing, image editing, disentangled object control
TL;DR: BlenderFusion combines the accurate 3D geometric control of Blender with a generative compositor (adapted from Stable Diffusion) to enable precise geometry editing and versatile visual composition.
Abstract: We present BlenderFusion, a generative visual compositing framework that recomposes objects, camera, and background to synthesize new scenes. It follows a layering-editing-compositing pipeline that (i) segments and converts visual inputs into editable 3D entities (layering), (ii) edits them in Blender with 3D-grounded control (editing), and (iii) fuses them into a coherent scene using a generative compositor (compositing).
The generative compositor extends a pre-trained diffusion model to process the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, which enables flexible modifications such as background replacement; and (ii) simulated object jittering, which facilitates disentangled control over objects and the camera.
Extensive experiments on synthetic and real-world datasets show that BlenderFusion significantly outperforms prior methods in precise 3D-aware control and complex compositional scene editing. The framework also generalizes to unseen data and fine-grained editing operations beyond the training distribution.
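As a rough illustration of the layering-editing-compositing pipeline described in the abstract, the Python sketch below stubs out the three stages. The names (Entity3D, layering, editing, compositing) are hypothetical placeholders rather than the authors' code, and the segmentation, Blender, and diffusion steps are elided.

from dataclasses import dataclass

@dataclass
class Entity3D:
    # Editable 3D entity lifted from a segmented object (placeholder fields).
    name: str
    pose: tuple = (0.0, 0.0, 0.0)  # object translation in world coordinates

def layering(image) -> list:
    # Segment the input and lift each object into an editable 3D entity.
    # The paper uses segmentation and depth estimation; stubbed here.
    return [Entity3D(name="object_0")]

def editing(entities: list, edits: dict) -> list:
    # Apply 3D-grounded edits (here: object translations), as one would in Blender.
    for e in entities:
        dx, dy, dz = edits.get(e.name, (0.0, 0.0, 0.0))
        x, y, z = e.pose
        e.pose = (x + dx, y + dy, z + dz)
    return entities

def compositing(source_render, target_render):
    # Fuse the source and edited (target) renders into a coherent image with the
    # generative compositor (a fine-tuned diffusion model); identity stub here.
    return target_render

entities = layering(image=None)
edited = editing(entities, edits={"object_0": (0.5, 0.0, 0.0)})
result = compositing(source_render=None, target_render=edited)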
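The two training strategies (source masking and simulated object jittering) might be realized roughly as in the NumPy sketch below; the probabilities, jitter scale, and function names are illustrative assumptions, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

def source_masking(source_img, fg_mask, p=0.5):
    # (i) Source masking: with probability p, blank out the source background so the
    # compositor cannot copy it verbatim, which supports edits such as background
    # replacement at inference time.
    if rng.random() < p:
        return source_img * fg_mask[..., None]
    return source_img

def object_jittering(object_pose, p=0.5, scale=0.1):
    # (ii) Simulated object jittering: with probability p, perturb the object pose in
    # the source scene while keeping the camera fixed, so object and camera control
    # stay disentangled.
    if rng.random() < p:
        return object_pose + rng.normal(0.0, scale, size=object_pose.shape)
    return object_pose

# Toy usage on dummy data.
src = rng.random((64, 64, 3))
mask = (rng.random((64, 64)) > 0.5).astype(src.dtype)
src = source_masking(src, mask)
pose = object_jittering(np.zeros(3))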
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3529