Abstract: We introduce SELDOM, a method for composably editing scenes: mixing and matching objects with backgrounds, changing cameras, and applying object-centric edits. SELDOM is a 3D-aware diffusion editing method that conditions on sequences of "neural nouns and verbs". Neural nouns represent scene state and are visual features extracted from source image(s). Neural verbs are learnt representations of image edits, formed either by explicitly parsing prompts or by implicitly attending to them. Neural verbs combine with their associated neural nouns to convey object-centric transformations. Finally, a sequence of these tokens is composed with scene background tokens and used as conditioning for a finetuned latent diffusion model. Our factorization affords test-time compositionality, allowing us to compose edited objects from multiple datasets into a single scene. We further demonstrate our model's photo-editing ability: SELDOM can convincingly edit scenes to change object hue and lighting, scale and rotate objects, apply diverse language-based edits, and control camera rotation and translation.
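To make the noun/verb factorization concrete, below is a minimal sketch of the conditioning pipeline, assuming a PyTorch-style interface; all names (EditConditioner, verb_table, fuse) and shapes are illustrative assumptions, not SELDOM's actual API.

```python
# Illustrative sketch only: composes "neural noun" object features with learnt
# "neural verb" edit embeddings, then appends scene background tokens to form
# the conditioning sequence for a latent diffusion model.
import torch
import torch.nn as nn

class EditConditioner(nn.Module):
    def __init__(self, dim: int = 768, num_verbs: int = 16):
        super().__init__()
        # Neural verbs: learnt embeddings, one per edit type (e.g. rotate, recolor).
        self.verb_table = nn.Embedding(num_verbs, dim)
        # Fuses a verb with its associated noun into a single edit token.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, nouns: torch.Tensor, verb_ids: torch.Tensor,
                background: torch.Tensor) -> torch.Tensor:
        # nouns:      (B, K, D) visual features of K objects from the source image(s)
        # verb_ids:   (B, K)    index of the edit applied to each object
        # background: (B, M, D) scene background tokens
        verbs = self.verb_table(verb_ids)                     # (B, K, D)
        edits = self.fuse(torch.cat([nouns, verbs], dim=-1))  # (B, K, D)
        # Object edit tokens and background tokens form one sequence that
        # conditions the finetuned diffusion model (e.g. via cross-attention).
        return torch.cat([edits, background], dim=1)          # (B, K+M, D)

cond = EditConditioner()
tokens = cond(torch.randn(1, 3, 768),     # 3 object nouns
              torch.tensor([[0, 2, 5]]),  # one verb per object
              torch.randn(1, 8, 768))     # background tokens
print(tokens.shape)  # torch.Size([1, 11, 768])
```

Because each object carries its own noun/verb pair, tokens from different sources can be concatenated freely at test time, which is what enables composing edited objects from multiple datasets into a single scene.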