Abstract: We introduce SELDOM, a method for composably editing scenes: mixing and matching objects with backgrounds, changing cameras, and applying object-centric edits. SELDOM is a 3D-aware diffusion editing method that conditions on sequences of "neural nouns and verbs". Neural nouns represent scene state and are visual features extracted from source image(s). Neural verbs are learnt representations of image edits, formed either by explicitly parsing prompts or by implicitly attending to them. Neural verbs combine with their associated neural nouns to convey object-centric transformations. Finally, a sequence of these tokens is composed with scene background tokens and used as conditioning for a finetuned latent diffusion model. Our factorization affords test-time compositionality, allowing us to compose edited objects from multiple datasets into a single scene. We further demonstrate our model's photo-editing ability: SELDOM can convincingly edit scenes to change object hue and lighting, scale and rotate objects, apply diverse language-based edits, and control camera rotation and translation.
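To make the noun/verb factorization concrete, below is a minimal sketch of the conditioning pipeline, assuming a PyTorch-style interface; all names (EditConditioner, verb_table, fuse) and shapes are illustrative assumptions, not SELDOM's actual API.

```python
# Illustrative sketch only: composes "neural noun" object features with learnt
# "neural verb" edit embeddings, then appends scene background tokens to form
# the conditioning sequence for a latent diffusion model.
import torch
import torch.nn as nn

class EditConditioner(nn.Module):
    def __init__(self, dim: int = 768, num_verbs: int = 16):
        super().__init__()
        # Neural verbs: learnt embeddings, one per edit type (e.g. rotate, recolor).
        self.verb_table = nn.Embedding(num_verbs, dim)
        # Fuses a verb with its associated noun into a single edit token.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, nouns: torch.Tensor, verb_ids: torch.Tensor,
                background: torch.Tensor) -> torch.Tensor:
        # nouns:      (B, K, D) visual features of K objects from the source image(s)
        # verb_ids:   (B, K)    index of the edit applied to each object
        # background: (B, M, D) scene background tokens
        verbs = self.verb_table(verb_ids)                     # (B, K, D)
        edits = self.fuse(torch.cat([nouns, verbs], dim=-1))  # (B, K, D)
        # Object edit tokens and background tokens form one sequence that
        # conditions the finetuned diffusion model (e.g. via cross-attention).
        return torch.cat([edits, background], dim=1)          # (B, K+M, D)

cond = EditConditioner()
tokens = cond(torch.randn(1, 3, 768),     # 3 object nouns
              torch.tensor([[0, 2, 5]]),  # one verb per object
              torch.randn(1, 8, 768))     # background tokens
print(tokens.shape)  # torch.Size([1, 11, 768])
```

Because each object carries its own noun/verb pair, tokens from different sources can be concatenated freely at test time, which is what enables composing edited objects from multiple datasets into a single scene.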