Keywords: diffusion models, inpainting, latent diffusion, vae, latent blending, computer vision
Abstract: Linearly interpolating between VAE latents using a downsampled mask field remains a common heuristic for diffusion inpainting. However, this approach systematically violates a key principle: latent compositing must respect decoder equivariance, i.e., decoding after compositing should approximate compositing after decoding. Because VAE latents capture global context rather than pixel-local structure, linear interpolation fails this requirement, producing seams, color shifts, and halos that diffusion subsequently amplifies into larger artifacts.
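In symbols (notation ours, for illustration): writing $D$ for the VAE decoder, $z_{\mathrm{fg}}, z_{\mathrm{bg}}$ for the latents being composited, $m$ for the downsampled latent mask, and $M$ for the pixel-space mask, the heuristic implicitly assumes
$$D\big(m \odot z_{\mathrm{fg}} + (1-m) \odot z_{\mathrm{bg}}\big) \approx M \odot D(z_{\mathrm{fg}}) + (1-M) \odot D(z_{\mathrm{bg}}),$$
an approximation that breaks down because each latent coordinate influences many output pixels rather than a single local region.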
We propose a decoder-equivariant latent compositor (DELC) instantiated as a 14M-parameter transformer ("DecFormer") that predicts full channel-wise blend weights and a nonlinear residual correction for mask-consistent latent fusion. DELC is trained so that decoding after latent fusion approximates pixel-space alpha compositing, enabling pixel-alpha-equivalent results directly in latent space without modifying the diffusion or VAE components.
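A minimal sketch of the fusion, under our reading of the setup (the exact parameterization is an assumption): DecFormer maps $(z_{\mathrm{fg}}, z_{\mathrm{bg}}, m)$ to blend weights $w$ and a residual $r$, composites as $\hat{z} = w \odot z_{\mathrm{fg}} + (1-w) \odot z_{\mathrm{bg}} + r$, and is trained against the pixel-space alpha composite, e.g. by minimizing $\big\| D(\hat{z}) - \big(M \odot D(z_{\mathrm{fg}}) + (1-M) \odot D(z_{\mathrm{bg}})\big) \big\|$.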
DecFormer is lightweight and FLOP-efficient, at 0.07\% of Flux.1-Dev's parameters and 9.26\% of the VAE's parameters, and is plug-compatible with existing diffusion pipelines, requiring no backbone finetuning.
In inpainting experiments with DecFormer trained on the Flux.1 family of models, DELC restores global color consistency, reduces boundary artifacts, and yields higher-fidelity masking than the linear-blending heuristic. We further show that finetuning a lightweight LoRA on Flux.1-Dev for inpainting with a DecFormer prior achieves fidelity comparable to Flux.1-Fill, a fully finetuned inpainting model. Though we demonstrate DELC on inpainting, the approach generalizes to any latent-space operation requiring decoder equivariance.
Primary Area: generative models
Submission Number: 24588