Keywords: diffusion models, inpainting, latent diffusion, vae, latent blending, computer vision
Abstract: Linearly interpolating between VAE latents using a downsampled mask field remains a common heuristic for diffusion inpainting. However, this approach systematically violates a key principle: latent compositing must respect decoder equivariance, i.e., decoding after compositing should approximate compositing after decoding. Because VAE latents capture global context rather than pixel-local structure, linear interpolation fails this requirement, producing seams, color shifts, and halos that diffusion subsequently amplifies into larger artifacts.
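In symbols (notation ours, for illustration): writing $D$ for the VAE decoder, $z_{\mathrm{fg}}, z_{\mathrm{bg}}$ for the latents being composited, $m$ for the downsampled latent mask, and $M$ for the pixel-space mask, the heuristic implicitly assumes
$$D\big(m \odot z_{\mathrm{fg}} + (1-m) \odot z_{\mathrm{bg}}\big) \approx M \odot D(z_{\mathrm{fg}}) + (1-M) \odot D(z_{\mathrm{bg}}),$$
an approximation that breaks down because each latent coordinate influences many output pixels rather than a single local region.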
We propose a decoder-equivariant latent compositor (DELC) instantiated as a 14M-parameter transformer ("DecFormer") that predicts full channel-wise blend weights and a nonlinear residual correction for mask-consistent latent fusion. DELC is trained so that decoding after latent fusion approximates pixel-space alpha compositing, enabling pixel-alpha-equivalent results directly in latent space without modifying the diffusion or VAE components.
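A minimal sketch of the fusion, under our reading of the setup (the exact parameterization is an assumption): DecFormer maps $(z_{\mathrm{fg}}, z_{\mathrm{bg}}, m)$ to blend weights $w$ and a residual $r$, composites as $\hat{z} = w \odot z_{\mathrm{fg}} + (1-w) \odot z_{\mathrm{bg}} + r$, and is trained against the pixel-space alpha composite, e.g. by minimizing $\big\| D(\hat{z}) - \big(M \odot D(z_{\mathrm{fg}}) + (1-M) \odot D(z_{\mathrm{bg}})\big) \big\|$.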
DecFormer is lightweight and FLOP-efficient, at 0.07\% of Flux.1-Dev's parameters and 9.26\% of the VAE's parameters, and is plug-compatible with existing diffusion pipelines, requiring no backbone finetuning.
In inpainting experiments with DecFormer trained on the Flux.1 family of models, DELC restores global color consistency, reduces boundary artifacts, and yields higher-fidelity masking than the linear-blending heuristic. We further show that finetuning a lightweight LoRA on Flux.1-Dev for inpainting with a DecFormer prior achieves fidelity comparable to Flux.1-Fill, a fully finetuned inpainting model. Though we demonstrate DELC on inpainting, the approach generalizes to any latent-space operation requiring decoder equivariance.
Primary Area: generative models
Submission Number: 24588