Keywords: virtual try-on, dense matching
Abstract: Virtual Try-On (VTON) aims to outfit a person with a specific garment from paired
person and garment images. Recent diffusion-based approaches show promising
results but still struggle to preserve fine-grained details such as logos, patterns, and
textures. We suggest these failures come from inaccurate query–key matching in
attention maps. To analyze this, we introduce a correspondence evaluation framework
that extracts dense correspondences from attention maps and evaluates them
with pseudo ground-truth matches. Using this framework, we analyze a simple
DiT-based baseline and observe that its attention maps in most layers fail to capture
reliable semantic correspondences. We then propose CORAL, a lightweight
regularization strategy with two components: correspondence loss, which corrects
where each query attends by aligning it with reliable external matches, and
entropy loss, which sharpens attention for more confident matching. CORAL
improves person–garment alignment in our baseline and can be applied to other
diffusion-based pipelines without architectural changes.
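
The two regularizers described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the correspondence term is written as a cross-entropy between a query's attention distribution and a pseudo ground-truth key index, the entropy term as the Shannon entropy of that distribution, and the names `correspondence_loss`, `attention_entropy`, `coral_style_loss`, and the weight `lambda_ent` are hypothetical. Dense correspondences are read off the attention map by taking the argmax key for each query.

```python
import math

def softmax(scores):
    # Numerically stable softmax over one query's attention logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_match(attn_row):
    # Dense correspondence for one query: the key with the highest attention.
    return max(range(len(attn_row)), key=attn_row.__getitem__)

def attention_entropy(attn_row):
    # Shannon entropy of one query's attention distribution (lower = sharper).
    return -sum(p * math.log(p) for p in attn_row if p > 0.0)

def correspondence_loss(attn_row, target_key):
    # ASSUMED form: cross-entropy toward the pseudo ground-truth key index.
    return -math.log(attn_row[target_key] + 1e-12)

def coral_style_loss(attn, targets, lambda_ent=0.1):
    # ASSUMED combination: mean correspondence term plus weighted mean entropy.
    n = len(attn)
    corr = sum(correspondence_loss(r, t) for r, t in zip(attn, targets)) / n
    ent = sum(attention_entropy(r) for r in attn) / n
    return corr + lambda_ent * ent

# Toy example: one sharp query row and one uniform row over three garment keys.
sharp = softmax([4.0, 0.0, 0.0])
uniform = softmax([0.0, 0.0, 0.0])
print(argmax_match(sharp), coral_style_loss([sharp, uniform], [0, 2]))
```

In this toy setting, a query that attends sharply to its matched key incurs a lower value on both terms than a query with uniform attention, which is the behavior the two losses are meant to encourage.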
Primary Area: generative models
Submission Number: 5169