Keywords: Computer Vision, Virtual Try-On, Diffusion Models, Transformers, Positional Encoding, Evaluation Metrics
Abstract: Recent advancements in pre-trained diffusion models have significantly enhanced image-based virtual try-on, enabling the realistic synthesis of garments with simple textures. However, preserving high-frequency patterns and text consistency remains a formidable challenge, as existing methods often fail to retain fine-grained details. To address this, we introduce PR-VTON, a simple yet effective method that integrates a Position-Refined Positional Encoding (termed PRPE) and a lightweight positional relation learning module (termed PRL) to enhance detail preservation across diverse fabric designs. Specifically, PRPE leverages the inherent impact of positional encoding on attention mechanisms within the Diffusion Transformer (DiT) architecture, guiding attention maps with precise positional cues to achieve superior texture fidelity without additional modules or complex loss functions. Meanwhile, PRL explicitly models token-level correspondences between garments and target bodies, ensuring accurate spatial alignment. Extensive experiments on standard benchmarks demonstrate that PR-VTON surpasses existing baselines in both quantitative and qualitative evaluations, with marked improvements in perceptually sensitive areas such as text logos. Furthermore, we critically reassess evaluation protocols for virtual try-on, highlighting deficiencies of existing metrics in capturing global consistency and fine detail fidelity, and propose a detail-focused metric, loc-CMMD, establishing a more robust standard for high-resolution virtual try-on research.
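The abstract describes PRPE as injecting refined positional cues into the attention mechanism of a DiT block. The paper's actual formulation is not given here, so the following is only a minimal, hypothetical sketch of the general idea: a learned refinement of a base positional encoding is added to queries and keys so that the attention map is biased toward spatially corresponding tokens. All module names, shapes, and the `pos_refine` layer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionGuidedAttention(nn.Module):
    """Toy self-attention block with position-refined cues (illustrative sketch,
    not the paper's PRPE)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical refinement of a base positional encoding.
        self.pos_refine = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
        # x, pos_emb: (batch, tokens, dim)
        B, N, D = x.shape
        pos = self.pos_refine(pos_emb)  # refined positional cue
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Add positional cues to queries/keys so the attention map is biased
        # toward spatially corresponding garment/body tokens.
        q = (q + pos).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = (k + pos).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


if __name__ == "__main__":
    block = PositionGuidedAttention(dim=64)
    tokens = torch.randn(2, 16, 64)  # e.g., flattened latent patches
    pos = torch.randn(2, 16, 64)     # base positional encoding
    print(block(tokens, pos).shape)  # torch.Size([2, 16, 64])
```

This sketch only demonstrates the mechanism the abstract alludes to, namely that positional signals entering queries and keys directly shape the attention map without any extra loss terms or auxiliary branches.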
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4504