Abstract: We present M &M VTO-a mix and match virtual try-on method that takes as input multiple garment images, text de-scription for garment layout and an image of a person. An example input includes: an image of a shirt, an image of a pair of pants, “rolled sleeves, shirt tucked in”, and an image of a person. The output is a visualization of how those garments (in the desired layout) would look like on the given person. Key contributions of our method are: 1) a single stage diffusion based model, with no super resolution cas-cading, that allows to mix and match multiple garments at $1024\times 512$ resolution preserving and warping intricate garment details, 2) architecture design (VTO UN et Diffusion Transformer) to disentangle denoising from person specific features, allowing for a highly effective finetuning strategy for identity preservation (6MB model per individual vs 4GB achieved with, e.g., dreambooth finetuning); solving a common identity loss problem in current virtual try-on meth-ods, 3) layout control for multiple garments via text inputs finetuned over PaLI-3 [8] for virtual try-on task. Experimental results indicate that M &M VTO achieves state-of-the-art performance both qualitatively and quantitatively, as well as opens up new opportunities for virtual try-on via language-guided and multi-garment try-on.
Loading