HELM: Hidden-State Evaluation for Latent Modulation of Virtual Try-On Diffusion Transformers
Abstract: Transformer-based virtual try-on (VTON) models generate high-fidelity garment transfers and score well on standard benchmarks, yet they continue to produce subtle failures such as wrong sleeve length, misplaced collars, and drifted patterns. Practitioners usually try to reduce such failures at inference time via best-of-$N$ sampling, in which the model draws $N$ candidate images and an external verifier, typically a learned reward model, returns the top-ranked one. Best-of-$N$ ranks a completed pool but cannot steer sampling toward better candidates. Diffusion-style inference-time steering addresses this, but presupposes a reward model that can be evaluated on noisy intermediate samples, which is not available off the shelf for VTON. Even when such a reward model is available, existing pointwise VTON evaluators, from image-similarity metrics such as SSIM and LPIPS to VLM or human judges, are miscalibrated for the fine-grained failures that drive VTON quality. We show that a linear probe, a single linear readout of the model's own intermediate hidden state, can rank sampling trajectories by quality. The probe is supervised by pairwise VLM preferences over same-input seed pairs; because both images in a pair share the same input, input difficulty cancels out and only seed-dependent quality variation drives the learning signal. Across three transformer-based VTON models evaluated on DressCode upper-body and VITON-HD, the steered arm improves VTON quality over the matched-budget baseline under a pairwise VLM judge in every configuration tested. Mid-sampling steering, which usually requires training a new reward model on noisy intermediates, thus reduces to a logistic regression on cached activations of the generator itself: a drop-in scoring function for any VTON best-of-$N$ pipeline.
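As a rough illustration of the idea of a pairwise-supervised linear probe, the following minimal sketch fits a logistic regression on differences of cached hidden states and then uses the learned direction to rank candidates. It is not the paper's implementation: the function names (`fit_pairwise_probe`, `pick_best`), the plain gradient-descent optimizer, the hyperparameters, and the synthetic data standing in for cached activations and VLM preferences are all assumptions made for illustration.

```python
import numpy as np

def fit_pairwise_probe(h_a, h_b, prefers_a, lr=0.1, steps=500, l2=1e-3):
    """Fit w so that sigmoid(w . (h_a - h_b)) matches pairwise preferences.

    h_a, h_b: (n_pairs, d) hidden states of two seeds for the same input
              (hypothetical cached activations).
    prefers_a: (n_pairs,) boolean, True if the judge preferred seed a.
    """
    diff = h_a - h_b                       # same-input pairing: shared input effects cancel
    y = prefers_a.astype(np.float64)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))            # predicted P(a preferred)
        grad = diff.T @ (p - y) / len(y) + l2 * w        # logistic loss gradient + L2 penalty
        w -= lr * grad
    return w

def pick_best(h_candidates, w):
    """Return the index of the candidate with the highest probe score."""
    return int(np.argmax(h_candidates @ w))

# Synthetic stand-in for cached activations and judge labels (assumption).
rng = np.random.default_rng(0)
d, n_pairs = 16, 400
w_true = rng.normal(size=d)                # hidden "quality" direction, for the demo only
h_a = rng.normal(size=(n_pairs, d))
h_b = rng.normal(size=(n_pairs, d))
prefers_a = (h_a @ w_true) > (h_b @ w_true)

w = fit_pairwise_probe(h_a, h_b, prefers_a)
train_acc = float(np.mean((((h_a - h_b) @ w) > 0) == prefers_a))

# Score N = 8 hypothetical candidates and keep the top-ranked one.
candidates = rng.normal(size=(8, d))
best = pick_best(candidates, w)
```

Because the probe is a single dot product per candidate, scoring adds negligible cost on top of sampling; which layer and denoising step the hidden states are read from is left unspecified here.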