FreeEyeglass: Training-free and Target-mask-free Eyeglass Transfer for Facial Videos

TMLR Paper7337 Authors

04 Feb 2026 (modified: 17 Apr 2026) · Decision pending for TMLR · CC BY 4.0
Abstract: The rise of e-commerce and short-video platforms has fueled demand for realistic video-based virtual try-on. Unlike virtual try-on of clothing, which has been actively studied to date, virtual try-on of eyeglasses is uniquely challenging: eyeglasses align closely with facial structure and strongly affect facial identity, making the faithful preservation of unedited regions especially important. Existing generative editing approaches, such as GAN- and diffusion-based methods, lack reconstruction objectives and often rely on inpainting, which fails to ensure identity consistency. We argue that semantic editing requires not only plausible generation but also faithful reconstruction, making autoencoder-based latent spaces a natural fit. We introduce a training-free, reference-guided framework for video eyeglass transfer built on Diffusion Autoencoders (DiffAE). By blending semantic features in the encoder and incorporating spatial-temporal self-attention, our method achieves realistic, identity-preserving, and temporally consistent results, and points to the potential of autoencoder-based latent spaces for local video editing. Our implementations and datasets will be released upon acceptance.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewers for their constructive feedback. We have revised the manuscript accordingly. The main changes are:
- Identity fidelity evaluation. Added ArcFace-based identity similarity metrics and updated the discussion (Sec. 4.2, Table 3).
- Smoothing effect clarification. Explained that smoothing arises from stochastic latent approximation and boundary blending, with an ablation showing the sharpness–artifact trade-off (Sec. 4.2, Supp. Sec. C.8).
- Quantitative robustness analysis. Added breakdowns across head pose (yaw bins) and occlusion levels, showing stable performance under moderate conditions (Sec. 4.4, Tables 4–5, Supp. Sec. C.2–C.3).
- CG dataset transparency. Clarified its role as a controlled evaluation and added full per-video results in the supplementary (Sec. 4.1, Sec. 4.3, Supp. Sec. C.9).
- Method clarification. Revised Sec. 3.2 to clarify that feature blending operates in the semantic latent space, enabling non-rigid integration.
- Lighting limitation. Explicitly stated the absence of reflection/illumination modeling and positioned it as future work (Conclusion, Supp. Sec. G).
- Expanded limitations. Extended discussion of failure cases, including dominant original glasses and large geometric mismatch (Supp. Sec. G).
- Metric interpretation. Added analysis of CLIP-I and DINO-I behavior (Supp. Sec. C.10).
- Clarity and qualitative improvements. Refined the introduction and updated Fig. 3 for a more representative example.
Assigned Action Editor: ~Venkatesh_Babu_Radhakrishnan2
Submission Number: 7337