Keywords: Face Restoration, Image Super-resolution, Diffusion Model
Abstract: Blind face restoration from low-quality (LQ) images is a challenging task that requires not only high-fidelity image reconstruction, but also preservation of facial identity. Although diffusion models like Stable Diffusion have shown promise in generating high-quality (HQ) images, their VAE modules are typically trained only on HQ data, resulting in semantic misalignment when encoding LQ inputs. This mismatch significantly weakens the effectiveness of LQ conditions during the denoising process. Existing approaches often tackle this issue by retraining the VAE encoder, which is computationally expensive and memory intensive.
To address this limitation efficiently, we propose $\textbf{LAFR}$ ($\textbf{L}$atent $\textbf{A}$lignment for $\textbf{F}$ace $\textbf{R}$estoration), a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of their HQ counterparts, enabling semantically consistent diffusion sampling without altering the original VAE. To further enhance identity preservation, we introduce a multi-level restoration loss that combines constraints from identity embeddings and facial structural priors. Furthermore, by leveraging the inherent structural regularity of facial images, we show that lightweight finetuning of the diffusion prior on just $\textbf{0.9}$% of the FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods, while reducing training time by $\textbf{70}$%.
Extensive experiments on both synthetic and real-world face restoration benchmarks demonstrate the effectiveness and efficiency of LAFR, achieving high-quality, identity-preserving face reconstruction from severely degraded inputs.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2895