Abstract: Face image inpainting, particularly with user-controllable customization, aims to restore degraded facial regions while adhering to user-provided instructions. Traditional inpainting methods often focus solely on restoring visual fidelity and cannot incorporate user prompts or semantic guidance. In this work, we present InpaintFormer, a novel framework for user-controlled face image inpainting guided by textual prompts. Specifically, we propose a Prompt-guided Feature Modulation (PGFM) module that aligns visual features with user instructions, using a pre-trained CLIP model to extract text and image embeddings. These embeddings are fused to modulate the encoded image features, ensuring semantic consistency with the prompt. Additionally, a Degradation Mask Predictor (DMP) identifies the degraded regions that require inpainting, while a Mask-Aware Self-Attention (MASA) mechanism within the Transformer refines the inpainting process by selectively attending to non-degraded regions, yielding realistic results. By combining PGFM, DMP, and MASA, InpaintFormer enables controllable face image inpainting with high fidelity and semantic alignment. Extensive experiments demonstrate that InpaintFormer outperforms state-of-the-art inpainting methods in controllability and naturalness.
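The abstract does not specify implementation details, but a minimal PyTorch sketch may help ground the two core mechanisms it names: (a) a FiLM-style reading of the Prompt-guided Feature Modulation, in which fused CLIP text and image embeddings produce per-channel scale and shift parameters for the encoded features, and (b) a mask-aware self-attention step that blocks attention to degraded key positions. All class and function names, dimensions, and the FiLM-style formulation below are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptGuidedModulation(nn.Module):
    """Hypothetical sketch of a PGFM-style module (assumed design).

    Fuses CLIP text and image embeddings into per-channel scale/shift
    parameters (FiLM-style) applied to the encoded image features.
    """

    def __init__(self, clip_dim: int = 512, feat_channels: int = 256):
        super().__init__()
        # Fuse concatenated text + image embeddings into a joint vector.
        self.fuse = nn.Sequential(
            nn.Linear(2 * clip_dim, clip_dim),
            nn.GELU(),
        )
        # Predict per-channel scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(clip_dim, feat_channels)
        self.to_beta = nn.Linear(clip_dim, feat_channels)

    def forward(self, feats, text_emb, img_emb):
        # feats: (B, C, H, W) encoded image features
        # text_emb, img_emb: (B, clip_dim) CLIP embeddings
        joint = self.fuse(torch.cat([text_emb, img_emb], dim=-1))
        gamma = self.to_gamma(joint)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(joint)[:, :, None, None]
        # Modulate features so they track the fused prompt semantics.
        return feats * (1.0 + gamma) + beta


def mask_aware_attention(q, k, v, degraded_mask):
    """Hypothetical MASA-style attention step (assumed design).

    q, k, v: (B, N, D) token sequences; degraded_mask: (B, N) bool,
    True where a token lies in a degraded region. Keys/values at
    degraded positions are masked out, so queries attend only to
    valid (non-degraded) regions. Assumes each image has at least
    one non-degraded token, otherwise the softmax row is undefined.
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (B, N, N)
    # Block attention *to* degraded key positions.
    scores = scores.masked_fill(degraded_mask[:, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

In this reading, the DMP would supply `degraded_mask`, and the modulated features from `PromptGuidedModulation` would feed the Transformer blocks that apply `mask_aware_attention`; the actual fusion and masking strategies in InpaintFormer may differ.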
External IDs: dblp:conf/icmcs/OuyangXCHWXCW25