Hunting Blemishes: Language-guided High-fidelity Face Retouching Transformer with Limited Paired Data
Abstract: The prevalence of multimedia applications has led to growing demand for automatic face retouching, which aims to enhance portrait quality by removing blemishes. However, existing auto-retouching methods rely heavily on large amounts of paired training samples and perform less satisfactorily when handling complex or unusual blemishes. To address this issue, we propose a Language-guided Blemish Removal Transformer (LangBRT) that automatically retouches face images while reducing the model's dependency on paired training data. LangBRT leverages vision-language pre-training for precise facial blemish removal. Specifically, we design a text-prompted blemish detection module that indicates the regions to be edited. These priors not only enable the transformer network to handle specific blemishes in certain areas, but also reduce the reliance on retouching training data. Further, we adopt a target-aware cross-attention mechanism so that blemish regions are edited accurately while normal skin regions are left unchanged. Finally, we adopt a regularization approach that encourages semantic consistency between the synthesized image and the text description of the desired retouching outcome. Extensive experiments demonstrate the superior performance of LangBRT over competing auto-retouching methods in terms of dependency on training data, blemish detection accuracy and synthesis quality.
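The target-aware cross-attention idea in the abstract can be illustrated with a minimal sketch: image tokens attend to text-prompt tokens, and the resulting update is gated by a per-token blemish mask so that tokens outside the detected blemish regions pass through unchanged. All names here (`target_aware_cross_attention`, the gating scheme, the omission of learned projections) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def target_aware_cross_attention(img_tokens, text_tokens, blemish_mask):
    """Hypothetical sketch of target-aware cross-attention.

    img_tokens:   (N_img, d) image patch features (queries)
    text_tokens:  (N_txt, d) text-prompt features (keys/values)
    blemish_mask: (N_img,) values in [0, 1]; 1 = blemish region to edit
    Learned query/key/value projections are omitted for brevity.
    """
    d = img_tokens.shape[-1]
    attn = softmax(img_tokens @ text_tokens.T / np.sqrt(d))  # (N_img, N_txt)
    update = attn @ text_tokens                              # (N_img, d)
    gate = blemish_mask[:, None]                             # (N_img, 1)
    # Blemish tokens receive the text-conditioned update;
    # normal-skin tokens are passed through untouched.
    return gate * update + (1.0 - gate) * img_tokens
```

In this sketch the mask acts as a hard/soft gate on the residual update, which is one simple way to realize "edit blemish regions, preserve normal skin" without affecting unmasked tokens.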
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work performs face retouching based on textual prompts, which contributes to multimedia/multi-modal processing in two main ways. Semantic understanding integration: by interpreting textual prompts that describe blemish types and locations, the model gains a deeper understanding of the semantics of facial features, enhancing its ability to comprehend and process multi-modal data. Customized image enhancement: textual prompts allow face retouching strategies tailored to individual needs, as indicated by user-supplied information. This customization leads to more precise and targeted image enhancement, improving the user experience in multimedia applications.
Supplementary Material: zip
Submission Number: 4831