Probing the Prompting of CLIP on Human Faces

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: Large-scale multimodal models such as CLIP have attracted considerable attention for their generalization capability. CLIP accepts free-form text prompts, but its performance varies with different prompt manipulations in ways that are considered unpredictable. In this paper, we conduct a controlled study of how CLIP perceives images under different forms of text prompts, focusing on human facial attributes. We find that (1) using the prompt starter "a photo of" guides the model to allocate higher attention weights to human faces, leading to better classification performance; (2) CLIP aligns information better from shorter text prompts, as additional textual details shift attention away from key words; and (3) appropriately adding punctuation or removing stop words in the text prompt can shift attention toward the target information. Our practice on facial attributes sheds light on the design of reliable text prompts for CLIP in other tasks.
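The three prompt manipulations summarized above can be sketched as simple string transformations. This is a minimal illustration only: the helper names and the stop-word list are our own, not from the paper's code, and the actual study measures their effect on CLIP's attention and classification accuracy rather than on the strings themselves.

```python
# Illustrative prompt manipulations of the kinds probed in the paper.
# STOP_WORDS and all function names are hypothetical, for demonstration.

STOP_WORDS = {"a", "an", "the", "of", "is", "with"}

def with_starter(label: str) -> str:
    """Prepend the 'a photo of' starter, found to raise attention on faces."""
    return f"a photo of a {label}"

def without_stop_words(prompt: str) -> str:
    """Remove stop words so attention concentrates on content words."""
    return " ".join(w for w in prompt.split() if w.lower() not in STOP_WORDS)

def with_punctuation(prompt: str) -> str:
    """Append a period, one of the punctuation tweaks the paper examines."""
    return prompt.rstrip(".") + "."

label = "smiling person"
print(with_starter(label))                      # a photo of a smiling person
print(without_stop_words(with_starter(label)))  # photo smiling person
print(with_punctuation(with_starter(label)))    # a photo of a smiling person.
```

In practice, each variant would be encoded with CLIP's text encoder and compared against image embeddings; the paper's finding is that such small edits measurably move the model's attention and accuracy.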