Abstract: While significant progress has been made in multi-modal learning driven by large-scale image-text datasets, such datasets remain scarce in the facial domain. To facilitate and advance the field of facial representation learning, we present FLIP-80M, a large-scale visual-linguistic dataset comprising over 80 million face images paired with text descriptions. To construct FLIP-80M, we filter 5 billion samples from large-scale publicly available general-domain image-text datasets and incorporate AI-Generated Content (AIGC) methods for quality management and data augmentation. The data creation process follows a mixed-method pipeline that filters face-related pairs from both visual and linguistic perspectives, including face detection, face-caption classification, text de-noising, and AIGC augmentation. The resulting FLIP-80M is the largest face-text dataset to date; it exhibits high data quality and demonstrates the potential to enhance the performance of face representation models. To assess the efficacy of our dataset, we train FLIP (Facial Language-Image Pretraining) with a contrastive learning objective and evaluate its representation capabilities across various downstream tasks. Experimental results show that FLIP achieves state-of-the-art results across 10 face analysis tasks, such as face parsing, face alignment, and face attribute classification. The dataset and models will be publicly available.
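For concreteness, the sketch below illustrates the kind of symmetric contrastive objective used in CLIP-style training such as FLIP's; it is a minimal illustration, not the authors' implementation, and all names (image_emb, text_emb, temperature) are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired face images and captions."""
    # L2-normalize embeddings so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by a temperature hyperparameter.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```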
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: We introduce a large-scale image-text multimodal dataset in the face domain and provide the FLIP model (Facial Language-Image Pre-Training) for a variety of face analysis tasks. The experimental results showcase the effectiveness of FLIP in learning generalized facial representations, highlighting its superiority over other image-text models.
Supplementary Material: zip
Submission Number: 3240