Abstract: Facial expression recognition (FER) has advanced significantly through the application of deep learning techniques to visual classification. Recent research has explored pre-trained language-image models, such as CLIP, which leverage natural language supervision to train image backbones and learn general visual representations. Concurrently, visual prompt tuning has emerged as a way to reduce tuning overhead on downstream tasks: the pre-trained backbone is frozen, and additional learnable parameters, known as visual prompts, are incorporated into the model input. This strategy avoids updating the entire network, optimizing only the visual prompts for the task at hand. In this study, we propose a novel tuning scheme, Text-guided Visual Prompt Tuning with Masked facial images (T-VPT-M), for both basic and compound FER. Our method uses natural language supervision for visual prompt learning and employs a random masking mechanism so that the visual prompts adapt to diverse informative facial regions. Experimental results on three real-world datasets, covering both basic and compound facial expressions, demonstrate the efficacy of the T-VPT-M scheme.
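To make the frozen-backbone-plus-learnable-prompts idea concrete, the PyTorch-style sketch below illustrates the general pattern the abstract describes: prompt tokens are the only trainable parameters, a random mask drops patch tokens during training, and classification is driven by similarity to frozen text embeddings of expression prompts. This is a minimal illustration under assumed interfaces, not the authors' implementation; the class name `VisualPromptTuner`, the `mask_ratio` parameter, and the assumption that `backbone` consumes a token sequence are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPromptTuner(nn.Module):
    """Sketch of visual prompt tuning on a frozen ViT-style image encoder.

    Only `self.prompts` is trainable; the backbone stays frozen. Shapes and
    the backbone interface are illustrative assumptions.
    """

    def __init__(self, backbone, num_prompts=8, embed_dim=768, mask_ratio=0.3):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained image encoder
        # Learnable visual prompt tokens, shared across the batch.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)
        self.mask_ratio = mask_ratio

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim) from the patch embedding.
        b, n, _ = patch_tokens.shape
        if self.training and self.mask_ratio > 0:
            # Randomly zero out a fraction of patch tokens so the prompts must
            # attend to varying facial regions (a stand-in for facial masking).
            keep = (torch.rand(b, n, device=patch_tokens.device) > self.mask_ratio)
            patch_tokens = patch_tokens * keep.unsqueeze(-1).float()
        tokens = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        # Assumed interface: frozen transformer blocks plus pooling, returning
        # one feature vector per image of shape (batch, embed_dim).
        return self.backbone(tokens)

def text_guided_logits(image_features, text_features, temperature=0.07):
    """CLIP-style scoring: cosine similarity between pooled image features and
    frozen text embeddings of class prompts such as "a photo of a happy face"."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    return image_features @ text_features.t() / temperature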