Abstract: Stress has rapidly emerged as a significant public health concern in contemporary society, necessitating prompt identification and effective intervention strategies. Video-based stress detection offers a non-invasive, low-cost, and mass-reaching approach to identifying stress. In this paper, we propose a three-level content-semantic-world knowledge framework that addresses three particular issues in video-based stress detection: (1) How can video semantics and frame contents be abstracted and encoded into a visual representation? (2) How can general-purpose Large Multimodal Models (LMMs) be leveraged to augment a task-specific visual representation? (3) To what extent can general-purpose LMMs contribute to video-based stress detection? We design a Slow-Emotion-Fast-Action scheme that encodes both the fast temporal changes of body actions revealed across video frames and the subtle emotional details of each video segment into the visual representation. We augment this task-specific visual representation with linguistic facial-expression descriptions obtained by prompting general-purpose LMMs, and we build a knowledge retriever to evaluate and select the most appropriate LMM output. Experimental results on two datasets show that 1) our three-level framework achieves F1-scores of 90.89% on the UVSD dataset and 80.79% on the RSL dataset, outperforming the state of the art; 2) leveraging LMMs improves the F1-score by 2.25% on UVSD and 3.55% on RSL compared with using the traditional Facial Action Coding System; and 3) relying purely on general-purpose LMMs is insufficient, yielding F1-scores of 88.73% on UVSD and 77.48% on RSL, which demonstrates the necessity of combining task-specific dedicated solutions with the world knowledge provided by LMMs.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Engagement] Emotional and Social Signals
Relevance To Conference: This paper presents multimodal encoding: 1) Visual Encoding with a Slow-Emotion-Fast-Action mechanism and 2) Linguistic Encoding via knowledge from LMMs. Our framework then fuses the two modalities for psychological stress detection. We believe this work has the potential to facilitate human well-being research in the multimedia field.
Supplementary Material: zip
Submission Number: 1141