Facial Depression Estimation via Multi-Cue Contrastive Learning

Published: 2025 · Last Modified: 27 Jan 2026 · IEEE Trans. Circuits Syst. Video Technol. 2025 · CC BY-SA 4.0
Abstract: Vision-based depression estimation is an emerging yet impactful task, whose challenge lies in predicting the severity of depression from facial videos lasting at least several minutes. Existing methods primarily focus on fusing frame-level features to create comprehensive representations. However, they often overlook two crucial aspects: 1) inter- and intra-cue correlations, and 2) variations among samples. Hence, simply characterizing sample embeddings while ignoring the relations among multiple cues leads to limitations. To address this problem, we propose a novel Multi-Cue Contrastive Learning (MCCL) framework that mines the relations among multiple cues for discriminative representation. Specifically, we first introduce a novel cross-characteristic attentive interaction module to model the relationships among multiple cues drawn from four facial features (i.e., 3D landmarks, head poses, gazes, and FAUs). Then, we propose a temporal segment attentive interaction module to capture the temporal relationships within each facial feature over time intervals. Moreover, we integrate contrastive learning to leverage the variations among samples by treating inter-cue and intra-cue embeddings of the same sample as positive pairs and embeddings from other samples as negatives. In this way, the proposed MCCL framework exploits both the relationships among facial features and the variations among samples to enhance multi-cue mining, thereby achieving more accurate facial depression estimation. Extensive experiments on the public datasets DAIC-WOZ, CMDC, and E-DAIC demonstrate not only that our model outperforms state-of-the-art depression estimation methods, but also that the discriminative representations of facial behaviors provide potential insights into depression. Our code is available at: https://github.com/xkwangcn/MCCL.git
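The abstract's contrastive objective, in which embeddings of different cues from the same sample form positive pairs while embeddings from other samples act as negatives, can be sketched with a generic InfoNCE-style loss. This is a minimal NumPy illustration of the general idea, not the paper's implementation; the function name, shapes, and temperature value are assumptions for illustration only.

```python
import numpy as np

def multicue_contrastive_loss(cue_embs, temperature=0.1):
    """Illustrative InfoNCE-style multi-cue contrastive loss (not the
    authors' code). cue_embs has shape (num_cues, batch, dim): one
    embedding per facial cue (e.g., landmarks, head pose, gaze, FAUs)
    per sample. Rows from the same sample are positives; rows from
    other samples are negatives."""
    C, N, D = cue_embs.shape
    # L2-normalise so dot products become cosine similarities
    z = cue_embs / np.linalg.norm(cue_embs, axis=-1, keepdims=True)
    flat = z.reshape(C * N, D)                # stack all cue views
    sim = flat @ flat.T / temperature         # pairwise similarity logits
    sample_id = np.tile(np.arange(N), C)      # sample index of each row
    losses = []
    for i in range(C * N):
        pos = sample_id == sample_id[i]
        logits = np.delete(sim[i], i)         # drop self-similarity
        labels = np.delete(pos, i)
        # log-softmax over all other rows, averaged over the positives
        log_prob = logits - np.log(np.exp(logits).sum())
        losses.append(-log_prob[labels].mean())
    return float(np.mean(losses))
```

When the cue embeddings of each sample align (positives close, negatives far apart), this loss is lower than for random embeddings, which is the signal the framework would exploit to pull a sample's cues together in the shared representation space.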