Abstract: Artificial intelligence methods offer objectivity and convenience in automatic depression detection; however, current research often neglects the critical role of facial landmarks. This oversight yields insufficient spatial structure information and weak local detail representation, so models fail to capture the nuanced semantic information crucial for identifying depression-related cues. To address these issues, we introduce a novel dual-branch network comprising the Landmark-Image-Landmark Net (LIL Net) and the Global Context Vision Transformer Net (GCVit Net). Through a dual-stream, multiscale, cross-fusion strategy, LIL Net extracts features from the original facial image alongside landmark features, prioritizing the detailed semantic information of potential depression cues. LIL Net employs an innovative LIL Attention mechanism to jointly learn multiscale features from facial landmarks and images, enhancing the model's ability to capture fine-grained depression-related cues. A Multi-scale Feature Fusion (MSFF) module then fuses the resulting multiscale features, using attention to strengthen the semantic expression of potential depression cues within the facial landmarks. Meanwhile, the GCVit Net branch complements this local view by extracting global facial features. Finally, the features from both branches are concatenated to improve the accuracy of depression severity prediction. Experimental results demonstrate that our model outperforms existing methods in depression detection. We release our code at https://github.com/xlx777/LIL-Net.
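For concreteness, the following is a minimal PyTorch sketch of the two-branch fusion described above. The branch internals (LIL Attention, MSFF, the GCVit backbone) are replaced by placeholder encoders, and all class names, helper functions, and dimensions are illustrative assumptions rather than the authors' released implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

class DualBranchDepressionNet(nn.Module):
    """Sketch of the dual-branch design: a local landmark-image branch
    and a global branch, fused by concatenation for severity regression.

    Both branches here are stand-in MLP encoders; the real model uses
    LIL Attention / MSFF and a GCVit backbone in their place.
    """

    def __init__(self, lil_dim: int = 256, global_dim: int = 256):
        super().__init__()
        # Placeholder for LIL Net: jointly encodes facial image and
        # landmark features into a fine-grained local representation.
        self.lil_branch = nn.Sequential(
            nn.LazyLinear(lil_dim), nn.ReLU(), nn.Linear(lil_dim, lil_dim)
        )
        # Placeholder for GCVit Net: encodes the face image alone into
        # a global context representation.
        self.global_branch = nn.Sequential(
            nn.LazyLinear(global_dim), nn.ReLU(), nn.Linear(global_dim, global_dim)
        )
        # Regression head on the concatenated branch features predicts
        # a scalar depression severity score (e.g., a BDI-II-style value).
        self.head = nn.Linear(lil_dim + global_dim, 1)

    def forward(self, image_feats: torch.Tensor, landmark_feats: torch.Tensor) -> torch.Tensor:
        # Local branch consumes image and landmark information together.
        local = self.lil_branch(torch.cat([image_feats, landmark_feats], dim=-1))
        # Global branch consumes the image representation alone.
        glob = self.global_branch(image_feats)
        # Concatenate both branches before the prediction head,
        # mirroring the fusion step described in the abstract.
        return self.head(torch.cat([local, glob], dim=-1))

# Usage with dummy pooled features (shapes are assumptions):
model = DualBranchDepressionNet()
img = torch.randn(4, 512)   # e.g., pooled facial image features
lmk = torch.randn(4, 136)   # e.g., 68 (x, y) landmark coordinates, flattened
score = model(img, lmk)     # shape: (4, 1)
```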