TTFNet: Temporal-Frequency Features Fusion Network for Speech Based Automatic Depression Recognition and Assessment
Abstract: Prior studies have revealed that the phonological features of depressed patients differ from those of healthy individuals. With the increasing prevalence of depression, an objective and convenient approach to early screening is needed. To this end, we propose an automatic depression detection method based on hybrid speech features extracted by deep learning, dubbed TTFNet. First, to effectively mine the intrinsic relationships among multidimensional dynamic features in the frequency domain, the log-Mel spectrogram of the raw speech and its derivatives are encoded into a quaternion representation, and a newly designed quaternion VisionLSTM is used to capture their synergistic effects. In parallel, we integrate sLSTM with the pre-trained wav2vec 2.0 model to fully capture temporal features. To further exploit the complementarity between temporal and frequency features, we design an XConformer block for cross-sequence interaction, which combines self-attention mechanisms with convolutional modules. Built on this block, a dual-path fusion module exploits the mutual reinforcement of features from different domains, enhancing the generalization capability of the proposed model. Extensive experiments on the AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC datasets demonstrate that our method outperforms current state-of-the-art methods on both depression recognition and severity prediction tasks.
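To make the frequency-domain front end concrete: the abstract does not specify which four components fill the quaternion, but a common encoding of this kind pairs the static log-Mel spectrogram with its first-, second-, and third-order temporal derivatives, one per quaternion component. The sketch below illustrates that assumption with librosa; the function name quaternion_logmel and all parameter values (16 kHz sampling, 80 Mel bands) are illustrative choices, not the authors' implementation.

    # Minimal sketch: stack a log-Mel spectrogram and its derivatives into a
    # 4-channel, quaternion-style input. Component choice is an assumption;
    # the paper's abstract only says "log-Mel spectrogram ... and its
    # related derivatives".
    import numpy as np
    import librosa

    def quaternion_logmel(wav_path, sr=16000, n_mels=80):
        """Return a (4, n_mels, T) array: [static, delta1, delta2, delta3]."""
        y, _ = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        logmel = librosa.power_to_db(mel, ref=np.max)   # static log-Mel (n_mels, T)
        d1 = librosa.feature.delta(logmel, order=1)     # first derivative
        d2 = librosa.feature.delta(logmel, order=2)     # second derivative
        d3 = librosa.feature.delta(logmel, order=3)     # third derivative (assumed 4th component)
        return np.stack([logmel, d1, d2, d3], axis=0)   # one derivative per quaternion component

Downstream, such a 4-channel tensor can be consumed by a quaternion-valued layer (here, the proposed quaternion VisionLSTM) so that the cross-channel relationships among the static spectrogram and its derivatives are modeled jointly rather than channel by channel.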
External IDs: dblp:journals/titb/ChenSJCWLNCHWYS25