Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · License: CC BY 4.0
Abstract: Emotion recognition aims to discern the emotional state of subjects within an image, relying on subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: first localizing subjects with off-the-shelf detectors, then performing emotion classification through late fusion of subject and context features. However, this complicated paradigm suffers from disjoint training stages and limited interaction between fine-grained subject and context elements. To address this challenge, we present a single-stage emotion recognition approach that employs a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification. Rather than compartmentalizing training stages, we jointly leverage box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, we introduce DSCT to facilitate interactions between fine-grained subject-context cues in a "decouple-then-fuse" manner. The decoupled query tokens (subject queries and context queries) gradually intertwine across layers within DSCT, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameters, achieving a 3.39% accuracy improvement on CAER-S and a 6.46% average precision gain on EMOTIC.
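To make the "decouple-then-fuse" idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' released code) of one decoder layer: subject and context queries first attend to image features separately, then the subject queries attend to the context queries so the two streams intertwine, and each subject query finally predicts a box and emotion logits under joint supervision. All names, dimensions, query counts, and the single fusion direction are illustrative assumptions.

# Hedged sketch of a decouple-then-fuse decoder layer; every module and
# hyperparameter here is an assumption, not the paper's implementation.
import torch
import torch.nn as nn

class DecoupledSubjectContextLayer(nn.Module):
    """Subject/context queries gather image cues separately (decouple),
    then subject queries attend to context queries (fuse)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.subj_cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, subj_q, ctx_q, img_feats):
        # Decouple: each query set attends to the image features on its own.
        subj_q = subj_q + self.subj_cross(subj_q, img_feats, img_feats)[0]
        ctx_q = ctx_q + self.ctx_cross(ctx_q, img_feats, img_feats)[0]
        # Fuse: subject queries absorb contextual cues; stacking such layers
        # lets the two query streams gradually intertwine.
        subj_q = subj_q + self.fuse(subj_q, ctx_q, ctx_q)[0]
        return self.norm(subj_q + self.ffn(subj_q)), ctx_q

# Joint supervision in a single stage: each subject query yields a box and
# emotion logits (26 discrete categories in EMOTIC, used here as an example).
layer = DecoupledSubjectContextLayer()
subj_q, ctx_q = torch.randn(2, 20, 256), torch.randn(2, 20, 256)
img_feats = torch.randn(2, 900, 256)
subj_q, ctx_q = layer(subj_q, ctx_q, img_feats)
box_head, emo_head = nn.Linear(256, 4), nn.Linear(256, 26)
boxes, emotions = box_head(subj_q).sigmoid(), emo_head(subj_q)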
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Automatic human emotion recognition has received increasing research attention in the multimedia community, with studies inferring emotions from speech [1, 2], images [3, 4], and multiple modalities [5, 6]. Its potential applications span healthcare, driver surveillance, and diverse human-computer interaction systems, reflecting the fundamental role of emotions in daily communication. In this paper, we focus on inferring the emotion of a person in a real-world image: given an in-the-wild image, we aim to identify the subject's apparent discrete emotion category (e.g., happy, sad, fearful, or neutral). Current approaches typically follow a two-stage pipeline: first localizing subjects with off-the-shelf detectors, then performing emotion classification through late fusion of subject and context features. However, this complicated paradigm suffers from disjoint training stages and limited fine-grained interaction between subject and context elements. To address this challenge, we present a single-stage emotion recognition approach that employs a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification.
References:
[1] Jiaxin Ye, Yujie Wei, Xin-Cheng Wen, Chenglong Ma, Zhizhong Huang, Kunhong Liu, and Hongming Shan. 2023. Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition. In Proceedings of the 31st ACM International Conference on Multimedia. 5956–5965.
[2] Shiqing Zhang, Shiliang Zhang, Tiejun Huang, and Wen Gao. 2017. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Transactions on Multimedia 20, 6 (2017), 1576–1590.
[3] Mijanur Palash and Bharat Bhargava. 2023. EMERSK: Explainable Multimodal Emotion Recognition with Situational Knowledge. IEEE Transactions on Multimedia (2023).
[4] Haimin Zhang and Min Xu. 2021. Recognition of emotions in user-generated videos through frame-level adaptation and emotion intensity learning. IEEE Transactions on Multimedia 25 (2021), 881–891.
[5] Dung Nguyen, Duc Thanh Nguyen, Rui Zeng, Thanh Thi Nguyen, Son N. Tran, Thin Nguyen, Sridha Sridharan, and Clinton Fookes. 2021. Deep auto-encoders with sequential learning for multimodal dimensional emotion recognition. IEEE Transactions on Multimedia 24 (2021), 1313–1324.
[6] Weizhi Nie, Minjie Ren, Jie Nie, and Sicheng Zhao. 2020. C-GCN: Correlation based graph convolutional network for audio-video emotion recognition. IEEE Transactions on Multimedia 23 (2020), 3793–3804.
Supplementary Material: zip
Submission Number: 1273