Keywords: Emotion Recognition, Valence-Arousal, Image Captioning, BERT, Human Cognition, Sentiment Analysis
TL;DR: We propose a novel framework for emotion prediction in a continuous (Valence-Arousal) space and provide a new benchmark for the community to explore and improve human emotion perception.
Abstract: Estimating the emotion that visual stimuli evoke in viewers has gained significant traction in recent years. Existing frameworks rely either on a person's presence in the image or on object-level and low-level image features. By focusing on the person or object in the image, these frameworks fail to capture the context or the interactions among multiple elements in the image; moreover, what if an image contains neither a human subject nor an object? We address this drawback by building a Cognitive Contextual Summarization (CCS) model on a One-For-All (OFA) backbone trained on multiple tasks, including image captioning. The backbone's ability to recognize elements in the image and generate captions lets us capture interactions through the captions, which we decode with BERT for contextual understanding. End-to-end fusion of the OFA and BERT features allows us to predict continuous human emotion (Valence, Arousal) from an image. We train our framework on the Building Emotional Machines dataset from the literature, and experiments show that our model outperforms the state of the art.
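To make the fusion idea in the abstract concrete, below is a minimal sketch (not the authors' code, which is not public here) of a late-fusion regressor: image-side features from a captioning backbone (OFA in the paper; a placeholder tensor in this sketch) are concatenated with BERT embeddings of the generated caption and regressed to a (valence, arousal) pair. The class name `FusionRegressor`, the feature dimensions, and the two-layer head are illustrative assumptions; only `BertModel`/`BertTokenizer` from the `transformers` library are real APIs.

```python
# Minimal sketch of caption-conditioned valence-arousal regression,
# assuming precomputed image features from a captioning backbone.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class FusionRegressor(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, image_feat_dim=768, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.head = nn.Sequential(
            nn.Linear(image_feat_dim + self.bert.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # outputs (valence, arousal)
        )

    def forward(self, image_feats, input_ids, attention_mask):
        # Pooled [CLS] embedding serves as the caption's contextual summary.
        caption_emb = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        fused = torch.cat([image_feats, caption_emb], dim=-1)
        return self.head(fused)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
caption = "a child smiling on a sunny beach"  # would come from the OFA captioner
toks = tokenizer(caption, return_tensors="pt")
model = FusionRegressor()
image_feats = torch.randn(1, 768)  # placeholder for OFA image features
va = model(image_feats, toks["input_ids"], toks["attention_mask"])
print(va.shape)  # torch.Size([1, 2])
```

Note the design choice this sketch assumes: fusion happens after each modality is independently encoded, so the caption pathway supplies inter-element context that raw image features alone would miss; the paper's end-to-end training would additionally backpropagate through both encoders.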
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Supplementary Material: zip