Multimodal Sentiment Analysis To Explore the Structure of Emotions
Anthony Hu, Seth Flaxman
Feb 15, 2018 (modified: Feb 15, 2018) · ICLR 2018 Conference Blind Submission
Abstract: We propose a novel approach to multimodal sentiment analysis using deep neural
networks combining visual recognition and natural language processing. Our
goal differs from the standard sentiment analysis task of predicting whether
a sentence expresses positive or negative sentiment; instead, we aim to infer the
latent emotional state of the user. Thus, we focus on predicting the emotion word
tags attached by users to their Tumblr posts, treating these as “self-reported emotions.”
We demonstrate that our multimodal model combining both text and image
features outperforms separate models based solely on either images or text. Our
model’s results are interpretable, automatically yielding sensible word lists associated
with emotions. We explore the structure of emotions implied by our model
and compare it to what has been posited in the psychology literature, and validate
our model on a set of images that have been used in psychology studies. Finally,
our work also provides a useful tool for the growing academic study of images—
both photographs and memes—on social networks.
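The abstract does not specify the model's architecture. As a purely illustrative sketch (the feature dimensions, the fusion-by-concatenation design, and the emotion tag set below are all assumptions, not the authors' actual method), a multimodal classifier of this kind might combine precomputed image and text features and output a distribution over emotion tags:

```python
import numpy as np

# Hypothetical dimensions: 2048-d image features (e.g. from a pretrained
# CNN) and 300-d text features (e.g. averaged word embeddings).
IMG_DIM, TXT_DIM = 2048, 300
EMOTIONS = ["happy", "sad", "angry", "scared"]  # illustrative tag set only

rng = np.random.default_rng(0)

# A single linear fusion layer with a softmax output -- a stand-in for
# the deep network described in the paper, whose exact architecture the
# abstract does not give. Weights here are random, for illustration.
W = rng.normal(scale=0.01, size=(IMG_DIM + TXT_DIM, len(EMOTIONS)))
b = np.zeros(len(EMOTIONS))

def predict_emotion(img_feat, txt_feat):
    """Concatenate modality features (early fusion) and return a
    probability distribution over the emotion tags."""
    fused = np.concatenate([img_feat, txt_feat])
    logits = fused @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = predict_emotion(rng.normal(size=IMG_DIM), rng.normal(size=TXT_DIM))
```

The sketch shows only the fusion step the abstract implies (a joint model outperforming image-only and text-only baselines); in practice each modality's features would come from trained encoders rather than random inputs.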