Abstract: Several studies have produced sophisticated models for
sentiment analysis of textual data, and many others have tackled
feature extraction from images. However, far fewer studies focus
on the multimodal representation of data, namely information
that consists of multiple channels. In this work, we focus on the
classification problem of multimodal data. Memes comprise a
visual image and a textual caption. This work is dedicated to
classifying hateful memes and proposes two approaches
to solve the multimodal classification problem. The first converts
the visual channel into a textual one and feeds it to textual
classifiers. The second, which yielded superior results,
converts both channels into vector representations and then
combines them to represent the visual-textual context. This work
grew out of the Facebook Hateful Memes challenge.
The model developed in this work ranked 32nd among
3172 competitors in the challenge. The model is implemented
with no domain knowledge or understanding of hate speech.
This model performed well on both the Facebook Hateful Memes
challenge dataset and a novel dataset that we created to demonstrate
the consistency of generic models relative to models that are
structured according to domain knowledge. In contrast to the top
solution in the Facebook Hateful Memes challenge, this work provides a
generic approach, without hard-coding rules ahead of training or
validation, that is able to learn the hatefulness definition from any
dataset. This work also describes the novel dataset, which comprises
hateful memes retrieved randomly from the web and serves
as a second dataset for testing the generality of the approaches.
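
As a rough illustration of the second approach described above, the sketch below fuses an image vector and a caption vector and scores the result with a logistic unit. The abstract does not say how the two vectors are combined or which encoders and classifier are used, so concatenation, the logistic scorer, the toy vectors, and the names `fuse` and `hateful_probability` are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def fuse(image_vec: Sequence[float], text_vec: Sequence[float]) -> List[float]:
    """Combine the two channel vectors into one joint vector.

    Concatenation is an assumed fusion choice; the abstract only says
    the two representations are "combined".
    """
    return list(image_vec) + list(text_vec)


def hateful_probability(joint: Sequence[float],
                        weights: Sequence[float],
                        bias: float) -> float:
    """Score a joint vector with a single logistic unit (hypothetical classifier)."""
    if len(joint) != len(weights):
        raise ValueError("weights must match the joint vector length")
    z = sum(w * x for w, x in zip(weights, joint)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy stand-ins for encoder outputs; real vectors would come from
# an image model and a text model.
image_vec = [0.2, -0.1, 0.7]
text_vec = [0.5, 0.3]

joint = fuse(image_vec, text_vec)

# Untrained (all-zero) weights, so the score is uninformative by construction.
weights = [0.0] * len(joint)
p = hateful_probability(joint, weights, bias=0.0)
```

In practice the weights would be learned from labelled memes, which is what lets such a model pick up a dataset's notion of hatefulness without hand-written rules.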