Abstract: With the increasing popularity of online social applications, stickers have become common in online chats. Teaching a model to select the appropriate sticker from a set of candidate stickers based on dialogue context is important for optimizing the user experience.
Existing methods have proposed leveraging emotional information to facilitate the selection of appropriate stickers. However, because sticker images, emotionally charged words in the dialogue, and emotion labels frequently co-occur, these methods tend to over-rely on this dataset bias, inducing spurious correlations during training. As a result, they may select inappropriate stickers that do not match the user's intended expression. In this paper, we introduce a causal graph to explicitly identify the spurious correlations in the sticker selection task. Building upon this analysis, we propose a Causal Knowledge-Enhanced Sticker Selection (CKS) model to mitigate spurious correlations. Specifically, we design a knowledge-enhanced emotional utterance extractor to identify emotional information within dialogues. An interventional visual feature extractor is then employed to obtain unbiased visual features, aligning them with the emotional utterance representations. Finally, a standard transformer encoder fuses the multimodal information for emotion recognition and sticker selection. Extensive experiments on the MOD dataset show that our CKS model significantly outperforms the baseline models.
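The interventional visual feature extractor can be loosely illustrated as a backdoor-adjustment step: instead of using the raw image feature, the model approximates the expectation over a confounder dictionary weighted by its prior. The sketch below is a minimal, hypothetical illustration of that idea; the function name, the confounder dictionary, and the uniform prior are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def backdoor_adjusted_feature(visual_feat, confounder_dict, prior):
    """Sketch of backdoor adjustment: approximate E_z[f(x, z)] by
    attending over a confounder dictionary weighted by its prior P(z).
    All shapes and names here are illustrative assumptions."""
    # Similarity between the image feature and each confounder entry.
    scores = confounder_dict @ visual_feat            # shape (K,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over K entries
    # Reweight by the confounder prior P(z), then renormalise.
    weights = weights * prior
    weights /= weights.sum()
    # Expected confounder representation, fused with the original feature.
    context = weights @ confounder_dict               # shape (d,)
    return visual_feat + context

# Toy usage with random features (d = feature dim, K = dictionary size).
rng = np.random.default_rng(0)
d, K = 8, 5
x = rng.normal(size=d)                # a raw visual feature
Z = rng.normal(size=(K, d))           # hypothetical confounder prototypes
p = np.full(K, 1.0 / K)               # uniform prior over confounders
out = backdoor_adjusted_feature(x, Z, p)
print(out.shape)
```

In practice such a dictionary is often built by clustering training-set features, and the adjusted feature would then be aligned with the textual emotion representation before transformer fusion.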
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Vision and Language, [Engagement] Emotional and Social Signals
Relevance To Conference: This work advances multimedia/multimodal processing by addressing the challenge of selecting contextually appropriate stickers in online communication. Stickers, as a form of non-verbal communication in digital environments, enhance user interaction by allowing emotions and reactions to be expressed visually. Current approaches to sticker selection typically leverage emotional cues from text to match stickers, but often fall prey to dataset biases in which certain emotions are spuriously linked with specific stickers. Our contribution is a Causal Knowledge-Enhanced Sticker Selection (CKS) model that uses a causal graph to identify and mitigate spurious correlations between sticker images and emotional text. By employing this model, we improve the accuracy of context-aware sticker recommendation, ensuring that the selected stickers align more closely with the user's intended emotional expression. This aligns with the growing need for multimedia processing tools that enhance user experience by understanding and integrating multiple modalities seamlessly.
Supplementary Material: zip
Submission Number: 4488