Abstract: Multimodal aspect-based sentiment classification (MASC), which predicts sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs), is an emerging task driven by the growth of user-generated multimodal content on social platforms. Despite extensive efforts and significant achievements in existing MASC methods, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of the emotions evoked by image content). In this study, we present Chimera, a cognitive and aesthetic sentiment causality understanding framework that derives fine-grained holistic features of aspects and infers the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). The framework aligns visual patches with words, extracts coarse- and fine-grained visual features, translates them into textual descriptions, and uses LLM-generated sentiment causes and impressions to heighten sensitivity to affective cues. Experiments on MASC datasets demonstrate the model's effectiveness and its greater flexibility compared with LLMs such as GPT-4o.
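The abstract's patch-word alignment step can be pictured as a small cross-attention module in which word tokens attend over visual patches. The sketch below is not the authors' released code; the module name, embedding size, and ViT-style patch grid are illustrative assumptions, and it shows only the general technique the abstract names.

```python
# Minimal sketch of patch-word alignment via cross-attention (assumed design,
# not the Chimera implementation): each word token queries the visual patches
# so aspect words are grounded in image regions.
import torch
import torch.nn as nn


class PatchWordAlignment(nn.Module):
    """Cross-attention that grounds word embeddings in visual patches."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, words: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # words:   (batch, num_words, dim)   text-token embeddings
        # patches: (batch, num_patches, dim) visual-patch embeddings, e.g. from a ViT
        aligned, _ = self.attn(query=words, key=patches, value=patches)
        return self.norm(words + aligned)  # residual fusion of visual evidence


if __name__ == "__main__":
    align = PatchWordAlignment()
    words = torch.randn(2, 16, 768)     # toy sentence of 16 tokens
    patches = torch.randn(2, 196, 768)  # toy 14x14 ViT patch grid
    print(align(words, patches).shape)  # torch.Size([2, 16, 768])
```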
External IDs: dblp:journals/taffco/XiaoMZLJHC25