Abstract: The multimodal aspect-based sentiment classification (mABSC) task aims to recognize the sentiment polarities of aspect entities based on the associated textual and visual resources. Inspired by the cross-modal alignment ability of Transformers, some recent mABSC methods employ Transformers to discover the relevance between an aspect entity and visual regions, so that relevant visual regions can be identified and leveraged to help recognize the sentiment polarity of the aspect entity. However, because the training data available for mABSC is limited, the Transformers often attend aspect entities to irrelevant visual information, which does not benefit sentiment polarity recognition. To address this issue, we introduce external knowledge, including textual syntax and cross-modal relevancy knowledge. The basic idea is to cut off irrelevant connections, both within the textual modality and across modalities, in the Transformer layers using a knowledge-induced matrix. To let the matrix go beyond capturing direct relations, we develop a mechanism that enables it to reflect multi-hop relations, followed by a discretization operation that filters out extreme relevancy values. Extensive experiments on two public multimodal datasets show that our method outperforms all competing baselines. Further studies demonstrate the effectiveness of each component and suggest that the introduced external knowledge instructs the model to learn correct relevance among textual and cross-modal features, thereby benefiting the mABSC task.
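The sketch below illustrates the general idea described in the abstract, not the authors' actual implementation: a knowledge-induced relevancy matrix is expanded to multi-hop relations, discretized into a binary mask, and used to sever irrelevant attention links in a Transformer layer. All names, the hop count, and the threshold are hypothetical choices made for illustration.

```python
import torch
import torch.nn.functional as F


def multi_hop_mask(relevancy: torch.Tensor, hops: int = 2, threshold: float = 0.5) -> torch.Tensor:
    """Expand a direct-relation relevancy matrix to multi-hop relations, then discretize.

    relevancy: (n, n) knowledge-induced relevance scores in [0, 1]
               (e.g., syntactic adjacency among text tokens plus
               token-to-visual-region relevance scores).
    """
    accumulated = torch.zeros_like(relevancy)
    power = torch.eye(relevancy.size(0))
    for _ in range(hops):
        power = power @ relevancy                  # k-hop relevance
        accumulated = accumulated + power
    # Discretization: keep only sufficiently relevant connections;
    # always keep self-connections so no row is fully masked.
    keep = (accumulated > threshold) | torch.eye(relevancy.size(0), dtype=torch.bool)
    return keep.float()


def masked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with irrelevant links cut off by the mask."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(mask == 0, float("-inf"))  # sever irrelevant connections
    return F.softmax(scores, dim=-1) @ v


# Toy usage: 4 textual tokens + 2 visual regions = 6 positions.
n, d = 6, 8
relevancy = torch.rand(n, n)          # stand-in for knowledge-induced scores
mask = multi_hop_mask(relevancy, hops=2, threshold=0.5)
q = k = v = torch.rand(n, d)
out = masked_attention(q, k, v, mask)
print(out.shape)                      # torch.Size([6, 8])
```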