Abstract: With the popularity and widespread use of social media platforms such as Twitter and Facebook, massive amounts of text and image content posted by a wide variety of users have flooded these platforms. Multimodal named entity recognition (MNER), which extracts named entities from such multimodal data, has therefore become a research hotspot. Empirically, visual cues that are unrelated to the text may introduce uncertainty or even harm named entity recognition, yet previous studies have largely ignored the relevance between the two modalities. In this article, to effectively measure the relationship between text and visual cues and thereby improve the accuracy of named entity recognition, we propose a text-image scene graph fusion (TISGF) approach with a text-image similarity assessment module (TISA) and a text-image fusion module (TIF) for MNER. Specifically, we first construct two scene graphs (visual and textual) to exploit the joint features of objects and relations in the image and text, and encode the two scene graphs separately with a dedicated encoder pair; in this way, we obtain cross-modal features at both the object level and the relation level. Subsequently, TISA computes the similarity between the image and the text and determines the proportion of visual information to retain for fusion. Finally, TIF produces a unified multimodal representation for each word, and a conditional random field predicts the entity type. Extensive experimental results on two public datasets demonstrate the effectiveness and competitiveness of the proposed method for the MNER task.
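To make the similarity-gated fusion idea concrete, the sketch below illustrates one plausible reading of the TISA/TIF pipeline: a text-image similarity score gates how much visual context is retained before per-word fusion. The module name, tensor shapes, cosine-similarity gate, and attention formulation are our own simplifying assumptions for illustration, not the authors' exact architecture.

```python
# Minimal sketch of similarity-gated text-image fusion (an assumed design, not
# the paper's exact TISA/TIF implementation). Token and region features are
# assumed to be pre-encoded, e.g. by the scene-graph encoders described above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityGatedFusion(nn.Module):
    """Gate visual features by a text-image similarity score, then fuse per token."""

    def __init__(self, dim: int):
        super().__init__()
        # Combines each token with its (gated) visual context.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (seq_len, dim) token-level features from the textual scene graph
        # img_feats:  (num_regions, dim) region-level features from the visual scene graph

        # TISA-style step (assumption): cosine similarity between pooled text and
        # image representations decides how much visual information to retain.
        sim = F.cosine_similarity(text_feats.mean(0, keepdim=True),
                                  img_feats.mean(0, keepdim=True), dim=-1)
        gate = sim.clamp(min=0.0)  # keep visual input only when the modalities agree

        # Cross-modal attention: each token attends over image regions, then the
        # attended visual context is scaled by the similarity gate.
        attn = torch.softmax(text_feats @ img_feats.t() / text_feats.size(-1) ** 0.5, dim=-1)
        visual_ctx = gate * (attn @ img_feats)  # (seq_len, dim)

        # TIF-style step (assumption): concatenate and project to a unified
        # multimodal representation per word; a CRF tagger would sit on top.
        return self.fuse(torch.cat([text_feats, visual_ctx], dim=-1))


if __name__ == "__main__":
    torch.manual_seed(0)
    fusion = SimilarityGatedFusion(dim=64)
    tokens = torch.randn(12, 64)   # 12 tokens in the sentence
    regions = torch.randn(5, 64)   # 5 image regions / scene-graph objects
    print(fusion(tokens, regions).shape)  # torch.Size([12, 64])
```

In this reading, an unrelated image yields a low (clamped) similarity score, so little or no visual information reaches the fused word representations, which matches the motivation stated in the abstract.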