MSKR: Advancing Multi-modal Structured Knowledge Representation with Synergistic Hard Negative Samples

Published: 01 Jan 2024 · Last Modified: 18 May 2025 · CIKM 2024 · License: CC BY-SA 4.0
Abstract: Despite the notable progress achieved by large-scale vision-language pre-training models across a wide range of multi-modal tasks, their performance often falls short on image-text matching challenges that require an in-depth understanding of structured representations. For instance, these models struggle to distinguish between texts or images that are broadly similar but differ in structured knowledge, such as entities and relationships in text, or objects and object attributes in images. In this paper, we propose MSKR, which advances Multi-modal Structured Knowledge Representation with synergistic hard negative samples, thereby significantly improving the model's matching capability on such data. Specifically, our model comprises a structured knowledge-enhanced encoder designed to strengthen the structured knowledge inherent in textual data (entities, their attributes, and the relationships among entities) as well as the structured knowledge within images (objects and their attributes). To further refine the model's learning process, we generate challenging negative samples for both images and texts. Extensive experimental evaluations on the Winoground, InpaintCOCO, and MSCOCO benchmarks reveal that MSKR significantly outperforms the baseline model, improving structured representation learning by 2.66% on average. Moreover, general representation results show that our model not only excels in structured representation learning but also maintains its proficiency in general representation learning.
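The abstract does not spell out MSKR's training objective, so the following is only a minimal, hypothetical sketch of how hard negatives from both modalities could enter a symmetric InfoNCE-style image-text contrastive loss; all function and tensor names here are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: a generic in-batch image-text contrastive loss
# extended with one explicit hard negative per example in each modality,
# as one plausible reading of "synergistic hard negative samples".
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img_emb, txt_emb,
                                         hard_img_emb, hard_txt_emb,
                                         temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of matched image-text pairs.
    hard_img_emb, hard_txt_emb: (B, D) embeddings of structure-perturbed
    hard negatives (e.g., a caption with swapped relations, or an image
    with an edited object attribute). Names are hypothetical.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    hard_img = F.normalize(hard_img_emb, dim=-1)
    hard_txt = F.normalize(hard_txt_emb, dim=-1)
    targets = torch.arange(img.size(0), device=img.device)

    # Image-to-text: each image scored against all in-batch texts (B, B),
    # plus its own hard negative text appended as an extra column.
    logits_i2t = img @ txt.t() / temperature
    hard_t = (img * hard_txt).sum(-1, keepdim=True) / temperature
    loss_i2t = F.cross_entropy(torch.cat([logits_i2t, hard_t], dim=1), targets)

    # Symmetric text-to-image direction with hard image negatives.
    logits_t2i = txt @ img.t() / temperature
    hard_i = (txt * hard_img).sum(-1, keepdim=True) / temperature
    loss_t2i = F.cross_entropy(torch.cat([logits_t2i, hard_i], dim=1), targets)

    return 0.5 * (loss_i2t + loss_t2i)
```

Appending the hard negative as an extra logit column keeps the positive pair on the diagonal while forcing the model to separate each example from a near-duplicate that differs only in structured knowledge, which is the failure mode the abstract targets.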
