Visual-linguistic Cross-domain Feature Learning with Group Attention and Gamma-correct Gated Fusion for Extracting Commonsense Knowledge

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Poster · License: CC BY 4.0
Abstract: Acquiring commonsense knowledge about entity pairs from images is crucial across diverse applications. Distantly supervised learning has made significant advances by automatically retrieving images that contain entity pairs and summarizing commonsense knowledge from the resulting bags of images. However, the retrieved images may not cover all possible relations, and informative features shared across a bag of images are often overlooked. To address these challenges, a Multi-modal Cross-domain Feature Learning framework is proposed that incorporates general-domain knowledge from a large vision-text foundation model, ViT-GPT2, to handle unseen relations and exploit complementary information from multiple sources. A Group Attention module is then designed to exploit attentive information from other instances in the same bag to boost the informative features of each individual instance. Finally, a Gamma-corrected Gated Fusion module is designed to select a subset of informative instances for a comprehensive summarization of commonsense entity relations. Extensive experimental results demonstrate the superiority of the proposed method over state-of-the-art models for extracting commonsense knowledge.
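
The abstract does not include implementation details, so the following is a minimal PyTorch sketch of one plausible reading of the Group Attention and Gamma-corrected Gated Fusion modules: each instance in a bag attends to the other instances, and a sigmoid gate sharpened by a gamma-correction-style power weights the instances before pooling. All module names, dimensions, and the exact gating form are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only; not the paper's implementation.
import torch
import torch.nn as nn


class GroupAttention(nn.Module):
    """Lets each instance in a bag attend to the other instances,
    boosting features that are informative across the bag."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (num_instances, dim) -- features of one image bag
        q, k, v = self.q(bag), self.k(bag), self.v(bag)
        attn = torch.softmax(q @ k.t() / bag.size(-1) ** 0.5, dim=-1)
        # Residual connection keeps each instance's own information.
        return bag + attn @ v


class GammaGatedFusion(nn.Module):
    """Gates each instance with a sigmoid score sharpened by a
    gamma-correction-style power, then fuses the bag into one vector."""

    def __init__(self, dim: int, gamma: float = 0.5):
        super().__init__()
        self.gate = nn.Linear(dim, 1)
        self.gamma = gamma  # assumed hyperparameter; the paper's value is not given here

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(bag))      # (num_instances, 1) gate scores
        g = g.pow(self.gamma)                  # gamma correction of the gates
        w = g / g.sum(dim=0, keepdim=True)     # normalize weights over the bag
        return (w * bag).sum(dim=0)            # (dim,) bag-level summary


if __name__ == "__main__":
    bag = torch.randn(8, 256)  # e.g., 8 retrieved images with 256-d features
    fused = GammaGatedFusion(256)(GroupAttention(256)(bag))
    print(fused.shape)         # torch.Size([256])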
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Commonsense Knowledge Extraction (CKE) is crucial in multimedia/multimodal processing: it enhances understanding, enriches context, enables semantic inference, handles ambiguity, and grounds the analyzed data in real-world relevance. By leveraging commonsense knowledge, multimedia processing algorithms achieve greater sophistication and accuracy across tasks and domains. However, existing visually grounded CKE approaches face limitations: reliance on a limited set of retrieved images that may not cover all possible relations between queried entity pairs, and mislabeled images in datasets constructed automatically via distant supervision. To address these issues, this work introduces a Multi-modal Cross-domain Feature Learning framework to extract visual-linguistic features, a Group Attention module to enhance individual instance features, and a Gamma-corrected Gated Fusion module to combine these instances. Experimental results demonstrate the superiority of the proposed method over state-of-the-art CKE models.
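
For the visual-linguistic feature extraction step, the sketch below shows how general-domain features might be pulled from a ViT-GPT2 model, using the public HuggingFace checkpoint "nlpconnect/vit-gpt2-image-captioning" as a stand-in; whether this matches the paper's exact checkpoint and feature-extraction recipe is an assumption, so treat it as illustrative only.

```python
# Hedged sketch: visual features from the ViT encoder plus a generated
# caption from the GPT-2 decoder, as two complementary feature sources.
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

ckpt = "nlpconnect/vit-gpt2-image-captioning"  # assumed stand-in checkpoint
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # hypothetical retrieved image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Visual branch: patch-level features from the ViT encoder.
    visual_feats = model.encoder(pixel_values).last_hidden_state  # (1, 197, 768)
    # Linguistic branch: a generated caption verbalizing the scene.
    caption_ids = model.generate(pixel_values, max_length=16)

caption = tokenizer.decode(caption_ids[0], skip_special_tokens=True)
print(visual_feats.shape, caption)
```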
Supplementary Material: zip
Submission Number: 492