Causal Visual-semantic Correlation for Zero-shot Learning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM 2024 Poster, CC BY 4.0
Abstract:

Zero-shot learning (ZSL) correlates visual samples with shared semantic information to transfer knowledge from seen classes to unseen classes. Existing methods typically establish visual-semantic correlation by aligning visual and semantic features, extracted from visual samples and semantic information, respectively. However, instance-level images, owing to their single observation perspective and individual diversity, cannot exactly match the comprehensive semantic information defined at the class level. Direct feature alignment thus imposes correlation between mismatched vision and semantics, resulting in spurious visual-semantic correlation. To address this, we propose a novel method termed Causal Visual-semantic Correlation (CVsC) to learn substantive visual-semantic correlation for ZSL. Specifically, we utilize a Visual Semantic Attention module to facilitate interaction between vision and semantics, thereby identifying attribute-related visual features. Furthermore, we design a Conditional Correlation Loss that properly utilizes semantic information as supervision for establishing visual-semantic correlation. Moreover, we introduce counterfactual intervention on the attribute-related visual features and maximize their impact on semantic and target predictions to enhance substantive visual-semantic correlation. Extensive experiments on three benchmark datasets (i.e., CUB, SUN, and AWA2) demonstrate that CVsC outperforms existing state-of-the-art methods.
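The abstract describes the method only at a high level. As one plausible reading, the attention step can be sketched as attribute-as-query cross-attention over patch features, and the counterfactual step as contrasting factual attribute-related features against a class-agnostic baseline. Everything below (module names, tensor shapes, the mean-feature baseline, the `head` classifier) is an illustrative assumption, not the authors' released code.

```python
import torch
import torch.nn as nn

class VisualSemanticAttention(nn.Module):
    """Hypothetical sketch: class-level attribute embeddings query
    patch-level visual features to pick out attribute-related evidence."""
    def __init__(self, visual_dim: int, attr_dim: int, hidden_dim: int):
        super().__init__()
        self.q = nn.Linear(attr_dim, hidden_dim)    # attributes as queries
        self.k = nn.Linear(visual_dim, hidden_dim)  # patches as keys
        self.v = nn.Linear(visual_dim, hidden_dim)  # patches as values

    def forward(self, patches: torch.Tensor, attrs: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, visual_dim) patch features from a vision backbone
        # attrs:   (A, attr_dim)      shared class-level attribute embeddings
        q = self.q(attrs)                                 # (A, H)
        k = self.k(patches)                               # (B, N, H)
        v = self.v(patches)                               # (B, N, H)
        scores = torch.einsum('ah,bnh->ban', q, k) / q.size(-1) ** 0.5
        weights = scores.softmax(dim=-1)                  # each attribute attends over patches
        return torch.einsum('ban,bnh->bah', weights, v)   # (B, A, H) attribute-related features

def counterfactual_effect(head: nn.Module, feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical intervention: swap the attribute-related features for a
    batch-mean baseline and return the change in predictions; training would
    maximize the impact of the factual features on the outputs."""
    baseline = feats.mean(dim=0, keepdim=True).detach()   # counterfactual stand-in
    return head(feats) - head(baseline.expand_as(feats))
```

Under these assumptions, the attention output would feed both the semantic and target prediction heads, and the effect returned by `counterfactual_effect` would be maximized alongside the paper's Conditional Correlation Loss; the actual loss formulation is given in the full paper, not here.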

Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Zero-shot learning (ZSL) realizes knowledge transfer through multi-modal interaction, i.e., visual-semantic interaction, to directly identify samples of unseen classes. Our work explores the spurious visual-semantic correlation caused by the substantial discrepancy between visual and semantic information, and proposes Causal Visual-semantic Correlation (CVsC) to resolve this issue. We formulate effective visual-semantic correlation as the objective and introduce counterfactual causal intervention for further refinement. Our study encourages multimodal processing to achieve more effective modal interactions by eliminating spurious correlations. Notably, a series of papers on ZSL has been published at ACM MM.
Supplementary Material: zip
Submission Number: 1530