Abstract: Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with the corresponding images to transfer knowledge. However, they disregard that the semantic information conveyed by a document and an image is not equivalent, resulting in suboptimal alignment. In this work, we propose a novel network that extracts multi-view semantic concepts from documents and images and aligns only the matching concepts rather than all of them. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings on both the visual and textual sides, providing the basic concepts for partial alignment. To alleviate information redundancy among these embeddings, we propose the local-to-semantic variance loss to capture distinct local details and the multiple semantic diversity loss to enforce orthogonality among the embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources on three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns interpretable partial semantic associations. The code is available at https://anonymous.4open.science/r/EmDepart.
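For intuition on the diversity objective, here is a minimal PyTorch sketch of one common way to enforce orthogonality among multi-view embeddings: penalizing the off-diagonal cosine similarities of the views' Gram matrix. The function name, tensor shapes, and exact formulation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def semantic_diversity_loss(views: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity among multi-view embeddings.

    views: (B, K, D) tensor with K view embeddings per sample.
    Returns a scalar that is zero when the K views are mutually orthogonal.
    """
    k = views.size(1)
    if k < 2:
        return views.new_zeros(())
    v = F.normalize(views, dim=-1)                    # unit-norm each view embedding
    gram = torch.bmm(v, v.transpose(1, 2))            # (B, K, K) cosine similarities
    off_diag = gram - torch.eye(k, device=v.device)   # drop the self-similarity diagonal
    return off_diag.abs().sum(dim=(1, 2)).mean() / (k * (k - 1))
```

In this sketch the loss would simply be added to the overall training objective, pushing the decomposed views of each sample away from collapsing onto one another.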
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: (1) We propose a novel network that partially aligns the textual and visual spaces according to their degree of semantic relevance (see the sketch after this list). This addresses the suboptimal alignment caused by ignoring the incomplete semantic equivalence between documents and images, and sheds new light on partial semantic alignment between vision and language.
(2) To alleviate the information redundancy caused by feature collapse (multiple embeddings with only slight variance among them), we introduce the semantic decomposition module together with the local-to-semantic variance loss, which captures unique local details, and the multiple semantic diversity loss, which enhances orthogonality among the embeddings. These losses also improve the performance of previous methods.
(3) Partial association (the semantic content on the textual side is not fully present in the image) is common in many vision-language tasks, such as zero-shot image classification and cross-modal retrieval. Our method is task-agnostic and the proposed losses are model-agnostic, making them easy to transfer to other domains.
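As a rough illustration of the view-level partial alignment mentioned in (1), the sketch below scores an image-document pair by weighting visual-textual view pairs by their similarity, so that non-matching views contribute little to the pair score. The softmax weighting and temperature here are our own assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def partial_alignment_score(text_views: torch.Tensor,
                            image_views: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Aggregate only semantically relevant view pairs instead of averaging all of them.

    text_views:  (K, D) multi-view embeddings from the document.
    image_views: (K, D) multi-view embeddings from the image.
    """
    t = F.normalize(text_views, dim=-1)
    i = F.normalize(image_views, dim=-1)
    sim = t @ i.t()                                          # (K, K) pairwise cosine similarity
    weights = F.softmax(sim.flatten() / temperature, dim=0)  # emphasize matching view pairs
    return (weights * sim.flatten()).sum()                   # relevance-weighted pair score
```

A uniform average over all K x K pairs would instead force every textual view onto every visual view, which is exactly the suboptimal full alignment the method seeks to avoid.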
Supplementary Material: zip
Submission Number: 549