Deep Incomplete Multi-View Network Semi-Supervised Multi-Label Learning with Unbiased Loss

Published: 20 Jul 2024, Last Modified: 02 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Due to the explosive growth in data sources and label categories, multi-view multi-label learning has garnered widespread attention. In practice, however, multi-view multi-label data often exhibit incomplete features and only a few labeled instances alongside a huge number of unlabeled ones, owing to technical limitations of data collection and the high cost of manual annotation. Learning under such simultaneous missingness of view features and labels is crucial but rarely studied, particularly when labeled samples with full observations are scarce. In this paper, we tackle this problem by proposing a novel Deep Incomplete Multi-View Semi-Supervised Multi-Label Learning method (DIMvSML). Specifically, to improve high-level representations of missing features, DIMvSML first employs deep graph networks to recover feature information using structural similarity relations. Meanwhile, we design structure-specific deep feature extractors to obtain discriminative information and preserve cross-view consistency for the recovered data via an instance-level contrastive loss. Furthermore, to eliminate the bias in the risk estimate that semi-supervised multi-label methods minimise, we design a safe risk-estimation framework with an unbiased loss and improve its empirical performance by using pseudo-labels of unlabeled data. In addition, we provide both a theoretical proof of reduced estimator variance and an intuitive explanation of our debiased framework. Finally, extensive experimental results on public datasets validate the superiority of DIMvSML over state-of-the-art methods.
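To make the two key loss components concrete, below are minimal PyTorch sketches. Both are common formulations from the literature, not the paper's released code: the exact architectures, temperature, and term weighting used by DIMvSML are assumptions here.

The first sketch shows an instance-level contrastive loss between two recovered views, where the same instance across views forms the positive pair and all other cross-view pairs in the batch act as negatives (a symmetric InfoNCE-style objective):

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                              temperature: float = 0.5) -> torch.Tensor:
    """Symmetric InfoNCE-style loss between two views (illustrative sketch).

    z1, z2: (n, d) embeddings of the same n instances from two views.
    Positives are the same instance across views; every other cross-view
    pair in the batch serves as a negative.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (n, n) cross-view similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize over both matching directions (view1->view2 and view2->view1).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

The second sketch illustrates one standard construction of an unbiased semi-supervised risk estimate with pseudo-labels: the pseudo-label loss on the labeled data is subtracted, so that when labeled and unlabeled instances share the same marginal distribution the pseudo-label terms cancel in expectation. Whether this matches DIMvSML's exact estimator is an assumption; it only conveys the debiasing idea:

```python
def debiased_risk(loss_fn, logits_l, y_l, logits_u, pseudo_u, pseudo_l):
    """Debiased semi-supervised risk (illustrative, not the paper's code).

    loss_fn  : multi-label loss, e.g. F.binary_cross_entropy_with_logits
    logits_l : model outputs on labeled data; y_l their true labels
    logits_u : model outputs on unlabeled data; pseudo_u their pseudo-labels
    pseudo_l : pseudo-labels produced for the *labeled* data by the same
               pseudo-labeler, used as the bias-correction term
    """
    supervised = loss_fn(logits_l, y_l)        # unbiased supervised risk
    unlabeled  = loss_fn(logits_u, pseudo_u)   # pseudo-label risk on unlabeled data
    correction = loss_fn(logits_l, pseudo_l)   # cancels pseudo-label bias in expectation
    return supervised + unlabeled - correction
```

A full training objective would combine `instance_contrastive_loss` over pairs of views with `debiased_risk` on the classifier outputs, weighted by hyperparameters.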
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Systems] Data Systems Management and Indexing, [Content] Media Interpretation
Relevance To Conference: Our work deals with multi-view multi-label data and, building on deep learning, can handle both missing views and scarce labeled instances. In multi-view multimedia annotation tasks, video, audio, and subtitles serve as distinct views. It is common that not all multimedia content includes all three views and that many instances lack annotations due to resource constraints and annotation complexity. Our experiments show that the proposed approach works well in such scenarios, so our method is well suited to processing real-world multimedia data such as images, text, and audio.
Supplementary Material: zip
Submission Number: 3843