Abstract: Due to the explosive growth in data sources and label categories, multi-view multi-label learning has garnered widespread attention. However, in practice, multi-view multi-label data often exhibit incomplete features and only a few labeled instances alongside a large number of unlabeled ones, owing to the technical limitations of data collection and the high cost of manual annotation. Learning under such simultaneous missingness of view features and labels is crucial but rarely studied, particularly when fully observed labeled samples are scarce. In this paper, we tackle this problem by proposing a novel Deep Incomplete Multi-View Semi-Supervised Multi-Label Learning method (DIMvSML). Specifically, to improve the high-level representations of instances with missing features, DIMvSML first employs deep graph networks to recover feature information using structural similarity relations. Meanwhile, we design structure-specific deep feature extractors to obtain discriminative information and preserve cross-view consistency of the recovered data with an instance-level contrastive loss. Furthermore, to eliminate the bias of the risk estimate that semi-supervised multi-label methods minimise, we design a safe risk-estimation framework with an unbiased loss and improve its empirical performance by using pseudo-labels of unlabeled data. In addition, we provide both a theoretical proof that the debiased estimator has lower variance and an intuitive explanation of our framework. Finally, extensive experimental results on public datasets validate the superiority of DIMvSML compared with state-of-the-art methods.
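To make the recovery step concrete, below is a minimal sketch of graph-based feature imputation, assuming a kNN similarity graph and a single round of mean-neighbour aggregation; the function names (knn_graph, recover_missing), the choice of k, and the aggregation scheme are illustrative assumptions, not the paper's actual deep graph network.

import torch

def knn_graph(x: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Boolean kNN adjacency from pairwise Euclidean distances (no self-loops)."""
    dist = torch.cdist(x, x)                    # (n, n) pairwise distances
    dist.fill_diagonal_(float("inf"))           # exclude self-matches
    idx = dist.topk(k, largest=False).indices   # k nearest neighbours per row
    n = x.size(0)
    adj = torch.zeros(n, n, dtype=torch.bool)
    adj[torch.arange(n).unsqueeze(1), idx] = True
    return adj

def recover_missing(feat: torch.Tensor, observed: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """Impute rows of `feat` whose mask in `observed` is False by averaging
    the features of their observed graph neighbours."""
    w = adj.float() * observed.float().unsqueeze(0)    # keep observed neighbours only
    w = w / w.sum(dim=1, keepdim=True).clamp(min=1.0)  # row-normalise weights
    imputed = w @ feat                                 # neighbour mean per instance
    return torch.where(observed.unsqueeze(1), feat, imputed)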
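The cross-view consistency term can be illustrated with a standard instance-level InfoNCE loss, in which the two views of the same instance form the positive pair and all other instances in the batch act as negatives; this is a common formulation and may differ in detail from the paper's loss.

import torch
import torch.nn.functional as F

def instance_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """InfoNCE over a batch: row i of z1 and row i of z2 are two views of
    the same instance (positive pair); every other row is a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                          # (n, n) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrise over the two view directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))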
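One generic way to build an unbiased semi-supervised risk with pseudo-labels is to add a zero-mean correction term: the pseudo-label loss on unlabeled data minus the same pseudo-label loss evaluated on labeled data, so the correction adds no bias in expectation while reducing variance when the pseudo-labels are informative. The sketch below follows this generic construction for the multi-label case; the function name, the 0.5 threshold, and the weight lam are illustrative assumptions, and the paper's safe estimator may differ.

import torch
import torch.nn.functional as F

def debiased_risk(model, xl, yl, xu, lam: float = 1.0) -> torch.Tensor:
    """Supervised multi-label risk plus a zero-mean pseudo-label correction:
    pseudo-label loss on unlabeled data minus the same loss on labeled data.
    Both correction terms share the same expectation over the data
    distribution, so the estimator stays unbiased."""
    logits_l, logits_u = model(xl), model(xu)
    with torch.no_grad():                       # pseudo-labels from the current model
        pl_l = (torch.sigmoid(logits_l) > 0.5).float()
        pl_u = (torch.sigmoid(logits_u) > 0.5).float()
    sup = F.binary_cross_entropy_with_logits(logits_l, yl)
    correction = (F.binary_cross_entropy_with_logits(logits_u, pl_u)
                  - F.binary_cross_entropy_with_logits(logits_l, pl_l))
    return sup + lam * correction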
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Systems] Data Systems Management and Indexing, [Content] Media Interpretation
Relevance To Conference: Our work targets multi-view multi-label data. Our deep-learning-based method handles both missing views and scarce labeled instances. In multi-view multimedia annotation tasks, video, audio, and subtitles serve as distinct views. It is common that not all multimedia content includes all three views, and many instances lack annotations due to resource constraints and annotation complexity. Our experiments show that our approach works well in such scenarios. Therefore, our method is well suited to processing real-world multimedia data such as images, text, and audio.
Supplementary Material: zip
Submission Number: 3843