Robust Variational Contrastive Learning for Partially View-unaligned Clustering

Published: 20 Jul 2024, Last Modified: 06 Aug 2024, MM 2024 Poster, CC BY 4.0
Abstract: Although multi-view learning has achieved remarkable progress over the past decades, most existing methods implicitly assume that all views (or modalities) are well aligned. In practice, however, collecting fully aligned views is challenging due to complexities and discordances in time and space, resulting in the Partially View-unaligned Problem (PVP), such as audio-video asynchrony caused by network congestion. While some methods have been proposed to align the unaligned views by learning view-invariant representations, almost all of them overlook the view-specific information that provides complementarity, limiting performance improvement. To address these problems, we propose a robust framework, dubbed \textbf{V}ariat\textbf{I}onal Con\textbf{T}r\textbf{A}stive \textbf{L}earning (VITAL), designed to learn common and specific information simultaneously. Specifically, each data sample is first modeled as a Gaussian distribution in the latent space, where the mean estimates the most probable common information and the variance indicates view-specific information. Second, by using variational inference, VITAL conducts intra- and inter-view contrastive learning to preserve common and specific semantics in the distribution representations, thereby achieving comprehensive perception. As a result, the common representation (mean) can be used to guide category-level realignment, while the specific representation (variance) complements the sample's semantic information, boosting overall performance. Finally, considering the abundance of False Negative Pairs (FNPs) generated by unsupervised contrastive learning, we propose a robust loss function that seamlessly incorporates FNP rectification into the contrastive learning paradigm. Empirical evaluations on eight benchmark datasets show that VITAL outperforms ten state-of-the-art deep clustering baselines, demonstrating its efficacy in both partially and fully aligned scenarios. The code is available at \url{https://github.com/He-Changhao/2024-MM-VITAL}.
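To make the distribution representation concrete, below is a minimal sketch (not the authors' implementation) of the idea described in the abstract: each view is encoded as a Gaussian whose mean carries the common, view-invariant information and whose variance carries view-specific information, and an InfoNCE-style objective aligns the views. The layer sizes, class names (`ViewEncoder`, `info_nce`), temperature, and the particular pairing used for the intra-view term are all assumptions for illustration; the actual VITAL objectives are defined in the paper and repository.

```python
# Hedged sketch: per-view Gaussian encoding with reparameterization and a
# simple cross-view InfoNCE loss. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewEncoder(nn.Module):
    """Maps one view's features to the parameters of a latent Gaussian."""
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.mu_head = nn.Linear(512, latent_dim)      # common information (mean)
        self.logvar_head = nn.Linear(512, latent_dim)  # view-specific information (variance)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        # Reparameterization trick: differentiable sample z ~ N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return mu, logvar, z

def info_nce(a, b, temperature: float = 0.5):
    """Standard InfoNCE treating (a_i, b_i) as the only positive pair per row."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Toy two-view batch (aligned pairs used here purely for illustration).
enc1, enc2 = ViewEncoder(in_dim=784), ViewEncoder(in_dim=784)
x1, x2 = torch.randn(32, 784), torch.randn(32, 784)
mu1, _, z1 = enc1(x1)
mu2, _, z2 = enc2(x2)
inter_view = info_nce(mu1, mu2)  # pull common information together across views
intra_view = info_nce(z1, mu1)   # keep samples of a view close to their own distribution
loss = inter_view + intra_view
```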
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: This work contributes to multimedia/multimodal processing in several significant ways:
1. Addressing the Partially View-unaligned Problem (PVP): By focusing on the challenge of unaligned views, such as audio-video asynchrony, this work tackles a common and practical issue in multimedia processing that affects the synchronization and quality of multimodal data.
2. Variational Contrastive Learning: The proposed VITAL framework introduces a novel approach by combining variational inference and contrastive learning. This method allows for the learning of both common and specific properties of data, which is crucial for handling the diversity and complexity of multimedia content.
3. Enhanced Data Realignment: By using the mean and variance of the posterior Gaussian distribution to guide category-level realignment and complement sample information, VITAL offers a more nuanced way of aligning multimodal data, potentially leading to more accurate multimedia analysis.
4. Robust Loss Function: The introduction of a robust loss function to rectify False Negative Pairs (FNPs) addresses a significant challenge in unsupervised learning, which is particularly relevant for large-scale, unlabeled multimedia datasets (see the sketch after this list).
5. Empirical Validation: The extensive experiments conducted on eight benchmark datasets and the comparison with ten state-of-the-art deep clustering baselines provide strong empirical evidence of VITAL's effectiveness in both partially and fully aligned scenarios.
Overall, this work pushes the boundaries of current methodologies in multimedia/multimodal processing by providing a more robust and nuanced framework for dealing with unaligned multimodal data, which could lead to significant improvements in the field.
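For point 4 above, the paper's exact FNP rectification is defined in the submission itself; as a hedged stand-in, the sketch below uses a common mitigation: off-diagonal pairs whose similarity to the anchor exceeds a threshold are treated as suspected false negatives and removed from the contrastive denominator so that likely same-cluster samples are not pushed apart. The function name `robust_info_nce`, the threshold value, and the zero-weighting scheme are assumptions, not the authors' method.

```python
# Hedged sketch of FNP-aware contrastive learning (a common down-weighting
# heuristic, NOT the rectification used in VITAL). Threshold is an assumption.
import torch
import torch.nn.functional as F

def robust_info_nce(a, b, temperature: float = 0.5, fnp_threshold: float = 0.8):
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    sim = a @ b.t()                               # cosine similarity matrix
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)

    # Suspected false negatives: off-diagonal pairs that are "too similar".
    suspected_fnp = (sim > fnp_threshold) & ~eye
    weights = torch.ones_like(sim)
    weights[suspected_fnp] = 0.0                  # drop them from the denominator

    logits = sim / temperature
    log_prob = logits - torch.log((weights * logits.exp()).sum(dim=1, keepdim=True))
    return -log_prob.diagonal().mean()            # positives sit on the diagonal

# Example usage with latent means from two views (see the earlier sketch).
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = robust_info_nce(z1, z2)
```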
Supplementary Material: zip
Submission Number: 3430