Abstract: 3D novelty detection plays a crucial role in real-world applications, especially in safety-critical fields such as autonomous driving and intelligent surveillance. However, existing 3D novelty detection methods are constrained by the scarcity of 3D data, which can prevent the model from learning adequate representations and thereby degrades detection accuracy. To address this challenge, we propose a Unified Learning Framework (UniL) for 3D novelty detection. During pretraining, UniL helps the point cloud encoder absorb information from other modalities by aligning visual, textual, and 3D features within the same feature space. We further introduce a novel Multimodal Supervised Contrastive Loss (MSC Loss) that exploits label information during pretraining to better cluster samples of the same category in feature space. Finally, we propose a simple yet effective scoring method, Depth Map Error (DME), which scores a sample by the discrepancy between the depth maps projected from the point cloud before and after reconstruction. Extensive experiments on 3DOS demonstrate the effectiveness of our approach, which significantly improves the performance of the unsupervised VAE method in 3D novelty detection. The code will be made available.
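To make the DME scoring idea concrete, below is a minimal sketch of how such a score could be computed: project the input point cloud and its reconstruction to depth maps and compare them pixel-wise. The orthographic XY projection, the 64x64 resolution, and the `vae(points)` reconstruction call are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def project_depth_map(points, resolution=64):
    """Orthographically project a point cloud (N, 3) onto an XY grid,
    keeping the nearest (smallest-z) depth per pixel.
    Projection axis and resolution are illustrative choices."""
    xy = points[:, :2]
    z = points[:, 2]
    # Normalize XY coordinates to pixel indices in [0, resolution - 1].
    xy = (xy - xy.min(0).values) / (xy.max(0).values - xy.min(0).values + 1e-8)
    idx = (xy * (resolution - 1)).long()
    flat = idx[:, 0] * resolution + idx[:, 1]
    depth = torch.full((resolution * resolution,), float("inf"))
    # Scatter-min: keep the closest point falling into each pixel.
    depth = depth.scatter_reduce(0, flat, z, reduce="amin")
    depth[torch.isinf(depth)] = 0.0  # empty pixels -> background depth
    return depth.view(resolution, resolution)

def depth_map_error(points, vae):
    """Novelty score: discrepancy between the depth maps projected from
    the input point cloud and from its reconstruction. A higher error
    suggests the sample is poorly reconstructed, i.e. likely novel."""
    recon = vae(points.unsqueeze(0)).squeeze(0)  # hypothetical reconstruction API
    d_in = project_depth_map(points)
    d_rec = project_depth_map(recon)
    return F.mse_loss(d_rec, d_in).item()
```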
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work contributes to multimedia and multimodal processing by proposing a multimodal pre-training approach for point cloud novelty detection. By integrating textual descriptions and visual data into pre-training, our method aligns multimodal features within a unified feature space. This enriches the model's understanding of the underlying data and enables it to capture relationships between modalities, improving its ability to detect novelties in point cloud data even for previously unseen samples. By leveraging multimodal pre-training, our method bridges the gap between data modalities and supports more robust and accurate novelty detection in complex real-world scenarios. Beyond improving point cloud analysis, this demonstrates the value of integrating information from multiple sources for multimedia data analysis and interpretation.
Supplementary Material: zip
Submission Number: 2087