Abstract: LiDAR-based 3D detection, an essential technique in multimedia applications such as augmented reality and autonomous driving, has made great progress in recent years. However, the performance of a well-trained 3D detector degrades considerably when the detector is deployed in unseen environments, due to the severe domain gap. Traditional unsupervised domain adaptation methods, including co-training and mean-teacher frameworks, do not effectively bridge this gap: they struggle with noisy, incomplete pseudo-labels and fail to capture domain-invariant features. In this work, we introduce a novel Co-training Mean-Teacher (CMT) framework for unsupervised domain adaptation in 3D object detection. Our framework enhances adaptation by leveraging both source and target domain data to construct a hybrid domain that aligns domain-specific features more effectively. We employ hard instance mining to enrich the target domain feature distribution and utilize class-aware contrastive learning to refine feature representations across domains. Additionally, we develop batch adaptive normalization to dynamically fine-tune the batch normalization parameters of the teacher model, promoting more stable and reliable learning. Extensive experiments across various benchmarks, including Waymo, nuScenes, and KITTI, demonstrate the superiority of our CMT over state-of-the-art approaches in different adaptation scenarios.
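To make the teacher-update mechanics concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a mean-teacher EMA weight update combined with a blend of batch-normalization running statistics toward target-domain batch statistics, approximating the "batch adaptive normalization" idea described in the abstract. Parameters are modeled as plain dicts of scalars; all names and the momentum/alpha values are illustrative assumptions.

```python
# Hypothetical sketch of mean-teacher updates with BN-statistic adaptation.
# Parameters are stored as {name: value} dicts of scalars for illustration;
# in a real detector these would be tensors per layer.

def ema_update(teacher_params, student_params, momentum=0.999):
    """Exponential moving average: slowly pull teacher weights toward the student."""
    return {
        name: momentum * teacher_params[name]
              + (1.0 - momentum) * student_params[name]
        for name in teacher_params
    }

def adapt_bn_stats(teacher_stats, target_batch_stats, alpha=0.1):
    """Shift the teacher's BN running mean/variance a small step (alpha)
    toward the statistics of the current target-domain batch."""
    return {
        name: (1.0 - alpha) * teacher_stats[name]
              + alpha * target_batch_stats[name]
        for name in teacher_stats
    }
```

In this sketch the standard EMA handles the weights, while `adapt_bn_stats` separately nudges the normalization statistics toward the target domain, so the teacher's feature statistics track the unseen environment rather than the source domain.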
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work contributes to the multimedia and multimodal processing fields by addressing a key challenge in 3D object detection across different environmental conditions. Its innovative adaptation strategies facilitate seamless integration of data across domains, ensuring high system performance even in previously unseen environments. This capability is crucial for applications like autonomous vehicles, which must interpret complex scenes in real time, and augmented reality systems, where accurate 3D object detection supports user interaction and immersion in digitally enhanced environments. Moreover, by improving feature discrimination across domains, our work lays the foundation for more sophisticated multimedia systems capable of understanding and interacting with the physical world in a nuanced and reliable manner. This advancement opens new possibilities for deploying multimedia technologies in various applications, from navigation aids for the visually impaired to more interactive and responsive smart city infrastructures, showcasing the broad applicability and relevance of our contribution to multimedia and multimodal processing.
Supplementary Material: zip
Submission Number: 4670