Abstract: Despite progress in 3D object recognition using deep learning (DL), challenges such as domain shift, occlusion, and viewpoint variation hinder robust performance. In addition, high computational cost and a lack of labeled data limit real-time deployment in applications such as autonomous driving and robotic manipulation. To address these challenges, we propose 3D-CDNeT, a novel cross-domain deep learning network designed for unsupervised learning, enabling efficient and robust point cloud recognition. At the core of our model is a lightweight graph-infused attention encoder (GIAE) that enables effective feature interaction between the source and target domains; it not only improves recognition accuracy but also reduces inference time, which is essential for real-time applications. To enhance robustness and adaptability, we introduce a feature invariance learning module (FILM) that uses a contrastive loss to learn domain-invariant features. In addition, we adopt a generative decoder (GD) based on a variational auto-encoder (VAE) to model diverse latent spaces and reconstruct meaningful 3D structures from point clouds. This reconstruction acts as a self-supervised generative objective that complements the discriminative recognition task, guiding the encoder to learn structure-preserving, domain-invariant features that improve recognition under occlusion and cross-domain conditions. The proposed model unifies the generative and discriminative tasks by applying self-attention to the object covariance matrix, facilitating efficient information exchange and enabling the extraction of both local and global features. We further develop a self-supervised pretraining strategy that learns global and local object invariances through the GIAE and GD, respectively. A new loss function, combining a contrastive loss with the Chamfer distance, is proposed to strengthen cross-domain feature alignment. Experimental results on three benchmark datasets demonstrate that 3D-CDNeT outperforms existing state-of-the-art (SOTA) methods in recognition accuracy and inference speed, offering a practical solution for real-time 3D perception. It achieves accuracies of 90.6% on ModelNet40, 95.2% on ModelNet10, and 76.4% on ScanObjectNN under linear evaluation, while reducing runtime by 45% without compromising performance. Detailed qualitative comparisons and ablation studies validate the effectiveness of each component and demonstrate the superior performance of the proposed method.
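To make the combined training objective concrete, the sketch below shows one plausible form of a loss that pairs a contrastive alignment term (for matched source/target embeddings) with a Chamfer-distance reconstruction term (for the generative decoder's output). This is not the authors' released code: the function names, the InfoNCE-style formulation, the temperature, and the weighting factor lambda_cd are illustrative assumptions based only on what the abstract states.

```python
# Minimal sketch of a contrastive + Chamfer-distance objective.
# Assumptions (not from the paper): InfoNCE-style contrastive term,
# temperature 0.1, and a hypothetical weighting factor lambda_cd.
import torch
import torch.nn.functional as F


def info_nce(z_src: torch.Tensor, z_tgt: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss pulling matched source/target embeddings together."""
    z_src = F.normalize(z_src, dim=-1)          # (B, D)
    z_tgt = F.normalize(z_tgt, dim=-1)          # (B, D)
    logits = z_src @ z_tgt.t() / temperature    # (B, B) cosine-similarity logits
    labels = torch.arange(z_src.size(0), device=z_src.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)


def chamfer_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point clouds p (B, N, 3) and q (B, M, 3)."""
    d = torch.cdist(p, q)                       # (B, N, M) pairwise Euclidean distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()


def combined_loss(z_src, z_tgt, recon_pts, orig_pts, lambda_cd: float = 1.0) -> torch.Tensor:
    """Contrastive cross-domain alignment plus generative reconstruction, jointly minimized."""
    return info_nce(z_src, z_tgt) + lambda_cd * chamfer_distance(recon_pts, orig_pts)
```

Under this reading, the contrastive term drives cross-domain feature alignment while the Chamfer term forces the reconstructed point cloud to preserve object structure, which is consistent with the abstract's claim that the generative objective complements the discriminative recognition task.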
DOI: 10.1016/j.neucom.2025.131939