Swin-MGNet: Swin Transformer Based Multiview Grouping Network for 3-D Object Recognition

Published: 01 Jan 2025, Last Modified: 22 Jul 2025 · IEEE Trans. Artif. Intell. 2025 · CC BY-SA 4.0
Abstract: Recent developments in the Swin Transformer have shown its great potential in various computer vision tasks, including image classification, semantic segmentation, and object detection. However, it is challenging to achieve the desired performance by directly employing the Swin Transformer in multiview 3-D object recognition, since the Swin Transformer extracts the characteristics of each view independently and relies heavily on a subsequent fusion strategy to unify the multiview information. This leads to insufficient extraction of the interdependencies among the multiview images. To this end, we propose an aggregation strategy integrated into the Swin Transformer that reinforces the connections between internal features across multiple views, thus yielding a complete interpretation of the isolated features extracted by the Swin Transformer. Specifically, we utilize the Swin Transformer to learn view-level feature representations from multiview images and then calculate their view discrimination scores. These scores are used to assign the view-level features to different groups. Finally, a grouping and fusion network is proposed to aggregate the features at both the view and group levels. The experimental results indicate that our method attains state-of-the-art performance compared with prior approaches on multiview 3-D object recognition tasks. The source code is available at https://github.com/Qishaohua94/DEST.
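The score-based grouping and fusion pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the Swin Transformer backbone is omitted (per-view features are taken as given), and the scoring function, group assignment rule, and fusion weights are simple stand-ins chosen for clarity.

```python
import numpy as np

def group_and_fuse(view_feats, num_groups=4):
    """Illustrative sketch: assign per-view features to groups by a
    discrimination score, then fuse at the group and object levels.

    view_feats: (V, D) array of view-level features (assumed to come
    from a backbone such as a Swin Transformer, not reproduced here).
    """
    # 1. Score each view. Here a sigmoid of the mean activation serves
    #    as a placeholder for the paper's view discrimination score.
    scores = 1.0 / (1.0 + np.exp(-view_feats.mean(axis=1)))
    # 2. Assign views to groups by quantizing the score range into
    #    equal-width bins (one possible grouping rule).
    edges = np.linspace(scores.min(), scores.max() + 1e-8, num_groups + 1)
    group_ids = np.clip(np.digitize(scores, edges) - 1, 0, num_groups - 1)
    # 3. Group-level fusion: score-weighted mean of each group's views.
    group_feats = []
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            w = scores[mask] / scores[mask].sum()
            group_feats.append((w[:, None] * view_feats[mask]).sum(axis=0))
    # 4. Object-level descriptor: average over the non-empty groups.
    return np.stack(group_feats).mean(axis=0)
```

A classifier head would then operate on the fused (D,)-dimensional descriptor; the key idea the sketch conveys is that views contribute unequally, weighted by their discrimination scores, rather than being pooled uniformly.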