CTNet: A CNN-Transformer Hybrid Network for 6D Object Pose Estimation

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: neural networks, 6D pose estimation, RGB-D image
Abstract:

Recent advances in 6D pose estimation rely primarily on CNNs, which struggle to capture the long-range dependencies and global context essential for precise pose determination. Deeper or wider networks are commonly used to compensate, but they incur significant computational cost without fully resolving these limitations. To overcome these challenges, we present CTNet, a hybrid network that combines the strengths of CNNs and Transformers for accurate 6D pose estimation from a single RGB-D image. CTNet employs a Transformer to capture long-range dependencies and global context, while lightweight CNNs extract detailed local features. This complementary design yields a comprehensive feature representation without requiring excessively deep networks. To further improve the CNNs' efficiency, we introduce the Hierarchical Feature Extractor (HFE), which enhances the C2f and ELAN modules for more effective feature extraction. We also integrate a CNN-based PointNet module that extracts spatial information from the point cloud. The global context captured by the Transformer is then fused with the local and spatial features extracted by the CNNs to produce the final 6D pose estimate. Experiments on the LineMOD and YCB-Video datasets demonstrate that CTNet achieves high accuracy with nearly half the FLOPs of current methods. Furthermore, the HFE is highly adaptable, transferring well to other 6D pose estimation architectures.
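To make the three-branch fusion described in the abstract concrete, the following is a minimal PyTorch sketch reconstructed from the abstract alone. All module names, layer sizes, the concatenation-based fusion, and the quaternion-plus-translation pose head are illustrative assumptions; the paper's actual HFE, PointNet, and Transformer configurations are not reproduced here.

```python
# Minimal sketch of the CNN-Transformer fusion described in the abstract.
# Assumptions (not from the paper): dimensions, the plain CNN standing in
# for the HFE, concatenation as the fusion step, and the pose head layout.
import torch
import torch.nn as nn

class CTNetSketch(nn.Module):
    def __init__(self, feat_dim=256, num_heads=8, num_layers=4):
        super().__init__()
        # Lightweight CNN branch for local RGB features (stand-in for the HFE).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer encoder capturing long-range / global context over CNN tokens.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        # PointNet-style shared MLP over the point cloud lifted from the depth map.
        self.pointnet = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1), nn.ReLU(),
        )
        # Pose head: fused features -> rotation (quaternion) + translation.
        self.pose_head = nn.Linear(3 * feat_dim, 7)

    def forward(self, rgb, points):
        # rgb: (B, 3, H, W); points: (B, 3, N)
        local = self.cnn(rgb)                      # (B, C, H', W') local features
        tokens = local.flatten(2).transpose(1, 2)  # (B, H'*W', C) as tokens
        global_ctx = self.transformer(tokens)      # (B, H'*W', C) global context
        spatial = self.pointnet(points)            # (B, C, N) per-point features
        fused = torch.cat([
            tokens.mean(dim=1),         # pooled local CNN features
            global_ctx.mean(dim=1),     # pooled global Transformer context
            spatial.max(dim=2).values,  # PointNet max-pooled spatial features
        ], dim=1)                       # (B, 3C)
        out = self.pose_head(fused)     # (B, 7)
        quat = nn.functional.normalize(out[:, :4], dim=1)  # unit-norm rotation
        trans = out[:, 4:]                                  # translation
        return quat, trans
```

The division of labor the abstract emphasizes, lightweight CNNs for local detail plus a Transformer for global context, appears here as a single CNN token stream feeding both the pooled local descriptor and the Transformer encoder, so the global branch adds context without a second heavy backbone.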

Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6795