EFN6D: an efficient RGB-D fusion network for 6D pose estimation

Published: 01 Jan 2024, Last Modified: 08 Apr 2025 · J. Ambient Intell. Humaniz. Comput. 2024 · CC BY-SA 4.0
Abstract: Precise 6DoF (6D) object pose estimation is an essential topic for many intelligent applications, for example, robot grasping, virtual reality, and autonomous driving. Lacking depth information, traditional pose estimators that use only RGB cameras consistently predict biased 3D rotations and translations. With the wide use of RGB-D cameras, we can directly capture both the depth of the object relative to the camera and the corresponding RGB image. Most existing methods concatenate these two data sources directly, which does not make full use of their complementary relationship. Therefore, we propose an efficient RGB-D fusion network for 6D pose estimation, called EFN6D, to exploit 2D–3D features more thoroughly. Instead of directly using the original single-channel depth map, we encode the depth information into a normal map and point cloud data. To effectively fuse the surface texture features and the geometric contour features of the object, we feed the RGB image and the normal map into two ResNets. In addition, PSP modules and skip connections are used between the two ResNets, which not only improves the network's cross-modal fusion performance but also its ability to handle objects at different scales. Finally, the fused features obtained from these two ResNets and the point cloud features are densely fused point by point, further strengthening the fusion of 2D and 3D information at a per-pixel level. Experiments on the LINEMOD and YCB-Video datasets show that our EFN6D outperforms state-of-the-art methods by a large margin.
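The following is a minimal, hypothetical PyTorch sketch of the dense per-pixel RGB-D fusion idea described in the abstract: two image branches (stand-ins for the ResNet backbones; the PSP modules are omitted) encode the RGB image and the normal map, their features are exchanged through skip-style connections, sampled at the pixels corresponding to each 3D point, and concatenated with PointNet-style point features. All module and parameter names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Stand-in for a ResNet branch: maps a 3-channel map to a feature map."""
    def __init__(self, out_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class DenseRGBDFusion(nn.Module):
    """Fuses RGB texture features, normal-map geometry features, and point features per pixel."""
    def __init__(self, feat_ch: int = 64, point_ch: int = 64):
        super().__init__()
        self.rgb_branch = TinyEncoder(feat_ch)      # surface texture features
        self.normal_branch = TinyEncoder(feat_ch)   # geometric contour features
        # PointNet-style MLP over per-point xyz coordinates.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(inplace=True),
            nn.Conv1d(64, point_ch, 1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv1d(2 * feat_ch + point_ch, 128, 1)

    def forward(self, rgb, normals, points, pixel_idx):
        # rgb, normals: (B, 3, H, W); points: (B, 3, N)
        # pixel_idx: (B, N) flat indices into H*W, mapping each 3D point to its pixel.
        f_rgb = self.rgb_branch(rgb)
        f_nrm = self.normal_branch(normals)
        # Skip-connection-style cross-modal mixing between the two branches.
        f_rgb = f_rgb + f_nrm
        f_nrm = f_nrm + f_rgb
        b, c, h, w = f_rgb.shape
        # Gather image features at the pixel locations of each 3D point.
        idx = pixel_idx.unsqueeze(1).expand(-1, c, -1)        # (B, C, N)
        f_rgb_pts = f_rgb.view(b, c, h * w).gather(2, idx)    # (B, C, N)
        f_nrm_pts = f_nrm.view(b, c, h * w).gather(2, idx)    # (B, C, N)
        f_pts = self.point_mlp(points)                        # (B, C, N)
        # Dense per-point (per-pixel) fusion of 2D and 3D features.
        return self.fuse(torch.cat([f_rgb_pts, f_nrm_pts, f_pts], dim=1))


if __name__ == "__main__":
    model = DenseRGBDFusion()
    rgb = torch.randn(2, 3, 120, 160)
    normals = torch.randn(2, 3, 120, 160)
    points = torch.randn(2, 3, 500)
    pixel_idx = torch.randint(0, 120 * 160, (2, 500))
    print(model(rgb, normals, points, pixel_idx).shape)  # torch.Size([2, 128, 500])
```

In the paper, the fused per-point features would then feed a pose-regression head that predicts the 3D rotation and translation; that stage is omitted here.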