Abstract: Existing grasp detection methods usually rely on data-driven strategies to learn grasping features from labeled data, which restricts their generalization to new scenes and objects. Preliminary studies introduce domain-invariant methods, but these tend to consider only a single visual representation and ignore the widely shared graspable priors of objects across multiple domains. To address this, we propose a novel margin and structure prior guided grasp detection network (MSPNet) that effectively extracts domain-invariant grasping features, leading to more accurate grasps. The structure prior leverages cross-attention to aggregate graspable structural features across multiple scenes into a scene-invariant structural representation. The margin prior employs cosine similarity to measure the difference between foreground and background at the object grasp boundary. Extensive comparative experiments on public datasets (single-domain, cross-domain, and stricter metrics) and in real-world scenarios demonstrate the superiority of the proposed method, especially for complicated backgrounds and cluttered objects. Under cross-domain scenarios, incorporating the prior information improves existing methods by 6.87% and 9.53% on average on the VMRD and GraspNet datasets, respectively. Moreover, under stricter metrics, our MSPNet outperforms state-of-the-art methods by 9.0% and 12.9% on the Cornell and Jacquard datasets, respectively.
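To make the two priors concrete, the sketch below shows one plausible realization in PyTorch: a cross-attention module that aggregates scene features into a scene-invariant structural representation, and a margin term based on cosine similarity between foreground and background features at the grasp boundary. Module names, tensor shapes, and the margin value are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def margin_prior_loss(fg_feat, bg_feat, margin=0.5):
    """Hypothetical margin prior: penalize high cosine similarity between
    foreground and background features sampled at the grasp boundary,
    so the two become separable by at least `margin`."""
    sim = F.cosine_similarity(fg_feat, bg_feat, dim=-1)  # (N,)
    return F.relu(sim - margin).mean()


class StructurePriorAttention(torch.nn.Module):
    """Hypothetical structure prior: learnable prior tokens attend over
    per-scene features (cross-attention) to aggregate graspable structure
    shared across scenes into a scene-invariant representation."""

    def __init__(self, dim, num_tokens=8, heads=4):
        super().__init__()
        self.prior_tokens = torch.nn.Parameter(torch.randn(1, num_tokens, dim))
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, scene_feats):
        # scene_feats: (B, N, C) flattened spatial features of a scene
        queries = self.prior_tokens.expand(scene_feats.size(0), -1, -1)
        out, _ = self.attn(queries, scene_feats, scene_feats)
        return out  # (B, num_tokens, C) scene-invariant structural features
```

In this sketch, the structural representation could be fused back into the grasp head, while the margin loss would be added to the detection objective; both choices are assumptions made only to illustrate the abstract's description.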