StereoAnything: Advanced Zero-Shot Stereo Imaging for Multi-Finger Grasp Detection with Transparent Objects

Kaixin Bai, Lei Zhang, Yiwen Liu, Zhaopeng Chen, Jianwei Zhang

Published: 01 May 2025 · Last Modified: 12 Nov 2025 · License: CC BY-SA 4.0
Abstract: Grasping transparent objects remains challenging for robotic systems because their reflective and refractive properties distort depth perception and introduce background noise. Whereas humans draw on lifelong experience to perceive depth intuitively, robotic algorithms often fail to generalize across object types. To address this, we propose a novel framework, inspired by human perception, for grasping transparent objects. Our approach extends features extracted by foundation models to implicitly learn reconstruction strategies for transparent objects without requiring segmentation priors. Crucially, our framework maintains strong performance across all object types and scenes, preventing catastrophic forgetting on opaque objects while learning to perceive transparent ones. By integrating affordance information, our method dynamically guides a five-finger dexterous hand to execute diverse grasping strategies based on human intent. To tackle the challenge of annotating transparent objects, we constructed a large-scale synthetic dataset with depth information, affordance data, and automated annotations. Our framework demonstrates strong generalization, achieving a 96% grasp success rate in real-world robotic experiments and proving broadly applicable across varied environments.
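The abstract's central design idea — keeping a foundation-model encoder frozen and training only a lightweight reconstruction head, so that performance on opaque objects cannot be degraded while transparent-object perception is learned — can be sketched very roughly as follows. This is a minimal illustration only: the names `frozen_backbone` and `DepthHead` are hypothetical stand-ins, not the paper's actual architecture, and random arrays substitute for real features and images.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_backbone(image):
    # Stand-in for a frozen foundation-model encoder (e.g. a ViT).
    # Its weights are fixed, so training the head below cannot
    # alter the features that already work for opaque objects.
    h, w, _ = image.shape
    return rng.standard_normal((h // 8, w // 8, 64))  # coarse feature map

class DepthHead:
    """Hypothetical lightweight learnable head mapping features to depth."""
    def __init__(self, in_dim=64):
        self.w = rng.standard_normal(in_dim) * 0.01  # the only trainable params
        self.b = 0.0

    def __call__(self, feats):
        # Per-location linear projection to a scalar depth value.
        return feats @ self.w + self.b

image = rng.standard_normal((64, 64, 3))   # dummy RGB input
feats = frozen_backbone(image)             # frozen features, shape (8, 8, 64)
depth = DepthHead()(feats)                 # predicted coarse depth, shape (8, 8)
print(depth.shape)
```

Only the head's parameters would receive gradient updates in such a scheme; the backbone stays untouched, which is one common way to avoid catastrophic forgetting when adapting to a new perception domain.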