Abstract: Human-assistant robots must understand human-object interactions to execute collaborative manipulation tasks described in natural language. While affordance learning addresses this need, current approaches face a fundamental trade-off: 2D methods capture action-relevant object semantics but lack robust 3D geometric reasoning, whereas 3D methods demand labor-intensive point cloud datasets. To bridge this gap, we propose a one-shot 3D affordance learning framework that grounds action verbs directly into 3D Gaussian Splatting representations. Our key insight is that verb-centric affordances from sparse 2D signals can be lifted into view-consistent 3D representations without explicit 3D annotations. Specifically, our framework requires only a single affordance-labeled image per scene during training and zero reference images at inference. Operating directly on natural language, this approach eliminates the need for explicit object- or part-level queries, enabling the rapid inference crucial for multi-stage tasks. Extensive real-world experiments demonstrate that our approach outperforms baselines reliant on 2D affordances or part-level reasoning, particularly in challenging long-horizon multi-stage settings.