Keywords: 3D Affordance, VLM, GRPO, Robotics
Abstract: Task-driven affordance grounding in 3D scenes is crucial for embodied AI agents to identify and operate functional interactive elements (e.g., switches, hinges, handles) and thereby accomplish their objectives. However, current approaches have notable limitations: purely 3D point cloud pipelines struggle to generalize across scenes and categories, while 2D-driven methods guided by generic vision-language models often miss small, functionally distinct parts and produce view-dependent, inconsistent results. We introduce ThinkAfford, a coarse-to-fine RGB-D framework for grounding natural-language instructions to fine-grained 3D affordances in cluttered scenes. The coarse stage uses vision-language reasoning to efficiently prune thousands of frames to a compact set of relevant candidate views, leveraging contextual and relational cues to avoid exhaustive search. The fine stage then focuses on functional parts: it produces affordance-centric proposals that remain stable across viewpoints and employs an instruction-guided selector fine-tuned with Group Relative Policy Optimization (GRPO) to enhance fine-grained spatial reasoning by explicitly rewarding choices that satisfy attribute, relational, and geometric constraints. Experiments on SceneFun3D demonstrate state-of-the-art performance, achieving 14.97% AP25 on the test split, a 70.1% relative improvement over the previous state of the art. Our results show that this structured decomposition, combined with fine-grained spatial reasoning, effectively bridges the gap between high-level language understanding and precise 3D affordance localization. The code will be made available.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8679