RoboAfford: A Dataset and Benchmark for Enhancing Object and Spatial Affordance Learning in Robot Manipulation
Abstract: Robot manipulation is a fundamental capability of embodied intelligence, enabling robots to interact effectively with the physical world. In manipulation tasks, predicting precise grasping positions and object placements is essential. Achieving this requires object recognition to localize target objects, prediction of object affordances for interaction, and prediction of spatial affordances for optimal arrangement. While Vision-Language Models (VLMs) provide strong high-level task planning and scene understanding, they often struggle to predict precise action positions, such as functional grasp points and spatial placements. This limitation stems from the lack of object and spatial affordance annotations in their training data. To address this gap, we introduce RoboAfford, a novel large-scale dataset designed to enhance object and spatial affordance learning in robot manipulation. The dataset comprises 819,987 images paired with 1.9 million question-answering (QA) annotations, covering three critical tasks: object affordance recognition, which identifies objects based on attributes and spatial relationships; object affordance prediction, which pinpoints functional grasping parts; and spatial affordance localization, which identifies free space for placement. Complementing the dataset, we propose RoboAfford-Eval, a comprehensive benchmark for assessing affordance-aware prediction in real-world scenarios, featuring 338 meticulously annotated samples across the same three tasks. Extensive experiments reveal the deficiencies of existing VLMs in affordance learning, while fine-tuning on RoboAfford significantly improves their affordance prediction in robot manipulation, validating the dataset's effectiveness. The VLM fine-tuned on RoboAfford also achieves state-of-the-art performance across various tasks on the proposed benchmark, further confirming its impact on affordance reasoning. The dataset, benchmark, and evaluation code will be made publicly available to facilitate future research.