AffoGato: Learning Open-Vocabulary Affordance Grounding with Foundation Models

ICLR 2026 Conference Submission 14674 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Affordance, Open-Vocabulary, Vision-Language Models, Foundation Models
TL;DR: We introduce AffoGato, an open-vocabulary affordance grounding framework with three stages: automatically generating the Affo-150K dataset, pretraining Gato-3D/2D models on this data, and fine-tuning on downstream benchmarks, where the models show strong open-vocabulary capabilities.
Abstract: Affordance grounding - localizing object regions based on natural language descriptions of interactions - is a key capability for intelligent agents that understand and interact with their environments. The task is difficult because it demands fine-grained localization, must cope with the ambiguity of multiple valid interaction regions, and lacks large-scale datasets. We introduce AffoGato, a unified framework for open-vocabulary affordance grounding in both 3D and 2D. Our approach leverages supervision from foundation models to automatically generate scalable affordance annotations, removing the need for exhaustive manual labeling. As part of this pipeline, we construct Affo-150K, a large automatically generated dataset of 150K 3D object instances with free-form affordance descriptions and corresponding 3D affordance heatmaps. Within AffoGato, we design simple yet effective models, Gato-3D and Gato-2D, which combine pre-trained part-aware vision encoders with text-conditional heatmap decoders. Our models achieve state-of-the-art performance on existing 3D and 2D benchmarks, and pretraining on Affo-150K further enhances their open-vocabulary capabilities.
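
For illustration only, a minimal sketch of what a text-conditional heatmap decoder on top of frozen encoder features might look like. This assumes precomputed per-point (or per-patch) features from a part-aware vision encoder and a pooled text embedding from a language encoder; the class name, dimensions, and fusion-by-concatenation design are assumptions for exposition, not the authors' architecture.

# Hypothetical sketch of a text-conditioned affordance heatmap head.
# Assumes frozen, precomputed visual features and text embeddings;
# all names and sizes are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class TextConditionedHeatmapHead(nn.Module):
    def __init__(self, feat_dim: int, text_dim: int, d_model: int = 256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)  # project per-point visual features
        self.text_proj = nn.Linear(text_dim, d_model)  # project pooled text embedding
        self.mlp = nn.Sequential(                      # fuse and score each point
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, feats: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) per-point/patch features; text: (B, text_dim)
        v = self.feat_proj(feats)                           # (B, N, d_model)
        t = self.text_proj(text).unsqueeze(1).expand_as(v)  # broadcast text to all points
        logits = self.mlp(torch.cat([v, t], dim=-1)).squeeze(-1)  # (B, N)
        return torch.sigmoid(logits)                        # per-point heatmap in [0, 1]

# Usage with random stand-ins for frozen encoder outputs:
head = TextConditionedHeatmapHead(feat_dim=384, text_dim=512)
heatmap = head(torch.randn(2, 1024, 384), torch.randn(2, 512))  # shape (2, 1024)

Broadcasting one text embedding across all points keeps the head lightweight and lets the same decoder serve 3D point features or 2D patch features interchangeably, which is consistent with the abstract's unified 3D/2D framing.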
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14674