Keywords: Affordance, Open-Vocabulary, Vision-Language Models, Foundation Models
TL;DR: We introduce AffoGato, an open-vocabulary affordance grounding framework with three stages: automatic generation of the Affo-150K data, pretraining of Gato-3D/2D models on this data, and fine-tuning, after which the models demonstrate strong open-vocabulary capabilities.
Abstract: .
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14674