VIMEX: A Memory-Centered Task Description Framework for Vision-Based Robotics

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: task description, vision-based robotics, affordances, part detection, object recognition
TL;DR: A framework for describing vision-based tasks to a robot quickly (without training a neural network), easily (without specialized equipment such as optitrack), and efficiently ($\sim$10 RGB images with scribble annotations are sufficient)
Abstract: Robotics holds the potential to automate applications such as farming, construction, and elderly care, making food, shelter, and dignity easily accessible for everyone. This moonshot goal requires deploying robots in environments that are a priori unknown and typically uninstrumented (e.g., without optitrack, external reward/reset mechanisms, or digital twins), such as agricultural fields, construction sites, or private dwellings. It also requires the same robot to perform numerous different tasks within such environments, with each task defining its own notions of what an object is and what constitutes a desirable way of interacting with it (i.e., affordances). Motivated by these considerations, this paper presents a task-description framework called Vimex (i.e., Visual Memex) that allows a user to efficiently describe vision-based robotics tasks and the associated objects, parts, and affordances without requiring specialized equipment or training a deep neural network. Within this framework, arbitrary object definitions, anywhere on the spectrum from specific instances to general categories, are established using a small number of RGB images captured by a consumer camera, while part definitions are established using scribble annotations over these RGB images. Arbitrary metadata (i.e., any form of task-relevant information) is then attached to these annotations to form records stored in a memory. Given an RGBD image of a scene, these records are retrieved to define probability distributions of part locations and metadata over 3D coordinates using an association process based on nearest neighbors. Finally, affordance definitions are established as probabilistic inference routines conditioned on such part and metadata distributions. To demonstrate what these abstractions mean and how they can be used to describe tasks to a robot, experiments that focus on vision-based grasping are presented.
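To make the memory-and-retrieval idea in the abstract concrete, the sketch below illustrates one possible reading of it: annotated pixels from reference RGB images are stored as records (feature, part label, metadata), and scene pixels from an RGBD image are associated to those records via nearest neighbors to yield per-pixel part distributions over 3D coordinates. All names here (`MemoryRecord`, `VimexMemory`, `associate`) are illustrative assumptions rather than the paper's actual API, and random vectors stand in for the visual descriptors a real encoder would produce.

```python
"""Minimal sketch of a memory of annotated records queried by nearest-neighbor
association. Names and feature representation are assumptions, not the
authors' implementation."""
from dataclasses import dataclass, field
from typing import Any

import numpy as np


@dataclass
class MemoryRecord:
    """One annotated pixel from a reference RGB image."""
    feature: np.ndarray                                      # visual descriptor of the pixel
    part_label: str                                          # part name from the scribble annotation
    metadata: dict = field(default_factory=dict)             # arbitrary task-relevant information


class VimexMemory:
    """Stores records and associates them to scene pixels via nearest neighbors."""

    def __init__(self) -> None:
        self.records: list[MemoryRecord] = []

    def add(self, record: MemoryRecord) -> None:
        self.records.append(record)

    def associate(self, scene_features: np.ndarray, scene_xyz: np.ndarray, k: int = 5) -> list[dict]:
        """
        scene_features: (N, D) descriptors for N scene pixels (from the RGB channels).
        scene_xyz:      (N, 3) back-projected 3D coordinates (from the depth channel).
        Returns, for each scene pixel, its 3D location together with a probability
        distribution over part labels built from its k nearest memory records
        (softmax over negative feature distance).
        """
        bank = np.stack([r.feature for r in self.records])                # (M, D)
        dists = np.linalg.norm(scene_features[:, None, :] - bank[None, :, :], axis=-1)  # (N, M)
        nn_idx = np.argsort(dists, axis=1)[:, :k]                         # (N, k) nearest records
        nn_dst = np.take_along_axis(dists, nn_idx, axis=1)                # (N, k) their distances
        weights = np.exp(-nn_dst)
        weights /= weights.sum(axis=1, keepdims=True)                     # normalize per pixel

        results = []
        for i in range(scene_features.shape[0]):
            part_probs: dict[str, float] = {}
            for j, w in zip(nn_idx[i], weights[i]):
                label = self.records[j].part_label
                part_probs[label] = part_probs.get(label, 0.0) + float(w)
            results.append({"xyz": scene_xyz[i], "part_probs": part_probs})
        return results


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = VimexMemory()
    # Hypothetical records: scribble-annotated pixels of a tool's handle and blade.
    for label in ["handle", "blade"]:
        for _ in range(20):
            memory.add(MemoryRecord(feature=rng.normal(size=32),
                                    part_label=label,
                                    metadata={"graspable": label == "handle"}))
    # Hypothetical scene pixels with features and back-projected 3D coordinates.
    scene_feats = rng.normal(size=(4, 32))
    scene_xyz = rng.normal(size=(4, 3))
    for out in memory.associate(scene_feats, scene_xyz):
        print(out["xyz"], out["part_probs"])
```

An affordance such as "graspable here" could then, in the spirit of the abstract, be expressed as an inference routine over these per-pixel part and metadata distributions rather than as a trained network.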
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3845