Learning Generalizable Visual Task Through Interaction

Published: 28 Oct 2023, Last Modified: 04 Dec 2023, GenPlan'23
Abstract: We present a framework for robots to learn novel visual concepts and visual tasks through in-situ linguistic interactions with human users. Previous approaches in computer vision have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts, along with their attributes and representations, to a concept hierarchy. We extend the line of work that learns visual concept hierarchies and take it one step further, demonstrating novel task solving on robots with the learned visual concepts. To enable a visual concept learner to solve robotics tasks one-shot, we develop two distinct techniques. First, we propose a novel approach, Hi-Viscont (HIerarchical VISual CONcept learner for Task), which propagates information about a newly taught concept to its parent nodes within a concept hierarchy. This information propagation allows all concepts in the hierarchy to update as novel concepts are taught in a continual learning setting. Second, we represent a visual task as a scene graph with language annotations, which allows us to create novel permutations of a demonstrated task zero-shot and in-situ. We compare Hi-Viscont with the baseline model (FALCON~\cite{mei2022falcon}) on visual question answering (VQA) in three domains. While comparable to the baseline model on leaf-level concepts, Hi-Viscont achieves an improvement of over 9% on non-leaf concepts on average. Additionally, we provide a demonstration in which a human user interactively teaches the robot visual tasks and concepts. With these results, we demonstrate the ability of our model to learn tasks and concepts in a continual learning setting on the robot.
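The sketch below is an illustrative reading of the two ideas in the abstract, not the authors' implementation: every name (ConceptNode, teach_concept, SceneObject, TaskSceneGraph), the embedding-based representation, and the moving-average propagation rule with its alpha parameter are assumptions introduced here. It shows (1) a newly taught concept pushing information up to its ancestors in a concept hierarchy, and (2) a demonstrated task stored as a language-annotated scene graph whose object concepts can be swapped to produce a novel permutation of the task.

```python
"""Minimal sketch of hierarchical concept propagation and scene-graph tasks.

All class and function names are hypothetical; the update rule is a simple
stand-in for whatever propagation Hi-Viscont actually performs.
"""
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

import numpy as np


@dataclass
class ConceptNode:
    """One concept in the hierarchy, with an assumed embedding representation."""
    name: str
    embedding: np.ndarray
    parent: Optional["ConceptNode"] = None
    children: List["ConceptNode"] = field(default_factory=list)


def teach_concept(hierarchy: Dict[str, ConceptNode],
                  name: str,
                  embedding: np.ndarray,
                  parent_name: str,
                  alpha: float = 0.1) -> ConceptNode:
    """Add a novel leaf concept and propagate its information to its ancestors."""
    parent = hierarchy[parent_name]
    node = ConceptNode(name, embedding, parent=parent)
    parent.children.append(node)
    hierarchy[name] = node

    # Walk up the hierarchy so every ancestor (non-leaf concept) is updated
    # as the new concept is taught, as described in the abstract.
    ancestor = parent
    while ancestor is not None:
        ancestor.embedding = (1 - alpha) * ancestor.embedding + alpha * embedding
        ancestor = ancestor.parent
    return node


@dataclass
class SceneObject:
    concept: str     # e.g. "red block" (hypothetical label)
    annotation: str  # natural-language role, e.g. "base of the structure"


@dataclass
class TaskSceneGraph:
    """A demonstrated task: annotated objects plus relations between them."""
    objects: List[SceneObject]
    relations: List[Tuple[int, str, int]]  # (subject_idx, predicate, object_idx)

    def permute(self, substitutions: Dict[str, str]) -> "TaskSceneGraph":
        """Create a novel variant of the task by swapping object concepts
        (e.g. {"red block": "blue block"}) while keeping the structure."""
        new_objects = [
            SceneObject(substitutions.get(o.concept, o.concept), o.annotation)
            for o in self.objects
        ]
        return TaskSceneGraph(new_objects, list(self.relations))
```

Under these assumptions, a user could demonstrate a structure built from "red block" objects and then request the same structure with "blue block" via `permute({"red block": "blue block"})`, which is the kind of zero-shot task variation the abstract describes.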
Submission Number: 63