MOSAIC: Modular Foundation Models for Assistive and Interactive Cooking

Published: 25 Oct 2024, Last Modified: 03 Nov 2024, 2024 CoRoboLearn Oral, CC BY 4.0
Keywords: Foundation Models, Human-Robot Interaction, Model Learning
TL;DR: MOSAIC is a modular architecture leveraging pre-trained foundation models to enable multiple home robots to collaboratively cook with humans.
Abstract: We present MOSAIC, a modular architecture for coordinating multiple robots to a) interact with users using natural language and b) manipulate an open vocabulary of everyday objects. MOSAIC employs modularity at several levels: it leverages multiple large-scale pre-trained models for high-level tasks like language and image recognition, while using streamlined modules designed for low-level task-specific control. This decomposition allows us to reap the complementary benefits of foundation models as well as precise, more specialized models. Pieced together, our system is able to scale to complex tasks that involve coordinating multiple robots and humans. First, we unit-test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We then extensively evaluate MOSAIC with 60 end-to-end trials. We discuss crucial design decisions, limitations of the current system, and open challenges in this domain. The project’s website is at https://portal-cornell.github.io/MOSAIC/
Supplementary Material: zip
Submission Number: 16