VAFOR: Proactive Voice Assistant for Object Retrieval in the Physical World

Published: 01 Jan 2023 · Last Modified: 11 Nov 2024 · RO-MAN 2023 · CC BY-SA 4.0
Abstract: In this paper, we present a proactive robotic voice assistant with a perceive-reason-act loop that carries out pick-and-place operations based on verbal commands. Unlike existing systems, our robot can retrieve a target object not only when the target is explicitly spelled out, but also when given an indirect command that implicitly reflects the human's intention or emotion. For instance, when the verbal command is “I had a busy day, so I didn’t have much to eat.”, the target object would be something that can help with hunger. To estimate the target object from such indirect commands, our framework consists of separate modules covering the complete perceive-reason-act loop. First, for perception, it runs an object detector on the robot’s onboard computer to detect all objects in the surroundings and records a verbal command from a microphone. Second, for reasoning, the list of available objects and a transcription of the verbal command are integrated into a prompt for a Large Language Model (LLM) in order to identify the target object of the command. Finally, for action, a TurtleBot3 with a 5-DOF robotic arm locates the target object and brings it to the human. Our experiments show that, with a properly designed prompt, the robot can identify the correct target object from implicit commands with up to 97% accuracy. In addition, fine-tuning a language model based on the proposed prompt design process improves the performance of the smallest language model by a factor of five. Our data and code are available at https://github.com/bekatan/vafor
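
The reasoning step described in the abstract can be illustrated with a minimal sketch: the detected object labels and the transcribed command are combined into a single prompt, and the LLM's reply is matched back against the available objects. The prompt wording, the `build_prompt` and `identify_target` helpers, and the `query_llm` callable below are illustrative assumptions, not the authors' released implementation (see the linked repository for the actual code).

```python
# Minimal sketch of the reasoning module: build a prompt from the detected
# objects and the verbal command, query an LLM, and map the reply back to
# one of the detected objects. All names here are illustrative assumptions.

from typing import Callable, List, Optional


def build_prompt(objects: List[str], command: str) -> str:
    """Compose a prompt listing the available objects and the user's command."""
    object_list = ", ".join(objects)
    return (
        f"Available objects: {object_list}.\n"
        f'Command: "{command}"\n'
        "Which single object from the list best satisfies the command? "
        "Answer with the object name only."
    )


def identify_target(
    objects: List[str],
    command: str,
    query_llm: Callable[[str], str],
) -> Optional[str]:
    """Query an LLM with the composed prompt and match its reply to a detected object."""
    reply = query_llm(build_prompt(objects, command)).lower()
    for obj in objects:
        if obj.lower() in reply:
            return obj
    return None  # the reply did not mention any detected object


if __name__ == "__main__":
    detected = ["apple", "water bottle", "remote control"]
    spoken = "I had a busy day, so I didn't have much to eat."

    # Stand-in for a real LLM call (e.g. an API or a fine-tuned local model).
    fake_llm = lambda prompt: "apple"

    print(identify_target(detected, spoken, fake_llm))  # -> "apple"
```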