Abstract: This study presents a control framework that leverages vision-language models (VLMs) for multiple tasks and robots. Existing VLM-based control methods achieve high performance across a variety of tasks and robots within their training environments; however, they incur high costs when learning control policies for tasks and robots beyond those environments. For industrial and household applications, learning in each novel environment into which a robot is introduced is challenging. To address this issue, we propose a control framework that does not require learning control policies. Our framework combines the vision-language model CLIP with a randomized controller. CLIP computes the similarity between images and texts by embedding them in a shared feature space. This study employs CLIP to compute the similarity between camera images and text describing the target state. In our method, the robot is driven by a randomized controller that simultaneously explores and ascends the gradient of this similarity. Moreover, we fine-tune CLIP to improve the performance of the proposed method. We confirm the effectiveness of our approach through a multitask simulation and a real-robot experiment using a two-wheeled robot and a robot arm.
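The following is a minimal sketch of the image-text scoring step described above, assuming the publicly available openai/clip-vit-base-patch32 checkpoint accessed through the Hugging Face transformers library; the checkpoint choice, the helper name clip_similarity, and the example prompt are illustrative assumptions, not the authors' implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative assumption: any public CLIP checkpoint could stand in here.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_similarity(image: Image.Image, target_text: str) -> float:
    """Cosine similarity between a camera image and text describing the target state."""
    inputs = processor(text=[target_text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings and take their dot product (cosine similarity).
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

# A randomized controller can treat this score as a reward-like signal:
# apply small random action perturbations and keep those that increase the
# similarity between the current camera frame and the target-state text.
# frame = Image.open("camera_frame.png")                       # hypothetical capture
# score = clip_similarity(frame, "the robot is next to the red box")  # hypothetical prompt
```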