Imitation Learning for Robot Motions Based on Multimodal Foundation Model and Style Transformation

Masaya Nakano, Masatoshi Nagano, Tomoaki Nakamura

Published in IEEE Access, 2025 (last modified: 02 Apr 2026). License: CC BY-SA 4.0.
Abstract: Recently, robots that coexist with humans have become increasingly common, and they are expected to perform a variety of actions depending on their environment. Imitation learning, in which a robot learns motions via deep learning from human data obtained through motion capture or computer vision, has been widely studied. However, conventional methods are costly because they require a system that converts the instructor's motion information into the robot's joint angles, as well as a dataset in which human and robot postures correspond one-to-one. This study proposes a motion imitation learning method called Motion Imitation and Modification using vIsual Characteristics (MIMIC), which consists of an imitation process and a correction process. In the imitation process, a cycle-consistent generative adversarial network (CycleGAN) learns the visual correspondence between the human and the robot from images of randomly generated poses, and a convolutional neural network (CNN) is trained to translate robot images into joint angles; as a result, the robot can imitate human motion using only visual information. In the correction process, the CNN is fine-tuned to output joint angles that are more appropriate for the task, with the reward for task accomplishment computed using contrastive language-image pretraining (CLIP). This two-process design reduces the cost of collecting data and of learning task-appropriate motions. Experimental results demonstrate that the proposed method completes various tasks more accurately than conventional methods.
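As a concrete illustration of the correction process, the sketch below shows how a CLIP-based reward could score a rendered robot image against a natural-language description of the completed task. This is a minimal sketch under assumptions: the Hugging Face CLIP checkpoint, the cosine-similarity reward, and the example prompt are illustrative choices, not details taken from the paper.

```python
# Hedged sketch: CLIP similarity between a robot image and a task prompt
# as a task-completion reward. Checkpoint, prompt, and reward definition
# are assumptions for illustration, not the paper's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(robot_image: Image.Image, task_prompt: str) -> float:
    """Score how well the robot image matches the task description."""
    inputs = processor(text=[task_prompt], images=robot_image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # CLIPModel returns L2-normalized projected embeddings, so the dot
    # product is the cosine similarity in [-1, 1]; use it as the reward.
    image_emb = outputs.image_embeds  # shape (1, d)
    text_emb = outputs.text_embeds    # shape (1, d)
    return (image_emb @ text_emb.T).item()

# Example usage: evaluate one frame of the robot's motion.
# reward = clip_reward(Image.open("robot_frame.png"),
#                      "a robot arm grasping a red cube")
```

A reward of this form can be fed to any fine-tuning signal (e.g., a policy-gradient update on the CNN's joint-angle outputs), which is one plausible way the correction process could be realized.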