Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning

Published: 07 May 2025, Last Modified: 07 May 2025, ICRA Workshop on Human-Centered Robot Learning, CC BY 4.0
Workshop Statement: Our work, presented in Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning, directly supports the workshop's theme of advancing human-centered robot learning through large-scale, multimodal models. We focus on independently predicting human intentions and robot actions from RGB video and voxelized RGB-D inputs, respectively, both framed within the same high-level manipulation task. By learning semantic correspondences across modalities, our approach lays the groundwork for future cross-modal alignment between human and robot behaviors. On the RH20T dataset, we report test accuracies of 71.67% for human intention recognition and 71.8% for robot action classification, demonstrating the model's ability to handle diverse interaction patterns, a key challenge posed by the workshop. We also examine challenges arising from class imbalance and limited temporal context, underscoring the need for improved data acquisition strategies such as synthetic augmentation and temporal voxel encoding. Our contributions help inform the development of multimodal learning pipelines for shared human-robot task contexts in domains such as household assistance and healthcare.
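
As a concrete illustration of the voxelized RGB-D input mentioned above, the following is a minimal sketch of how a single RGB-D frame could be quantized into a dense RGB-plus-occupancy voxel grid. This is an illustrative assumption, not the paper's implementation: the function name, workspace bounds, and grid resolution are placeholders.

```python
import numpy as np

def voxelize_rgbd(depth, rgb, intrinsics, bounds, grid_size=64):
    """Back-project one RGB-D frame into a dense (grid_size^3, 4) RGB + occupancy voxel grid.

    depth:      (H, W) depth map in meters
    rgb:        (H, W, 3) color image in [0, 1]
    intrinsics: (3, 3) pinhole camera matrix
    bounds:     (3, 2) per-axis workspace limits [[x0, x1], [y0, y1], [z0, z1]]
    NOTE: illustrative sketch only; not the authors' pipeline.
    """
    H, W = depth.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    # Back-project every pixel to 3D camera coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)

    # Keep only points inside the workspace bounds.
    mask = np.all((points >= bounds[:, 0]) & (points <= bounds[:, 1]), axis=1)
    points, colors = points[mask], colors[mask]

    # Quantize the surviving points into voxel indices.
    voxel_size = (bounds[:, 1] - bounds[:, 0]) / grid_size
    idx = ((points - bounds[:, 0]) / voxel_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)

    # Fill a dense grid: 3 RGB channels + 1 occupancy channel.
    grid = np.zeros((grid_size, grid_size, grid_size, 4), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2], :3] = colors
    grid[idx[:, 0], idx[:, 1], idx[:, 2], 3] = 1.0
    return grid
```

Extending this per-frame grid with a time axis is one way the temporal voxel encoding mentioned above could be realized.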
Keywords: Human Robot Correspondence, Multimodal Learning
TL;DR: Our framework aligns human and robot "pick and place" behaviors using RGB video and voxelized RGB-D demos, reaching roughly 72% accuracy on both tasks after training on the RH20T dataset.
Abstract: Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video alongside robot demonstrations in voxelized RGB-D space. Focusing on the “pick and place” task from the RH20T dataset, we utilize data from 7 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling and a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67\% accuracy, and the robot model achieves 71.8\% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
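
To make the two-branch architecture described in the abstract concrete, below is a minimal PyTorch sketch: a ResNet-based encoder with temporal average pooling for human intention classification, and a Perceiver-style model in which learned latent queries cross-attend to flattened voxel tokens for robot action classification. Class names, dimensions, and the pooling choice are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class HumanIntentionModel(nn.Module):
    """ResNet backbone over RGB frames, averaged over time, then a linear classifier."""
    def __init__(self, num_intentions):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()                       # keep 512-d features
        self.backbone = backbone
        self.head = nn.Linear(512, num_intentions)

    def forward(self, frames):                            # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))       # (B*T, 512)
        feats = feats.view(B, T, -1).mean(dim=1)          # temporal average pooling
        return self.head(feats)

class RobotActionModel(nn.Module):
    """Perceiver-style model: latent queries cross-attend to flattened voxel tokens."""
    def __init__(self, num_actions, voxel_channels=4, dim=256, num_latents=64):
        super().__init__()
        self.tokenize = nn.Linear(voxel_channels, dim)    # per-voxel embedding
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, num_actions)

    def forward(self, voxels):                            # voxels: (B, D, D, D, C)
        B = voxels.shape[0]
        tokens = self.tokenize(voxels.flatten(1, 3))      # (B, D^3, dim)
        latents = self.latents.expand(B, -1, -1)          # (B, num_latents, dim)
        latents, _ = self.cross_attn(latents, tokens, tokens)
        latents = self.self_attn(latents)
        return self.head(latents.mean(dim=1))
```

Under these assumptions, each branch is trained independently with a standard cross-entropy loss on its own label set, consistent with the independent prediction setup described in the workshop statement.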
Submission Number: 32