Keywords: Representation Learning; Robotic Manipulation; Robot Learning; Vision-Language-Action
Abstract: Vision-Language-Action (VLA) models have become a key framework for robotics, coupling multimodal perception with language-grounded decision making to enable cross-task generalization, dynamic interaction, and long-horizon planning. However, despite training on large-scale video and trajectory data, prevailing VLAs are predominantly imitation-driven and lack an intrinsic spatiotemporal understanding of physical actions; as a result, their generalization degrades on unseen embodiments and contexts. In parallel, existing action-understanding approaches still fail to model temporally correlated action semantics, and they suffer from visual feature entanglement among the robot, the manipulated objects, and the background, which hinders the extraction of clean atomic-action representations and reliable transfer.
We present RoboAct-CLIP, which addresses both issues with two components: (1) a curated single-action training set distilled from open-source robot videos via semantics-constrained action-unit segmentation and re-annotation, yielding purified clips that each contain a single atomic action (e.g., "grasp"); and (2) a temporal-decoupling architecture built on a CLIP backbone. Concretely, a frozen CLIP visual encoder processes uniformly sampled frames. A Temporal Diff-Transformer then operates on consecutive feature differences together with a start-end delta; the former emphasizes spatiotemporal dynamics, while the latter summarizes the action outcome. The fused representation is routed into subject, object, and action branches under orthogonality constraints, and a compositional contrastive objective aligns the branch-wise visual features with templated texts. An additional recombination alignment loss between remixed branch features and their corresponding texts further strengthens disentanglement.
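To make the temporal-decoupling design concrete, the following is a minimal PyTorch sketch of how these pieces might fit together; the module names, feature dimension, CLS-token fusion, and exact loss forms are our own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDiffEncoder(nn.Module):
    """Hypothetical sketch: attends over consecutive CLIP-feature differences
    plus a start-end delta, then routes a fused summary token into
    subject / object / action branches."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.diff_transformer = nn.TransformerEncoder(block, layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable summary token
        # one projection head per factor (subject / object / action)
        self.branches = nn.ModuleDict(
            {k: nn.Linear(dim, dim) for k in ("subject", "object", "action")}
        )

    def forward(self, frame_feats):  # frame_feats: (B, T, D) frozen-CLIP features
        diffs = frame_feats[:, 1:] - frame_feats[:, :-1]    # consecutive dynamics
        delta = frame_feats[:, -1:] - frame_feats[:, :1]    # start-end outcome
        tokens = torch.cat(
            [self.cls.expand(len(frame_feats), -1, -1), diffs, delta], dim=1
        )
        fused = self.diff_transformer(tokens)[:, 0]         # CLS summary, (B, D)
        return {k: F.normalize(h(fused), dim=-1) for k, h in self.branches.items()}


def orthogonality_loss(branches):
    """Penalize cosine similarity between features of different branches."""
    keys = list(branches)
    loss = 0.0
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            loss = loss + (branches[keys[i]] * branches[keys[j]]).sum(-1).pow(2).mean()
    return loss


def contrastive_loss(visual, text, tau=0.07):
    """Symmetric InfoNCE between branch-wise visual and text embeddings."""
    logits = visual @ text.t() / tau  # (B, B) similarity matrix
    labels = torch.arange(len(visual), device=visual.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

Under this reading, the recombination alignment loss could be realized by reusing `contrastive_loss` on remixed branch features (e.g., the action feature of one clip combined with the subject and object features of another) against the correspondingly re-templated texts.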
Used as a frozen backbone, RoboAct-CLIP supports lightweight policy heads and reduces per-task tuning. On the LIBERO and Franka Kitchen simulation benchmarks, RoboAct-CLIP improves success rates by 12% and 5.1%, respectively, over strong VLA baselines, and it generalizes better on multi-object and unseen tasks. Real-world evaluations on a single physical robot arm confirm stable atomic-action execution, with RoboAct-CLIP kept frozen and only the downstream policy adapted using task-specific data collected on the same platform. These results indicate that explicit temporal modeling combined with factorized action/object/agent representations offers a simple, scalable path to more reliable VLA-based manipulation.
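As a minimal illustration of this frozen-backbone usage pattern, the sketch below reuses the hypothetical TemporalDiffEncoder from above; the head architecture, the 7-dimensional action space, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """Lightweight MLP mapping frozen backbone features to robot actions."""

    def __init__(self, feat_dim=3 * 512, action_dim=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),  # e.g., 6-DoF pose delta + gripper
        )

    def forward(self, feats):
        return self.net(feats)

backbone = TemporalDiffEncoder()  # from the sketch above, kept frozen
backbone.requires_grad_(False)
backbone.eval()

head = PolicyHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)  # head-only updates

frame_feats = torch.randn(4, 8, 512)  # dummy frozen-CLIP frame features (B, T, D)
with torch.no_grad():
    branches = backbone(frame_feats)  # dict of (B, 512) branch features
feats = torch.cat([branches[k] for k in ("subject", "object", "action")], dim=-1)
actions = head(feats)  # (4, 7) predicted actions
```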
Primary Area: applications to robotics, autonomy, planning
Submission Number: 19239