Keywords: Representation Learning; Robotic Manipulation; Robot Learning; Vision-Language-Action
Abstract: Vision-Language-Action (VLA) models have become a key framework for robotics, coupling multimodal perception with language-grounded decision making to enable cross-task generalization, dynamic interaction, and long-horizon planning. However, despite training on large-scale video and trajectory data, prevailing VLAs are predominantly imitation-driven and lack an intrinsic spatiotemporal understanding of physical actions; as a result, their generalization degrades on unseen embodiments and contexts. In parallel, existing action-understanding approaches still fail to model temporally correlated action semantics, and they suffer from visual feature entanglement among the robot, the manipulated objects, and the background, which hinders the extraction of clean atomic-action representations and reliable transfer.
We present RoboAct-CLIP, which addresses both issues with two components: (1) a curated single-action training set distilled from open-source robot videos via semantics-constrained action-unit segmentation and re-annotation, yielding purified clips that each contain a single atomic action (e.g., "grasp"); and (2) a temporal-decoupling architecture built on a CLIP backbone. Concretely, a frozen CLIP visual encoder processes uniformly sampled frames. A Temporal Diff-Transformer then operates on consecutive feature differences together with a start-end delta; the former emphasizes spatiotemporal dynamics, while the latter summarizes the action outcome. The fused representation is routed into subject, object, and action branches under orthogonality constraints, and a compositional contrastive objective aligns the branch-wise visual features with templated texts. An additional recombination alignment loss between remixed branch features and their corresponding texts further strengthens disentanglement.
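To make the temporal-decoupling design concrete, the following is a minimal PyTorch sketch of how these pieces might fit together; the module names, feature dimension, CLS-token fusion, and exact loss forms are our own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDiffEncoder(nn.Module):
    """Hypothetical sketch: attends over consecutive CLIP-feature differences
    plus a start-end delta, then routes a fused summary token into
    subject / object / action branches."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.diff_transformer = nn.TransformerEncoder(block, layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable summary token
        # one projection head per factor (subject / object / action)
        self.branches = nn.ModuleDict(
            {k: nn.Linear(dim, dim) for k in ("subject", "object", "action")}
        )

    def forward(self, frame_feats):  # frame_feats: (B, T, D) frozen-CLIP features
        diffs = frame_feats[:, 1:] - frame_feats[:, :-1]    # consecutive dynamics
        delta = frame_feats[:, -1:] - frame_feats[:, :1]    # start-end outcome
        tokens = torch.cat(
            [self.cls.expand(len(frame_feats), -1, -1), diffs, delta], dim=1
        )
        fused = self.diff_transformer(tokens)[:, 0]         # CLS summary, (B, D)
        return {k: F.normalize(h(fused), dim=-1) for k, h in self.branches.items()}


def orthogonality_loss(branches):
    """Penalize cosine similarity between features of different branches."""
    keys = list(branches)
    loss = 0.0
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            loss = loss + (branches[keys[i]] * branches[keys[j]]).sum(-1).pow(2).mean()
    return loss


def contrastive_loss(visual, text, tau=0.07):
    """Symmetric InfoNCE between branch-wise visual and text embeddings."""
    logits = visual @ text.t() / tau  # (B, B) similarity matrix
    labels = torch.arange(len(visual), device=visual.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

Under this reading, the recombination alignment loss could be realized by reusing `contrastive_loss` on remixed branch features (e.g., the action feature of one clip combined with the subject and object features of another) against the correspondingly re-templated texts.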
Used as a frozen backbone, RoboAct-CLIP supports lightweight policy heads and reduces per-task tuning. On the LIBERO and Franka Kitchen simulation benchmarks, RoboAct-CLIP improves success rates by 12% and 5.1%, respectively, over strong VLA baselines, and it generalizes better on multi-object and unseen tasks. Real-world evaluations on a single physical robot arm confirm stable atomic-action execution, with RoboAct-CLIP kept frozen and only the downstream policy adapted using task-specific data collected on the same platform. These results indicate that explicit temporal modeling combined with factorized action/object/agent representations offers a simple, scalable path to more reliable VLA-based manipulation.
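As a minimal illustration of this frozen-backbone usage pattern, the sketch below reuses the hypothetical TemporalDiffEncoder from above; the head architecture, the 7-dimensional action space, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """Lightweight MLP mapping frozen backbone features to robot actions."""

    def __init__(self, feat_dim=3 * 512, action_dim=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),  # e.g., 6-DoF pose delta + gripper
        )

    def forward(self, feats):
        return self.net(feats)

backbone = TemporalDiffEncoder()  # from the sketch above, kept frozen
backbone.requires_grad_(False)
backbone.eval()

head = PolicyHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)  # head-only updates

frame_feats = torch.randn(4, 8, 512)  # dummy frozen-CLIP frame features (B, T, D)
with torch.no_grad():
    branches = backbone(frame_feats)  # dict of (B, 512) branch features
feats = torch.cat([branches[k] for k in ("subject", "object", "action")], dim=-1)
actions = head(feats)  # (4, 7) predicted actions
```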
Primary Area: applications to robotics, autonomy, planning
Submission Number: 19239