Learning Skills from Action-Free Videos

ICLR 2026 Conference Submission 14397 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: robotics, learning from videos, skill abstractions
Abstract: Learning from videos holds great promise for enabling generalist robots by leveraging diverse visual data beyond traditional robot datasets. Videos often contain recurring skills (e.g., grasping, lifting) across different tasks and environments. While skill-based methods can acquire reusable behaviors, they typically rely on clean, action-labeled data, which limits their applicability to action-free video sources. On the other hand, existing learning-from-video methods often train monolithic models or focus on single-step dynamics, reducing their ability to extract and compose skills for efficient multitask learning and long-horizon planning. In this work, we introduce Skill Abstraction from Optical Flow (SOF), a framework for skill learning directly from action-free videos. To overcome the absence of action labels, we propose using optical flow as a surrogate for action and adapting existing skill-learning algorithms to operate on flow-based representations. Our model learns to plan in the skill space and translates these flow-based plans into executable actions. Experiments show that our approach consistently improves performance in both multitask and long-horizon settings, demonstrating the ability to acquire and compose skills directly from raw visual data.
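The abstract gives only a high-level description of SOF, so the following is a minimal, hypothetical sketch of the core idea as stated: optical flow between consecutive frames serves as an action surrogate, a skill encoder compresses flow sequences into skill latents, and a separate action decoder translates skills into executable robot actions. All module names, network sizes, and interfaces below are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: flow-as-action-surrogate skill learning.
# Module names, architectures, and dimensions are assumed, not from the paper.
import torch
import torch.nn as nn

class FlowSurrogate(nn.Module):
    """Placeholder flow estimator; in practice an off-the-shelf optical-flow
    model would predict a 2-channel flow field between frame t and frame t+1."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 2, kernel_size=3, padding=1)

    def forward(self, frame_t, frame_tp1):
        return self.net(torch.cat([frame_t, frame_tp1], dim=1))  # (B, 2, H, W)

class SkillEncoder(nn.Module):
    """Compresses a short window of flow fields into a single skill latent z."""
    def __init__(self, skill_dim=32):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(2, 16, 4, stride=4), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.GRU(input_size=16, hidden_size=skill_dim, batch_first=True)

    def forward(self, flow_seq):                      # (B, T, 2, H, W)
        B, T = flow_seq.shape[:2]
        feats = self.conv(flow_seq.flatten(0, 1)).view(B, T, -1)
        _, h = self.rnn(feats)
        return h[-1]                                  # (B, skill_dim)

class ActionDecoder(nn.Module):
    """Maps (current observation features, skill latent) to a robot action;
    this is the part that would need some action-labeled robot data."""
    def __init__(self, obs_dim=64, skill_dim=32, action_dim=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(obs_dim + skill_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim))

    def forward(self, obs_feat, skill):
        return self.mlp(torch.cat([obs_feat, skill], dim=-1))

if __name__ == "__main__":
    frames = torch.rand(1, 9, 3, 64, 64)              # a short action-free clip
    flow_model, enc = FlowSurrogate(), SkillEncoder()
    flows = torch.stack([flow_model(frames[:, t], frames[:, t + 1])
                         for t in range(frames.shape[1] - 1)], dim=1)
    z = enc(flows)                                     # skill latent from flow only
    action = ActionDecoder()(torch.rand(1, 64), z)     # translate skill into action
    print(z.shape, action.shape)
```

In this reading, skill latents can be learned entirely from action-free video (frames and flow), while only the final decoding step requires robot actions; planning would then operate over sequences of skill latents rather than raw actions.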
Primary Area: applications to robotics, autonomy, planning
Submission Number: 14397