CSO: Refining Robotic Policies via Skill Distribution Alignment and Skill-Grained Optimization

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Embodied AI, Imitation Learning, Reinforcement Learning, Latent Variable Models
TL;DR: We propose Cascaded Skills Optimization, a two-stage framework that first aligns a skill prior via rejection-sampling supervised fine-tuning, then refines it with skill-level policy optimization.
Abstract: Discretizing continuous actions into skills with methods such as VQ-VAE has emerged as a powerful paradigm for robotic manipulation. However, the quantization error introduced by this discretization yields a suboptimal training distribution for the skill prior, degrading its performance. While reinforcement learning offers a path to refinement, applying it directly is challenging: it suffers from unstable encoder updates and a granularity dilemma in importance sampling. To address these challenges, we introduce Cascaded Skills Optimization (CSO), a two-stage post-training framework. First, to rectify the initial policy's suboptimal distribution, CSO employs Rejection-Sampling Supervised Fine-tuning, which aligns the model's observation-to-skill mapping with the distribution of successful online trajectories. Second, to resolve the granularity dilemma, CSO introduces Skills Policy Optimization, which computes an independent, clipped importance ratio for each skill, enabling more stable and efficient updates. Our post-training strategy delivers highly competitive performance on challenging benchmarks such as LIBERO and MetaWorld, and its effectiveness is further validated on a physical robot.
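To make the two stages concrete, below is a minimal Python sketch of the ideas as described in the abstract: a rejection-sampling filter that keeps only successful rollouts for supervised fine-tuning, and a PPO-style surrogate loss that clips an independent importance ratio per skill. This is an illustrative reading under stated assumptions, not the paper's actual implementation; all names (`rollouts`, the `success` flag, `clip_eps`) are hypothetical.

```python
import numpy as np

def rejection_sample(rollouts):
    """Stage 1 (sketch): keep only successful online trajectories.

    The surviving (observation, skill) pairs would then serve as
    supervised fine-tuning targets for the skill prior.
    `rollouts` and the `success` flag are assumed structures.
    """
    return [traj for traj in rollouts if traj["success"]]

def skill_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Stage 2 (sketch): clipped surrogate with one ratio per skill.

    Each skill gets its own importance ratio, rather than a single
    trajectory-level ratio; this is one plausible reading of the
    "independent, clipped importance ratio for each skill".
    Inputs are per-skill log-probabilities and advantage estimates.
    """
    ratios = np.exp(logp_new - logp_old)          # independent per-skill ratio
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic PPO-style objective, averaged over skills.
    return -np.mean(np.minimum(ratios * advantages, clipped * advantages))

# Hypothetical usage with dummy numbers:
loss = skill_policy_loss(
    logp_new=np.array([-1.1, -0.7]),
    logp_old=np.array([-1.0, -0.9]),
    advantages=np.array([0.5, -0.2]),
)
```

Clipping at the skill level, as sketched here, sits between a single trajectory-level ratio (too coarse to credit individual skills) and per-action ratios (unavailable once actions are quantized into discrete skills), which is one way to interpret the granularity dilemma the abstract describes.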
Primary Area: applications to robotics, autonomy, planning
Submission Number: 10278