Learning Diffusion Policy from Primitive Skills for Robot Manipulation

Zhihao Gu, Ming Yang, Difan Zou, Dong Xu

Published: 13 Mar 2026, Last Modified: 14 May 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: Diffusion policies (DP) have recently shown great promise for generating actions in robotic manipulation. However, existing approaches often rely on global instructions to produce short-term control signals, which can result in misalignment in action generation. We conjecture that primitive skills, defined as fine-grained, short-horizon manipulations such as “move up” and “open the gripper”, provide a more intuitive and effective interface for robot learning. To bridge this gap, we propose SDP, a skill-conditioned DP that integrates interpretable skill learning with conditional action planning. SDP abstracts eight reusable primitive skills across tasks and employs a vision-language model to extract discrete representations from visual observations and language instructions. Based on these representations, a lightweight router network assigns a desired primitive skill to each state, which helps construct a single-skill policy that generates skill-aligned actions. By decomposing complex tasks into a sequence of primitive skills and selecting a single-skill policy, SDP ensures skill-consistent behavior across diverse tasks. Extensive experiments on two challenging simulation benchmarks and real-world robot deployments demonstrate that SDP consistently outperforms state-of-the-art (SOTA) methods, providing a new paradigm for skill-based robot learning with diffusion policies.
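
The route-then-act control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the linear router, the function names (`route`, `skill_conditioned_policy`, `select_action`), the feature vector standing in for the vision-language representation, and the toy refinement loop standing in for a trained diffusion denoiser are all assumptions made for illustration. Only the count of eight skills and the two skill names "move up" and "open the gripper" come from the abstract; the remaining names are placeholders.

```python
import random
from typing import List, Sequence, Tuple

NUM_SKILLS = 8  # the abstract states eight reusable primitive skills

# Only the first two names appear in the abstract; the rest are placeholders.
SKILL_NAMES: List[str] = ["move up", "open the gripper"] + [
    f"skill_{i}" for i in range(2, NUM_SKILLS)
]


def route(features: Sequence[float], weights: Sequence[Sequence[float]]) -> int:
    """Hypothetical linear router: score every skill, return the argmax index."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return max(range(len(scores)), key=lambda i: scores[i])


def skill_conditioned_policy(
    skill_id: int, rng: random.Random, action_dim: int = 2, steps: int = 10
) -> List[float]:
    """Toy stand-in for a skill-conditioned denoiser (NOT a trained diffusion model).

    Starts from Gaussian noise and moves halfway toward a skill-dependent
    target at each step, mimicking only the iterative-refinement shape.
    """
    target = [float(skill_id)] * action_dim  # assumed toy target per skill
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    for _ in range(steps):
        action = [a + 0.5 * (t - a) for a, t in zip(action, target)]
    return action


def select_action(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    rng: random.Random,
) -> Tuple[str, List[float]]:
    """Assign one skill to the current state, then generate an action for it."""
    skill_id = route(features, weights)
    return SKILL_NAMES[skill_id], skill_conditioned_policy(skill_id, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Random router weights and features, purely for demonstration.
    weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(NUM_SKILLS)]
    features = [rng.uniform(-1, 1) for _ in range(4)]
    print(select_action(features, weights, rng))
```

The point of the sketch is the ordering: a single discrete skill is chosen per state before any action is generated, so the action generator is conditioned on one skill at a time rather than on the global instruction directly.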