Keywords: Sub-task Discovery, Robotic Learning
TL;DR: RDD is a controllable and efficient visual sub-task labelling algorithm.
Abstract: Long-horizon embodied agents increasingly combine foundation-model planners with low-level visuomotor policies, but the two levels are often trained from differently segmented data. In hierarchical vision-language-action (VLA) frameworks, a vision-language model (VLM)-based planner must decompose complex manipulation tasks into simpler sub-tasks that the low-level policy can execute. Finetuning such planners for a new task requires demonstrations segmented into sub-tasks, yet human annotation is expensive and heuristic segmentations can deviate from the visuomotor policy's training distribution, degrading embodied decision-making.
We propose a Retrieval-based Demonstration Decomposer (RDD), a training-free method that decomposes video demonstrations by retrieving visually similar sub-task intervals from the low-level policy's training data. RDD formulates sub-task identification as an optimal partitioning problem and solves it efficiently with dynamic programming, directly aligning the planner's finetuning data with the policy's learned capabilities. Experiments on simulation and real-world manipulation benchmarks show that RDD outperforms state-of-the-art heuristic decomposition methods and improves planner-policy coordination for long-horizon embodied tasks.
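The optimal-partitioning formulation mentioned above can be illustrated with a generic dynamic program. This is a minimal sketch under stated assumptions, not the authors' implementation: `optimal_partition`, `interval_score`, `min_len`, `max_len`, and the toy data are hypothetical names introduced here. In RDD the interval score would come from retrieval similarity against the low-level policy's training data; the toy scorer below merely rewards intervals whose frames share one label.

```python
from typing import Callable, List, Optional, Tuple


def optimal_partition(
    n_frames: int,
    interval_score: Callable[[int, int], float],
    min_len: int = 1,
    max_len: Optional[int] = None,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split frames [0, n_frames) into contiguous intervals with maximal total score.

    interval_score(start, end) scores the half-open interval [start, end).
    It is a stand-in (assumption) for a retrieval-based similarity score.
    """
    if max_len is None:
        max_len = n_frames
    neg_inf = float("-inf")
    # best[e] = best total score of any valid partition of frames [0, e)
    best = [neg_inf] * (n_frames + 1)
    # back[e] = start index of the last interval in that best partition
    back = [-1] * (n_frames + 1)
    best[0] = 0.0

    for end in range(1, n_frames + 1):
        first_start = max(0, end - max_len)
        last_start = end - min_len
        for start in range(first_start, last_start + 1):
            if best[start] == neg_inf:
                continue  # prefix [0, start) cannot be partitioned
            candidate = best[start] + interval_score(start, end)
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    if best[n_frames] == neg_inf:
        raise ValueError("no valid partition under the length constraints")

    # Walk the back-pointers to recover the interval boundaries.
    segments: List[Tuple[int, int]] = []
    end = n_frames
    while end > 0:
        start = back[end]
        segments.append((start, end))
        end = start
    segments.reverse()
    return best[n_frames], segments


# Toy data (assumption): one hypothetical sub-task label per frame.
toy_frames = [0, 0, 0, 1, 1, 2, 2, 2]


def toy_score(start: int, end: int) -> float:
    """Reward single-label intervals, penalize mixed ones (illustrative only)."""
    length = end - start
    homogeneous = len(set(toy_frames[start:end])) == 1
    return float(length ** 2) if homogeneous else -float(length ** 2)


total, segments = optimal_partition(len(toy_frames), toy_score)
```

The double loop evaluates each candidate interval once, so the cost is O(n · max_len) score evaluations for n frames; bounding `max_len` is one way to keep the search tractable on long demonstrations.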
Submission Number: 18