Iterative Compositional Data Generation for Robot Control

TMLR Paper 6906 Authors

08 Jan 2026 (modified: 10 Apr 2026) · Decision pending for TMLR · CC BY 4.0
Abstract: Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
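The iterative self-improvement procedure described in the abstract can be sketched as a simple loop: generate synthetic transitions for each unseen task, validate the resulting dataset (in the paper, via offline reinforcement learning), and incorporate only datasets that pass a quality threshold into subsequent training rounds. The toy sketch below illustrates this control flow only; all names (`self_improvement`, `toy_generate`, `toy_evaluate`) and the numeric stand-ins are hypothetical assumptions, not the authors' code.

```python
def self_improvement(generate, evaluate, tasks, iterations, threshold=0.5):
    """Toy sketch of the iterative self-improvement loop: synthesize data
    per task, validate it with a quality proxy (standing in for offline-RL
    policy evaluation), and keep only datasets that pass the threshold."""
    accepted = {}                      # task -> validated synthetic dataset
    for it in range(iterations):
        for task in tasks:
            if task in accepted:
                continue               # task already covered by good data
            data = generate(task, it)  # zero-shot synthetic transitions
            if evaluate(data) >= threshold:
                accepted[task] = data  # fold into later training rounds
    return accepted

# Toy stand-ins: synthetic data quality improves over iterations,
# mimicking the model being retrained on validated data each round.
def toy_generate(task, iteration):
    return {"task": task, "quality": 0.3 + 0.2 * iteration}

def toy_evaluate(data):
    return data["quality"]

coverage = self_improvement(toy_generate, toy_evaluate,
                            tasks=["push", "pick", "stack"], iterations=3)
print(len(coverage))  # → 3: all toy tasks pass after enough iterations
```

The `accepted` dictionary plays the role of the "task coverage" metric discussed in the revision summary: the fraction of tasks whose synthetic datasets pass the quality filter and enter training.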
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We summarize our key changes here and provide individual responses to the specific questions raised by the reviewers (additions are highlighted in blue and deletions in red in the revised manuscript):

1. Stability of self-improvement (Figures 6 and 13, top middle): We report the raw per-iteration mean success rate in Fig. 6 (top middle) and Fig. 13 for the scaling setting. This metric shows that improvements are consistent across iterations and are not driven by best-of selection or accumulated effects.

2. Task coverage and data quality (Figures 6 and 13, bottom middle): We additionally report task coverage (the number of high-quality datasets) in Fig. 6 and Fig. 13 (bottom middle). Coverage measures the fraction of tasks for which synthetic datasets pass the quality threshold and are incorporated into training. This demonstrates that performance is driven not by occasional successes on many tasks but by the consistent generation of reliable datasets that are useful for downstream learning.

3. Attention analysis (Section 4.4): We extend Fig. 9 to the full 8-layer model across iterations. The learned dependency structure remains stable throughout training, with a consistent importance ordering (robot > goal > object > obstacle), indicating preserved and progressively refined compositional relationships.

4. Environment interaction efficiency (Section 4.5): To directly address the cost of environment interaction raised by the reviewers, we quantify the number of interactions required by our iterative filtering procedure and compare it to RLPD (Ball et al., 2023), a standard offline-to-online RL baseline. RLPD uses all 14 training datasets as offline data together with online interactions from the test tasks; even so, our method solves tasks within 20k environment steps, while RLPD fails completely after 100k environment steps per task, demonstrating substantially improved interaction efficiency.

5. Utility of low-success datasets (Section 4.6): We show that datasets containing even rare successful trajectories significantly accelerate RL. In contrast, online RL without such data fails to achieve any success even after an order of magnitude more interaction steps, highlighting the importance of broad task coverage.

6. Full-scale experiments (Section 4.7): We move the 56/256 results to the main paper and clarify that both the monolithic baseline and our semantic compositional architecture achieve higher initial zero-shot performance in this setting due to increased task diversity, and exhibit similar coverage across iterations. The semantic compositional model nevertheless achieves consistently higher success due to better data quality, demonstrating that our approach scales robustly to larger task sets. Note that the 14/64 setting is more challenging due to increased combinatorial sparsity, leading to a larger performance gap between methods.

7. Related work clarification (Section 5): We expand comparisons to prior world-modeling and video-generation approaches, emphasizing that our method generates full transitions for policy learning on unseen tasks, rather than focusing on data collection, visual diversity within single tasks, or planner-based usage.

8. Model collapse discussion (Section 3.3): We soften the claim that the architecture prevents model collapse, clarifying that the modular architecture localizes task-specific updates, which mitigates the propagation of low-quality data rather than fully preventing collapse.

9. Real-world relevance (Section 6): We expand the discussion of applicability to robotics, highlighting advantages in simulation settings and the potential to scale sim-to-real transfer by accelerating expert policy acquisition for downstream distillation. We also outline extensions to visual observations, which we believe are out of scope for the present manuscript.

10. Compute cost analysis (Appendix B4): We add a detailed large-scale cost analysis.
Assigned Action Editor: ~Dennis_J._N._J._Soemers1
Submission Number: 6906