Keywords: Vision Language Action (VLA), Pretraining, Community datasets, Robotic manipulation
TL;DR: We introduce HeteroMotion, a 1,050-trajectory dataset that increases motion diversity for community VLA pretraining. Scaling experiments and a joint variance analysis show improved downstream performance and broader joint coverage.
Abstract: Vision-Language-Action (VLA) models have recently benefited from large-scale pretraining strategies inspired by advances in language and vision. However, unlike the text and image domains, robotics remains constrained by limited embodied interaction data. While community efforts such as the LeRobot community datasets have improved accessibility and standardization, existing datasets are often dominated by structurally similar, mostly pick-and-place tasks, limiting the diversity of motion primitives. In this work, we investigate the role of motion diversity in scaling community-collected VLA pretraining data. We introduce a manipulation dataset, called HeteroMotion, comprising 15 tasks across five behavior categories and 1,050 trajectories, designed to expand action-space coverage and reasoning complexity. Through controlled scaling experiments, we show that pretraining on HeteroMotion improves downstream real-world performance compared to direct fine-tuning or small-scale pretraining. A joint variance analysis further reveals that HeteroMotion provides broader motion coverage across all robot joints relative to existing community datasets.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 5