Towards Data-Efficient VLA Post-Training: a case study of an industrial task

Published: 13 May 2026, Last Modified: 13 May 2026 · ICRA 2026: From Data to Decisions Poster · CC BY 4.0
Keywords: Data-Efficient Learning, Low-data regime, Real-world evaluation, Humanoid Robots, Vision-Language-Action (VLA) Models, Industrial manipulation, New embodiment adaptation, Diffusion-based policies
TL;DR: We post-train the NVIDIA GR00T VLA on an industrially motivated box-picking task using only 60 teleoperated episodes, achieving a 97% success rate.
Abstract: Warehouse and logistics environments expose the limitations of rigid automation, where infrequent but costly edge cases, such as misaligned or damaged packages, remain unresolved. Humanoid robots are increasingly viewed as a flexible solution, motivating box picking as a practical benchmark for real-world deployment. In this work, we explore whether a Vision-Language-Action (VLA) model can provide a reliable, specialist humanoid policy in the low-data regime characteristic of early industrial deployment. We post-train NVIDIA GR00T N1.5 on a box-picking task using only 60 teleoperated demonstrations. We introduce a pragmatic data collection strategy focused on clear behaviour decomposition and sufficient per-behaviour coverage, emphasising that careful dataset design, rather than size alone, is critical for reliable low-data deployment. Despite the limited dataset, the resulting policy achieves a 97.0% success rate and generalises to unseen box orientations and substantially different lighting conditions. We compare our approach to the Improved 3D Diffusion Policy (iDP3), a from-scratch diffusion-based humanoid model, trained on the same dataset. Unlike the VLA, iDP3 fails to reliably learn key behaviours, highlighting the advantage of large-scale VLA pre-training when adapting to new humanoid embodiments for bounded industrial tasks.
Submission Number: 17