From Aerial Twins to VLA Tuples: A Zero-NRE Data Factory with Caption Drift Evaluation

Published: 13 May 2026, Last Modified: 13 May 2026
Venue: ICRA 2026: From Data to Decisions (Poster)
License: CC BY 4.0
Keywords: sim-to-real transfer, Vision-Language-Action models, urban robotics, data factory, domain randomization, evaluation metrics, digital twins, synthetic data generation
Abstract: Urban robotic systems, including autonomous vehicles (AVs), UAVs, humanoid robots, and sidewalk delivery robots, share a common data bottleneck: generating perceptually realistic, geographically diverse training environments without expensive ground-vehicle survey fleets or prohibitive non-recurring engineering (NRE) costs. We present a VLM Data Factory: a four-stage, zero-NRE pipeline that combines aerial digital twins, cloud-based physics simulation, a video-to-video world model for perceptual augmentation, and a vision-language model for automated semantic annotation, all on a pay-as-you-go basis. We further introduce caption drift, a geometry-invariant evaluation signal derived from changes in automatically generated scene captions under controlled perceptual variation. Because geometry, agent trajectories, and physics are held fixed while only perception changes, any caption shift is attributable solely to the perceptual domain gap. We demonstrate caption drift across 20+ structured conditions spanning weather, lighting, and world-model guidance parameters, and show that it qualitatively tracks perceptual augmentation intensity. The pipeline generates complete VLA training tuples at $0.25 USD s⁻¹ per augmentation condition ($5.00 USD s⁻¹ across all 20 conditions), replacing NRE investments with usage-proportional expenditure.
Submission Number: 35
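
To make the caption drift idea in the abstract concrete, the following is a minimal sketch, not the authors' metric. The abstract does not say how caption changes are scored, so this sketch assumes a token-level Jaccard distance between a baseline caption and the caption produced under a perceptual variation. The function names and the example captions are hypothetical. The last helper only reproduces the cost arithmetic stated in the abstract (0.25 USD s⁻¹ per condition over 20 conditions).

```python
from __future__ import annotations

import re


def tokenize(caption: str) -> set[str]:
    """Lowercase a caption and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", caption.lower()))


def caption_drift(baseline: str, variant: str) -> float:
    """Assumed drift score: Jaccard distance between caption token sets.

    0.0 means the captions use the same tokens; 1.0 means no tokens
    are shared. This scoring rule is an illustrative assumption.
    """
    a, b = tokenize(baseline), tokenize(variant)
    if not a and not b:
        return 0.0  # two empty captions are treated as identical
    return 1.0 - len(a & b) / len(a | b)


def total_cost_per_second(cost_per_condition: float, num_conditions: int) -> float:
    """Cost per second when every augmentation condition is generated."""
    return cost_per_condition * num_conditions


if __name__ == "__main__":
    # Hypothetical captions for one fixed scene under different conditions.
    baseline = "a delivery robot crosses a sunny street"
    variants = {
        "light_rain": "a delivery robot crosses a wet street",
        "heavy_fog": "a blurry object in dense fog",
    }
    for name, caption in variants.items():
        print(name, round(caption_drift(baseline, caption), 3))

    # Figures taken from the abstract: 0.25 USD/s per condition, 20 conditions.
    print(total_cost_per_second(0.25, 20))
```

Because geometry, trajectories, and physics are held fixed across conditions, any nonzero score in such a comparison would reflect only the perceptual change. An embedding-based similarity could be substituted for the token overlap without changing the overall structure.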