Keywords: Vision-Language Models, Code Generation, Digital Gene
Abstract: Building structured world representations for robotic agents that can generalize and interact with the physical world is a core challenge in AI. The recently proposed $\textit{Digital Gene}$ offers a promising direction by representing objects as explicit, programmatic blueprints, addressing the generalization and interpretability bottlenecks of end-to-end learning paradigms. However, the practical application of this technology is hindered by a critical bottleneck: creating these genes for real-world objects. Prior methods rely on 3D data, which is difficult to acquire at scale, while parsing directly from ubiquitous 2D images remains an unsolved challenge.
In this work, we introduce $\textbf{GeneVLM}$, a vision-language framework that addresses this bottleneck by automatically parsing an executable Digital Gene from a single 2D image.
First, we propose a specialized and scalable model designed for the image-to-gene parsing task. Second, to enable its training, we design an efficient and scalable procedural pipeline that synthesizes a diverse, multi-million-pair dataset of images and their corresponding Digital Genes. Third, to facilitate rigorous evaluation, we establish and release the first comprehensive, multi-dimensional benchmark for this task. Our experiments show that GeneVLM successfully recovers complex object structures and exhibits consistent performance gains as model size increases, validating the effectiveness of our integrated approach.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18765