Blox-Net: Generative Design-for-Robot-Assembly Using VLM Supervision, Physics Simulation, and a Robot with Reset

Published: 19 May 2025, Last Modified: 12 Nov 2025 · ICRA · CC BY 4.0
Abstract: Generative AI systems have shown impressive capabilities in creating text, code, and images. Inspired by the importance of research in industrial Design for Assembly, we introduce a novel problem: Generative Design-for-Robot-Assembly (GDfRA). The task is to generate an assembly based on a natural language prompt (e.g., “giraffe”) and an image of available physical components, such as 3D-printed blocks. The output is an assembly, a spatial arrangement of these components, accompanied by instructions for a robot to build it. The output geometry must 1) resemble the requested object and 2) be reliably assembled by a 6-DoF robot arm with a suction gripper. We then present Blox-Net, a GDfRA system that combines generative vision-language models with well-established methods in computer vision, simulation, perturbation analysis, motion planning, and physical robot experimentation to solve a class of GDfRA problems without human supervision. Blox-Net achieved a Top-1 semantic accuracy of 63.5% for its designed assemblies. Six designs, after Blox-Net's automated perturbation redesign, were reliably assembled by a robot, achieving near-perfect success across 10 consecutive assembly iterations, with human intervention only during reset prior to assembly. The entire pipeline from text prompt to reliable physical assembly runs without human intervention. Project Page: https://bloxnet.org/
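To make the GDfRA problem statement concrete, below is a minimal, hypothetical sketch of the input/output structure the abstract describes: a text prompt plus a set of available blocks in, and a spatial arrangement with an ordered build sequence out. All names, fields, and the bottom-up ordering heuristic are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a GDfRA output representation (not the Blox-Net code).
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BlockPlacement:
    """One available component placed in the assembly: which block and its pose."""
    block_id: int                            # index into the set of available blocks
    position: Tuple[float, float, float]     # (x, y, z) in meters, world frame
    yaw_deg: float = 0.0                     # rotation about the vertical axis


@dataclass
class Assembly:
    """A GDfRA output: a spatial arrangement plus an ordered pick-and-place plan."""
    prompt: str                              # e.g., "giraffe"
    placements: List[BlockPlacement] = field(default_factory=list)

    def build_order(self) -> List[BlockPlacement]:
        # Simple heuristic: place lower blocks first so each block rests on
        # already-placed supports. A full system would additionally verify
        # stability in physics simulation and reachability via motion planning.
        return sorted(self.placements, key=lambda p: p.position[2])


# Toy example: a two-block "tower" assembly.
tower = Assembly(
    prompt="tower",
    placements=[
        BlockPlacement(block_id=1, position=(0.0, 0.0, 0.05)),
        BlockPlacement(block_id=0, position=(0.0, 0.0, 0.0)),
    ],
)
for step in tower.build_order():
    print(f"place block {step.block_id} at {step.position}")
```

In the paper's pipeline, such a candidate arrangement would come from the vision-language model and then be filtered and refined by simulation and perturbation analysis before a robot executes the pick-and-place steps.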