Keywords: Goal Specification, Diffusion Models
TL;DR: We generate stable, multi-level 3D structures from a rough 2D front-view sketch, even when the sketch is incomplete or imprecise.
Abstract: Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today’s robot manipulation systems cannot act on such sketches directly: they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools such as CAD. We present *StackItUp*, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. *StackItUp* introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing symbolic geometric relations (e.g., *left-of*) and stability patterns (e.g., *two-pillar-bridge*) while discarding noisy metric details from the sketch. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates the graph by predicting hidden internal and rear supports, which are critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, *StackItUp* consistently produces stable, multi-level 3D structures and outperforms all baselines in both stability and visual resemblance.
Supplementary Material: zip
Submission Number: 587