Assembly-R1: 3D Assembly Reasoning via RL-based Vision Language Models

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: 3D Assembly, VQA, VLM, GRPO
TL;DR: We present Assembly-R1, a GRPO-based reasoning model, along with a new VQA benchmark named FurniBench.
Abstract: Part assembly requires agents to possess precise spatial interpretation and multi-step structural reasoning. While large Vision-Language Models (VLMs) have shown promising capabilities in general Visual Question Answering (VQA), existing benchmarks inadequately reflect the complexities inherent in assembly reasoning. To bridge this gap, we introduce FurniBench, an assembly-specific VQA benchmark, coupled with FurniQA, a large-scale dataset targeting part recognition, connectivity reasoning, and step planning. Using Qwen2-VL-2B-Instruct (39.1% accuracy on FurniBench) as the base model, we first establish a supervised fine-tuning (SFT) baseline, which highlights both the benefits and the limitations of SFT in this domain. Building on this, we propose Assembly-R1, a model trained via Group Relative Policy Optimization (GRPO). With enhanced reasoning capabilities, Assembly-R1 achieves 73.6% accuracy, outperforming the other baselines on FurniBench by a large margin. Furthermore, consistent gains on zero-shot Out-of-Domain (OOD) spatial understanding and Embodied AI benchmarks indicate that Assembly-R1 acquires transferable spatial skills applicable to broader Embodied AI scenarios. This work establishes FurniBench as a critical resource both for diagnosing deficits in current VLMs and for teaching the fundamental structural reasoning required for downstream applications. We will release the dataset and code upon acceptance of this paper.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18531