Assembly-R1: 3D Assembly Reasoning via RL-based Vision Language Models

ICLR 2026 Conference Submission18531 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: 3D Assembly, VQA, VLM, GRPO

TL;DR: We present Assembly-R1, a GRPO-based reasoning model, along with a new VQA benchmark, FurniBench.

Abstract: 3D assembly tasks require automated agents to interpret visual scenes precisely and to reason about structure. While large Vision-Language Models (VLMs) have shown promising capabilities in general Visual Question Answering (VQA), existing benchmarks do not adequately reflect the complexities of assembly reasoning. In this paper, we introduce FurniBench, an assembly-specific VQA benchmark, together with FurniQA, a large-scale dataset covering tasks such as part recognition, connection reasoning, and step ordering. Using Qwen2-VL-2B-Instruct as the base model (39.1% accuracy on FurniBench), we first establish a supervised fine-tuning (SFT) baseline, which highlights both the benefits and the limitations of SFT in this domain. Building on this, we propose Assembly-R1, trained with Group Relative Policy Optimization (GRPO), which substantially improves reasoning ability and achieves 71.7% accuracy, outperforming the base model, the SFT baseline, and other open-source and closed-source commercial VLMs. Our results demonstrate that reinforcement learning offers a more robust path toward generalizable 3D assembly reasoning. We will release the dataset and code upon acceptance of this paper.
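
For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage step: several answers are sampled for one question, each is scored, and each reward is normalized against the group's mean and standard deviation. The exact-match reward, the function names, and the sample answers are hypothetical and chosen only for illustration; the abstract does not describe the reward design or training setup used for Assembly-R1.

```python
from statistics import mean, pstdev


def exact_match_reward(predicted: str, reference: str) -> float:
    """Hypothetical reward (an assumption, not the paper's): 1.0 on a
    case-insensitive exact match with the reference answer, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    A_i = (r_i - mean(r)) / (std(r) + eps)
    Population standard deviation and the eps value are choices made
    for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up sampled answers to one multiple-choice question whose
# reference answer is "B".
samples = ["B", "b ", "C", "A"]
rewards = [exact_match_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Answers that score above the group average receive positive advantages and the others negative ones, so no separate value network is needed to form a baseline. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.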

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 18531
