Keywords: 3D Assembly, VQA, VLM, GRPO
TL;DR: We present Assembly-R1, a GRPO-based reasoning model, along with FurniBench, a new VQA benchmark.
Abstract: 3D assembly tasks require automated agents to interpret visual scenes precisely and to reason about structure. While large Vision-Language Models (VLMs) have shown promising capabilities in general Visual Question Answering (VQA), existing benchmarks do not adequately reflect the complexities of assembly reasoning. In this paper, we introduce FurniBench, an assembly-specific VQA benchmark, together with FurniQA, a large-scale dataset covering tasks such as part recognition, connection reasoning, and step ordering. Using Qwen2-VL-2B-Instruct as the base model (39.1% accuracy on FurniBench), we first establish a supervised fine-tuning (SFT) baseline, which highlights both the benefits and the limitations of SFT in this domain. Building on this, we propose Assembly-R1, trained with Group Relative Policy Optimization (GRPO), which substantially enhances reasoning ability and achieves 71.7% accuracy, outperforming the base model, the SFT baseline, and other open-source and closed-source commercial VLMs. Our results demonstrate that reinforcement learning offers a more robust path toward generalizable 3D assembly reasoning. We will release the dataset and code upon acceptance of this paper.
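For readers unfamiliar with GRPO, the core idea can be sketched as scoring each sampled answer relative to the other answers drawn for the same question. The snippet below is a minimal illustration of that group-relative normalization only, not the Assembly-R1 training code; the function name, the `eps` value, and the binary correctness rewards are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group.

    `rewards` holds one scalar per answer sampled for the same question.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one VQA question:
# reward 1.0 if the answer matches the label, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline. When every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.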
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18531