GenCAD: Image-Conditioned Computer-Aided Design Generation with Transformer-Based Contrastive Representation and Diffusion Priors
Abstract: The creation of manufacturable and editable 3D shapes through Computer-Aided Design (CAD) remains a highly manual and time-consuming task, hampered by the complex topology of boundary representations of 3D solids and by unintuitive design tools. While most work on 3D shape generation focuses on representations such as meshes, voxels, or point clouds, practical engineering applications demand CAD models that are modifiable and manufacturable, along with the ability to generate them conditionally from multiple modalities. This paper introduces GenCAD, a generative model that combines autoregressive transformers, a contrastive learning framework, and latent diffusion models to transform image inputs into parametric CAD command sequences, yielding editable 3D shape representations. Extensive evaluations demonstrate that GenCAD significantly outperforms existing state-of-the-art methods on both unconditional and conditional generation of CAD models. Additionally, the contrastive learning framework of GenCAD enables the retrieval of CAD models from large databases using image queries, a critical challenge within the CAD community. Our results mark a significant step toward realizing the potential of generative models to expedite the entire design-to-production pipeline and seamlessly integrate different design modalities.
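For readers who want a concrete picture of the pipeline the abstract describes, the following is a minimal PyTorch sketch of an image-to-CAD flow in this spirit: contrastively aligned image/CAD embeddings, a diffusion prior that maps an image embedding to a CAD latent, and an autoregressive decoder over command tokens. All names (`img_encoder`, `prior`, `decoder`, `bos_id`) are hypothetical placeholders, not the released GenCAD implementation.

```python
# Minimal sketch of an image-conditioned CAD generation pipeline
# (hypothetical module names; not the authors' code).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, cad_emb, temperature=0.07):
    """CLIP-style InfoNCE loss aligning image and CAD-sequence embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    cad_emb = F.normalize(cad_emb, dim=-1)
    logits = img_emb @ cad_emb.t() / temperature        # (B, B) pairwise similarity
    targets = torch.arange(img_emb.size(0))             # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

@torch.no_grad()
def generate(image, img_encoder, prior, decoder, bos_id=0, max_len=64):
    """Image -> image embedding -> sampled CAD latent -> command tokens."""
    z = prior.sample(img_encoder(image))                # diffusion prior draws a CAD latent
    tokens = torch.full((1, 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = decoder(z, tokens)                     # (1, T, vocab) next-token logits
        nxt = logits[:, -1].argmax(-1, keepdim=True)    # greedy choice of next CAD command
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens
```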
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We sincerely thank the Action Editor for the thoughtful feedback and for carefully engaging with our manuscript and the reviewer comments.
Following the suggestion, we experimented with removing the average pooling operation from our model. However, doing so resulted in a generative model that consistently failed to produce valid CAD programs. We believe this is because each CAD operation must be highly precise and well structured to yield a valid 3D shape. Without average pooling, the model struggles to learn a stable latent representation, leading to invalid outputs. As such, we chose not to include this variant in the final paper, but we acknowledge it as a promising direction for future investigation.
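To make the operation under discussion concrete, below is a minimal sketch (with assumed tensor shapes) of average pooling per-token transformer states into a single fixed-size latent. It is illustrative only, not the paper's exact code.

```python
# Masked mean pooling of per-token encoder states into one latent vector
# (assumed shapes; illustrative sketch).
import torch

def pool_sequence(token_states, padding_mask):
    """token_states: (B, T, D); padding_mask: (B, T), 1 for real tokens.
    Returns a (B, D) latent via a masked mean over the sequence dimension."""
    mask = padding_mask.unsqueeze(-1).float()           # (B, T, 1)
    summed = (token_states * mask).sum(dim=1)           # ignore padded positions
    count = mask.sum(dim=1).clamp(min=1.0)              # guard against empty sequences
    return summed / count                               # (B, D) pooled latent
```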
In addition, to further address the AE's suggestion on the vision backbone, we trained a new image encoder using a Vision Transformer (ViT) architecture in place of ResNet. We have included the results in the updated Table 3 in the Appendix, which shows that it performs comparably to ResNet. We have also made minor edits to Figure 1 and revised several text segments throughout the paper to prepare the manuscript for the camera-ready version.
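As an illustration of such a backbone swap, the sketch below builds either encoder from standard torchvision models and replaces the classification head with a projection to a shared embedding dimension. The specific model variants and `embed_dim` are assumptions; the paper's actual encoder configuration may differ.

```python
# Swapping a ResNet backbone for a ViT backbone (assumed torchvision variants).
import torch.nn as nn
from torchvision import models

def build_image_encoder(backbone="resnet", embed_dim=256):
    if backbone == "resnet":
        net = models.resnet18(weights=None)
        net.fc = nn.Linear(net.fc.in_features, embed_dim)    # replace classifier head
    else:  # "vit"
        net = models.vit_b_16(weights=None)
        net.heads = nn.Linear(net.hidden_dim, embed_dim)     # project the class token
    return net
```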
Video: https://gencad.github.io/static/videos/teaser.mp4
Code: https://gencad.github.io/
Assigned Action Editor: ~Kui_Jia1
Submission Number: 3329