Keywords: Text-to-CAD, Command Sequence Representation, LLM, Pointer
Abstract: Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired LLM-based CAD generation by representing CAD models as command sequences. However, these methods struggle in practical scenarios because the command sequence representation does not support entity selection~(e.g., faces or edges), limiting their ability to support complex editing operations such as \textit{chamfer} or \textit{fillet}.
Moreover, the discretization of continuous variables during \textit{sketch} and \textit{extrude} operations may introduce topological errors.
To address these limitations, we present \textbf{Pointer-CAD}, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CAD model generation into steps, conditioning each step on both the textual description and the B-rep produced by previous steps. Whenever an operation requires selecting a specific geometric entity, the LLM predicts a \textit{Pointer} that picks the most feature-consistent candidate from the available set. This selection mechanism also reduces the quantization error inherent in command sequence representations.
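The pointer step can be illustrated with a minimal dot-product scoring sketch. This is an illustrative toy, not the paper's implementation: the function name, the 2-D embeddings, and the softmax scoring are all hypothetical assumptions.

```python
import math

def pointer_select(query, candidates):
    """Toy pointer over a candidate set: score each candidate entity
    embedding against a decoder query vector via dot product, normalize
    with a softmax, and return the index of the best-matching entity.
    (Hypothetical sketch; not the actual Pointer-CAD architecture.)"""
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    # Numerically stable softmax over the candidate scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy example: three candidate face/edge embeddings and one query.
candidates = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.1]]
query = [1.0, 1.0]
idx, probs = pointer_select(query, candidates)
# idx == 1: the candidate whose features best match the query
```

Because the pointer chooses among entities that already exist in the generated B-rep, the selected geometry is exact rather than reconstructed from discretized coordinates, which is the intuition behind the reduced quantization error.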
To support the training of Pointer-CAD, we develop a data annotation pipeline that produces expert-level natural-language descriptions and use it to build a dataset of approximately \textbf{575K CAD models}.
Extensive experiments demonstrate that Pointer-CAD effectively generates complex geometric structures and reduces segmentation error to the order of $10^{-3}$, \textbf{a $100\times$ improvement} over prior methods, significantly mitigating the topological inaccuracies introduced by quantization.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9122