CAD-bench: Benchmarking Language Models on Functional CAD Generation

Dhruv Saini

Published: 23 May 2026, Last Modified: 23 May 2026

Venue: ICML 2026 AIWILD

Readers: Everyone

License: CC BY 4.0

Keywords: CAD, parametric modeling, executable benchmarks, LLM evaluation, code generation, geometric correctness, rigid-body simulation, engineering AI

TL;DR: CAD-bench is an execution-based benchmark for evaluating language models on functional parametric CAD generation.

Abstract: Language-model agents are increasingly able to operate computer-aided design (CAD) toolchains, but current evaluations often measure only executable code, rendered appearance, or coarse geometric similarity. These proxies can miss the properties that matter for reliable artifact-producing agents in engineering settings: exact dimensions, mating interfaces, standards-like details, and functional behavior of assemblies. We introduce CAD-bench, an execution-based benchmark for evaluating language-model CAD agents. CAD-bench contains 17 tasks across four difficulty tiers, ranging from basic solids to threaded mating pairs and functional gear trains. Each submission is executed, exported as geometry, and evaluated with task-specific checks, including dimensional and pose verification, reference-geometry gates, thread-profile analysis, and Blender-based rigid-body simulation. The benchmark supports both one-shot CAD-code generation and agent harnesses that produce final STEP artifacts in an executable environment. Initial results show that CAD-bench is not saturated by current systems. The strongest standalone model reaches 59.9% overall, while functional tasks remain near zero for most standalone runs. Agent harnesses perform better than one-shot generation, but still fail frequently on interfaces, standards-like details, and mechanisms. CAD-bench therefore exposes a gap between producing runnable CAD artifacts and producing CAD artifacts that satisfy mechanically meaningful task requirements, a distinction that is central to evaluating agents intended to operate in real engineering workflows.

Track: Regular Paper (9 pages)

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 279
