Keywords: Dataset, UI Agents, Video Dataset, 3D Reasoning
TL;DR: VideoCAD is a dataset that evaluates AI's ability to interact with complex engineering tools and reason about 3D shapes
Abstract: Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt to model UI interactions for precision engineering tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs.
Compared to existing datasets, VideoCAD offers an order-of-magnitude increase in complexity for real-world engineering UI tasks, with time horizons up to $20\times$ longer than those in other datasets. We show two important downstream applications of VideoCAD:
(1) learning UI interactions from professional 3D CAD tools for precision tasks and (2) a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models (LLMs) on spatial reasoning and video understanding. To learn the UI interactions, we propose VideoCADFormer, a state-of-the-art model for learning CAD interactions directly from video, which outperforms existing behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies. Dataset and code available at: https://github.com/ghadinehme/VideoCAD.
Croissant File: json
Dataset URL: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WX8PCK&faces-redirect=true
Code URL: https://github.com/ghadinehme/VideoCAD
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1079
Loading