Atomic Data Groups: An issue in train-test splits for the real world as demonstrated through digital hardware design

ICLR 2024 Workshop DMLR Submission 89 Authors

Published: 04 Mar 2024, Last Modified: 02 May 2024
Venue: DMLR @ ICLR 2024
License: CC BY 4.0
Keywords: VLSI, CAD, FPGA, routing, supervised learning, information leakage
TL;DR: We identify the existence of atomic data groups, sets of highly correlated data that co-occur in the real world, and we produce results showing that they can cause model generalization overestimation if not accounted for.
Abstract: Machine learning (ML) has proved useful across a wide range of scientific applications. Supervised learning in particular has been successfully applied for solving prediction problems in the domain of very-large-scale integration computer-aided design (VLSI CAD), where function-based designs of digital hardware must be translated into physical designs for implementation on semiconductor devices. To avoid overestimating ML models' generalization capabilities for real-world deployments in such domains, good practices utilize realistic data and avoid test set information leakage during model preparation. In this paper we identify a further consideration in the form of atomic data groups, which are sets of very highly correlated data that may also lead to such overestimation if not accounted for in train-test splits during model evaluation. We investigate the potential impact of atomic data groups in experimental design through a case study of the VLSI CAD circuit design routing process for field-programmable gate arrays (FPGAs). Our investigations show that model performance in deployment is overestimated by 38% in this case study when atomic data groups are ignored. We hope that these results encourage other ML practitioners in different scientific domains to be critical of their train-test splits and identify when atomic data groups are relevant to their model evaluations.
Primary Subject Area: Domain specific data issues
Paper Type: Research paper: up to 8 pages
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 89
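
Illustrative sketch (not from the paper): the abstract argues that atomic data groups should be accounted for in train-test splits. One common way to do this is to split by group rather than by sample, so that every group lands wholly in either the training set or the test set. The code below is a minimal sketch of that idea using only the Python standard library. The function name `group_train_test_split`, the `design_*` labels, and the choice of "one circuit design = one atomic group" are assumptions made for illustration; the paper may define groups differently for FPGA routing data.

```python
import random
from collections import defaultdict


def group_train_test_split(group_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that no group appears in both train and test.

    group_ids: one (hypothetical) group label per sample, e.g. the design
    a routing sample was extracted from.
    """
    # Collect the sample indices belonging to each group.
    members = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        members[gid].append(idx)

    # Shuffle the groups (not the samples) reproducibly.
    unique_groups = sorted(members)
    random.Random(seed).shuffle(unique_groups)

    # Hold out whole groups for testing; always keep at least one.
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = unique_groups[:n_test]
    train_groups = unique_groups[n_test:]

    train_idx = sorted(i for g in train_groups for i in members[g])
    test_idx = sorted(i for g in test_groups for i in members[g])
    return train_idx, test_idx


# Hypothetical data: two samples drawn from each of four designs.
design_of_sample = [
    "design_a", "design_a",
    "design_b", "design_b",
    "design_c", "design_c",
    "design_d", "design_d",
]
train_idx, test_idx = group_train_test_split(design_of_sample, test_fraction=0.25)

# No design contributes samples to both sides of the split.
train_designs = {design_of_sample[i] for i in train_idx}
test_designs = {design_of_sample[i] for i in test_idx}
assert train_designs.isdisjoint(test_designs)
```

A plain per-sample shuffle would typically place samples from the same design on both sides of the split, which is the situation the abstract warns can inflate estimated generalization. Libraries such as scikit-learn offer comparable group-aware splitters (for example `GroupShuffleSplit`), which may be preferable in practice.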