Synthetic Data as Validation

Qixin Hu; Alan Yuille; Zongwei Zhou

Synthetic Data as Validation

Qixin Hu, Alan Yuille, Zongwei Zhou

16 Sept 2023 (modified: 11 Feb 2024)Submitted to ICLR 2024EveryoneRevisionsBibTeX

Supplementary Material: zip

Primary Area: generative models

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Data Synthesis, Cancer Detection, Medical Image Analysis

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract:

This study leverages synthetic data as a validation set to reduce overfitting and ease the selection of the best model in AI development. While synthetic data have been used for augmenting the training set, we find that synthetic data can also significantly diversify the validation set, offering marked advantages in domains like healthcare, where data are typically limited, sensitive, and from out-domain sources (i.e., hospitals). In this study, we illustrate the effectiveness of synthetic data for early cancer detection in computed tomography (CT) volumes, where synthetic tumors are generated and superimposed onto healthy organs, thereby creating an extensive dataset for rigorous validation. Using synthetic data as validation can improve AI robustness in both in-domain and out-domain test sets. Furthermore, we establish a new continual learning framework that continuously trains AI models on a stream of out-domain data with synthetic tumors. The AI model trained and validated in dynamically expanding synthetic data can consistently outperform models trained and validated exclusively on real-world data. Specifically, the DSC score of liver tumor segmentation improves from 26.7% (95% CI: 22.6%--30.9%) to 34.5% (30.8%--38.2%) when evaluated on an in-domain dataset and from 31.1% (26.0%--36.2%) to 35.4% (32.1%--38.7%) on an out-domain dataset. Importantly, the performance gain is particularly significant in identifying very tiny liver tumors (radius $<$ 5mm) in CT volumes, with Sensitivity improving from 33.1% to 55.4% on an in-domain dataset and 33.9% to 52.3% on an out-domain dataset, justifying the efficacy in the early detection of cancer. The application of synthetic data, from both training and validation perspectives, underlines a promising avenue to enhance AI model robustness when dealing with data from varying domains. Our code is attached as supplementary and will be publicly available.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 743

Loading