ExtraMix: Extrapolatable Data Augmentation for Regression using Generative Models

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission
Keywords: mixup, out-of-distribution, optimization, generative models, molecule
TL;DR: We introduce a new data augmentation method for non-Euclidean data in regression tasks. The method extends the mixup concept to generate extrapolated samples, producing reliable pseudo-labels and improving property predictors.
Abstract: The primary objective of materials science is the discovery of novel materials. Because unseen regions of chemical space are likely to contain target materials (molecules), high predictive accuracy in out-of-distribution and few-shot regions is essential. However, labeled data in materials science are limited because of high labeling costs. Numerous techniques have been proposed to overcome such difficulties in the image and text domains, but applying them to material data is difficult because the data consist of combinatorial (non-Euclidean) inputs and continuous labels. In particular, in mixup-based methods, the mixed labels cluster in the middle range of the training set, which renders the generated structured samples invalid. In this study, a novel data augmentation method is proposed for non-Euclidean inputs in regression tasks. (1) A mixup technique capable of extrapolation is defined to broaden not only the structure distribution but also the label distribution; in contrast to existing mixup-based methods, the proposed method minimizes label imbalance. (2) The proposed method refines the pseudo-labels obtained from the mixup-based approach using the decoder's knowledge of a generative model. We show that the proposed method generates high-quality pseudo data on the ZINC database. Furthermore, a phosphorescent organic light-emitting diode dataset is used to demonstrate that the method is effective for real-world problems with large molecules and highly complex properties. Moreover, the method improves property prediction models.
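The submission includes no code, but the core idea the abstract describes can be sketched: standard mixup interpolates two samples and their labels with a coefficient λ in [0, 1], so mixed labels collapse toward the middle of the training range; allowing λ outside [0, 1] extrapolates both the representation and the label. The latent codes `z_i`, `z_j` below are hypothetical stand-ins for a generative model's encodings of two molecules; this is a minimal illustration of the mixing rule, not the authors' implementation.

```python
import numpy as np

def extrapolative_mixup(z_i, z_j, y_i, y_j, lam):
    """Mix two latent codes and their regression labels.

    With lam in [0, 1] this reduces to standard mixup, whose mixed
    labels stay inside [min(y_i, y_j), max(y_i, y_j)]. With lam
    outside [0, 1] the pair is extrapolated, so the mixed label can
    fall outside the label range of the training pair.
    """
    z_mix = lam * z_i + (1.0 - lam) * z_j
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return z_mix, y_mix

# Two hypothetical latent codes with labels 0.0 and 4.0.
z_i, z_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Interpolation (lam = 0.5): label lands mid-range, as in plain mixup.
_, y_mid = extrapolative_mixup(z_i, z_j, 0.0, 4.0, 0.5)   # y_mid = 2.0

# Extrapolation (lam = 1.5): label escapes the [0.0, 4.0] range.
z_out, y_out = extrapolative_mixup(z_i, z_j, 0.0, 4.0, 1.5)  # y_out = -2.0
```

In the paper's setting, the extrapolated latent code would then be decoded by the generative model into a discrete molecular structure, and the pseudo-label refined using the decoder's knowledge, as the abstract outlines.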
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Machine Learning for Sciences (eg biology, physics, health sciences, social sciences, climate/sustainability )