MetaGen: Assessing Conditions of Generalization

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Generalization, HPO, Model Comparison, Bayesian
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: The Machine Learning (ML) community has been criticized for irreproducible results that do not generalize. Current ML practice identifies the best-performing configuration by sampling a limited set of configurations (e.g. with Hyperparameter Optimization, HPO) and evaluating their performance. Model comparison is then performed either by aggregating the performance of several trials (e.g. with ANOVA) or based on the best-performing trial (proof-by-existence). We find that both methods of comparison can be inapplicable because the performance metrics across hyperparameter configurations, the performance manifold, have unequal variance (are heteroscedastic). It is therefore important to know which hyperparameters perform robustly, which we term conditions of generalization. To our knowledge, our work is the first to study the model comparison problem in ML as it affects both the safety of existing HPO methods and the explainability of the configurations under which a model improves. We propose MetaGen to estimate uncertainty on the performance manifold and identify contiguous regions of consistent performance. We use MetaGen for post-hoc analysis to compare hyperparameter regions rather than single-point estimates. We also extend MetaGen to an online setting, applying HPO on the hyperparameter region in which a method performs robustly. When used for post-hoc analysis, MetaGen avoids bias from misleading evidence such as outliers or aggregate effects and improves the explainability of a method’s performance. When used in an online manner, our method can improve the safety of systems that must be iteratively re-trained, improving robustness and performance by as much as 5.22% for the three HPO sampling methods that we evaluate. We use the results of 61,475 experimental trials of Transformer, VGG, and ResNet models trained on 8 different datasets, as well as 200,628 trials from 3 NAS benchmarks, to evaluate our method's ability to identify the hyperparameter regions where a model performs robustly. Our method improves model comparison and is robust to heteroscedasticity.
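
The abstract does not specify how MetaGen estimates uncertainty on the performance manifold; the sketch below is one illustrative interpretation, not the authors' implementation. Assuming a Gaussian-process surrogate fitted to hypothetical HPO trial results with heteroscedastic noise, it flags a contiguous set of hyperparameter points whose predicted performance is high and whose predictive variance is low, as a stand-in for a "robust region". All data, kernel choices, and thresholds are assumptions for demonstration.

# Illustrative sketch (assumed GP surrogate, not the paper's MetaGen method).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical HPO trials: (log learning rate, dropout) -> validation accuracy,
# with noise variance that grows with dropout (heteroscedastic trials).
configs = rng.uniform(low=[-5.0, 0.0], high=[-1.0, 0.5], size=(64, 2))
accuracy = 0.9 - 0.05 * (configs[:, 0] + 3.0) ** 2 - 0.1 * configs[:, 1]
accuracy += rng.normal(scale=0.01 + 0.05 * configs[:, 1], size=64)

# Matern models smooth but non-linear structure; WhiteKernel absorbs observation noise.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5) + WhiteKernel(), normalize_y=True)
gp.fit(configs, accuracy)

# Query a grid over the hyperparameter space and keep points with high predicted
# accuracy and low predictive uncertainty as the "robust" hyperparameter region.
grid = np.stack(np.meshgrid(np.linspace(-5, -1, 50), np.linspace(0, 0.5, 25)), -1).reshape(-1, 2)
mean, std = gp.predict(grid, return_std=True)
robust_region = grid[(mean > np.quantile(mean, 0.9)) & (std < np.quantile(std, 0.5))]
print(f"{len(robust_region)} grid points flagged as robust")

Comparing methods by the statistics of such a region, rather than by a single best trial or a pooled ANOVA, is the kind of region-level comparison the abstract describes.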
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4083