Significance of Fairly Distributed Instances and Optimal Ratio for Validation Set in Machine Learning

Hina Nasir; Dr Archana Pandita; Chaudhary Nauman bin Nasir

24 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission

Primary Area: general machine learning (i.e., none of the above)

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Data Split, Support Points, SPlit, Validation Set, Optimal Ratio, Significance of Validation Set

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: Much research aims for balanced training data but neglects whether the validation set is equally representative. This study shows that drawing a fairly distributed validation set during data splitting improves model accuracy.

Abstract: Machine learning plays a crucial role in many research areas and industries. The effectiveness of machine learning models relies heavily on the quality and quantity of training data. To evaluate model performance on unseen data, the data must be divided into training and testing sets. A three-way split into training, validation, and test sets is also commonly used to build robust, generalizable models. The validation set is used to tune hyperparameters, which helps mitigate overfitting. It is therefore essential that each of the three sets (training, validation, and test) accurately represents the full data. Previous research has explored statistical techniques such as 'SPlit' that aim to make the test set representative of the complete data. Despite the use of these techniques, there is insufficient evidence that the validation set receives the same treatment. Although cross-validation is widely used for validation, a randomly selected validation subset may not fully represent the overall data, which hinders the creation of a generalized model suitable for the test data. This work extracts validation sets using the Support Points method in 'SPlit' so that they represent the data accurately. Results demonstrate a significant accuracy improvement when both test and validation sets are selected using the Support Points method.
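
The concern raised in the abstract can be illustrated with a minimal sketch. It is not the authors' method and not the 'SPlit' implementation: it performs a plain random three-way split, the baseline the abstract argues against, and then measures how well the validation subset represents the full data using the energy distance, the criterion that support points are defined to minimize. The function names `three_way_split` and `energy_distance`, the 60/20/20 ratio, and the synthetic one-dimensional Gaussian data are assumptions made for illustration only.

```python
import random


def three_way_split(data, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Randomly partition data into train, validation, and test lists.

    Hypothetical helper: a plain random split, not the SPlit algorithm.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    n_val = round(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def energy_distance(xs, ys):
    """Energy distance between two 1-D samples (V-statistic form).

    It is 0 when both samples have the same empirical distribution;
    larger values mean one sample represents the other less well.
    """
    def mean_abs_diff(a, b):
        return sum(abs(p - q) for p in a for q in b) / (len(a) * len(b))

    return (2 * mean_abs_diff(xs, ys)
            - mean_abs_diff(xs, xs)
            - mean_abs_diff(ys, ys))


# Synthetic data, assumed purely for illustration.
data_rng = random.Random(42)
data = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

train, val, test = three_way_split(data)
print("sizes:", len(train), len(val), len(test))
print("validation vs. full data:", energy_distance(val, data))
print("test vs. full data:", energy_distance(test, data))
```

Rerunning the split with different `seed` values shows how much the representativeness of a randomly drawn validation set can vary from one draw to the next. A support-points-based selection would instead choose the subset so that this distance is as small as possible.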

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 9427
