ORBIS: Open Dataset Can Rescue You From Dataset Bias

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Dataset bias, Open dataset, Debiasing
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a method for leveraging open datasets, e.g., web data, to mitigate dataset bias.
Abstract: Dataset bias, in the context of machine learning, refers to unintended correlations between target labels and undesirable features in a given training dataset. This phenomenon frequently arises in real-world scenarios and can lead to unintended model behavior. Researchers have devised techniques to alleviate this bias by diminishing the influence of samples with spurious correlations (i.e., bias-aligned samples) while assigning greater importance to the remaining samples (i.e., bias-conflicting samples) during training. Prior approaches have mainly focused on the given training dataset and have not explored the potential of harnessing open datasets, which contain vast numbers of samples. Nonetheless, open datasets may contain noisy information, posing a challenge for straightforward integration. In this paper, we introduce a novel method called ORBIS to tackle dataset bias using open datasets. ORBIS comprises two core components. First, it selects relevant samples from open datasets whose context aligns with the characteristics of the given training dataset. Subsequently, a debiased model is trained using both the training dataset and the selected samples. We assess the effectiveness of the proposed algorithm in conjunction with established debiasing methods and evaluate its performance on both synthetic and real-world benchmark datasets.
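The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. Note this is an assumption-laden toy illustration, not the paper's actual algorithm: the selection criterion (cosine similarity to the training-set feature centroid) and the reweighting heuristic (upweighting high-loss samples as likely bias-conflicting, in the spirit of prior debiasing work) are stand-ins chosen for clarity; the function names `select_relevant` and `debias_weights` are hypothetical.

```python
import numpy as np

def select_relevant(open_feats, train_feats, k):
    """Stage 1 (sketch): pick the k open-dataset samples whose features
    are most similar to the training set's mean feature vector."""
    centroid = train_feats.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    norms = np.linalg.norm(open_feats, axis=1, keepdims=True)
    sims = (open_feats / norms) @ centroid  # cosine similarity per sample
    return np.argsort(sims)[::-1][:k]       # indices of the top-k matches

def debias_weights(losses, alpha=0.7):
    """Stage 2 (sketch): assign larger training weights to samples with
    high loss under a biased reference model, which prior debiasing work
    treats as likely bias-conflicting."""
    w = losses / (losses.max() + 1e-8)      # normalize losses to [0, 1]
    return alpha * w + (1 - alpha)          # keep a floor weight for all samples

# Toy usage: one open sample shares the training context, two do not.
train = np.array([[1.0, 0.0], [0.9, 0.1]])
open_f = np.array([[1.0, 0.05], [0.0, 1.0], [-1.0, 0.0]])
chosen = select_relevant(open_f, train, k=1)        # selects index 0
weights = debias_weights(np.array([0.1, 2.0]))      # second sample weighted higher
```

In a real instantiation, the features would come from a pretrained encoder and the reweighted samples would feed a standard training loop; the point here is only the split into context-based selection followed by debiased training.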
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4850