Abstract: Data valuation is an emerging area in machine learning that aims to assess the contribution of training data to a model’s performance on unseen data. Among existing approaches, Data-OOB leverages the out-of-bag (OOB) estimate to enable efficient data valuation and has shown promising results in tasks such as noisy data detection. However, our analysis reveals two key limitations of Data-OOB: (1) the weak learners it employs are vulnerable to noisy data, and (2) it loses information when exploiting the weak learners’ predictions. To address these issues, we propose a novel data valuation framework called OOB-CM. This method assigns a dynamic threshold to each data point and uses it to define a personalized curriculum for every weak learner. The curriculum selectively filters training samples at each stage of training, shielding the weak learners from the influence of noise. Furthermore, we introduce a multi-round voting strategy that aggregates the discriminative signals of the weak learners by selecting the top-\(p\) predicted classes for each sample, which significantly improves the utilization of the weak learners’ predictive power. We conduct comprehensive evaluations across multiple scenarios, and the results consistently demonstrate that OOB-CM outperforms existing methods in terms of robustness and effectiveness.
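To make the OOB-based scoring concrete, the sketch below is a minimal illustration (not the authors’ OOB-CM implementation) of Data-OOB-style valuation with a hypothetical top-\(p\) voting rule: each point is scored by how often the weak learners for which it is out-of-bag rank its true label among their top-\(p\) predicted classes. The curriculum-filtering component of OOB-CM is not shown, and the names `oob_topp_values`, `top_p`, and the depth-3 trees are illustrative assumptions.

```python
# Minimal sketch of OOB-style data valuation with a hypothetical top-p voting
# rule. X and y are NumPy arrays; this is NOT the paper's released code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_topp_values(X, y, n_estimators=200, top_p=2, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    hits = np.zeros(n)    # accumulated top-p "votes" per data point
    counts = np.zeros(n)  # number of times each point was out-of-bag

    for _ in range(n_estimators):
        # Bootstrap sample; points not drawn are out-of-bag for this weak learner.
        idx = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), idx)
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(
            max_depth=3, random_state=int(rng.integers(1 << 31))
        )
        tree.fit(X[idx], y[idx])

        # Top-p voting: credit an OOB point if its true label is among the
        # learner's p highest-probability classes.
        proba = tree.predict_proba(X[oob])
        top = np.argsort(proba, axis=1)[:, -top_p:]
        top_classes = tree.classes_[top]
        hits[oob] += (top_classes == y[oob, None]).any(axis=1)
        counts[oob] += 1

    # Value of each point = fraction of its OOB learners that "vote" for it.
    return np.divide(hits, counts, out=np.zeros(n), where=counts > 0)
```

Setting `top_p=1` reduces the score to the plain OOB accuracy underlying Data-OOB, which is the sense in which the top-\(p\) variant retains more of each weak learner’s discriminative signal.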
External IDs: dblp:conf/gpc/JiaoSWJLZ24