Generalization Error Bounds for Learning under Censored Feedback

TMLR Paper3358 Authors

19 Sept 2024 (modified: 29 Sept 2024) · Under review for TMLR · CC BY 4.0
Abstract: Generalization error bounds from learning theory provide statistical guarantees on how well an algorithm will perform on previously unseen data. In this paper, we characterize the impacts of data non-IIDness due to censored feedback (a.k.a. selective labeling bias) on such bounds. We first derive an extension of the well-known Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which characterizes the gap between empirical and theoretical CDFs given \emph{IID} data, to problems with \emph{non-IID data due to censored feedback}. We then use this CDF error bound to provide a bound on the generalization error guarantees of a classifier trained on such non-IID data. We show that existing generalization error bounds (which do not account for censored feedback) fail to correctly capture the model's generalization guarantees, verifying the need for our bounds. We further analyze the effectiveness of (pure and bounded) exploration techniques, proposed by recent literature as a way to alleviate censored feedback, on improving our error bounds. Together, our findings illustrate how a decision maker should account for the trade-off between strengthening the generalization guarantees of an algorithm and the costs incurred in data collection when future data availability is limited by censored feedback.
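For background on the abstract's starting point, the following is a minimal, purely illustrative sketch of the classical DKW inequality for IID data (not the paper's extension to censored feedback). The standard normal data distribution, sample size, and confidence level below are arbitrary choices for illustration.

```python
# Illustrative sketch: classical DKW inequality for IID data.
# With probability >= 1 - delta, sup_x |F_n(x) - F(x)| <= sqrt(ln(2/delta) / (2n)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, delta = 1000, 0.05                          # sample size and failure probability
samples = np.sort(rng.standard_normal(n))      # IID draws from the true distribution F

ecdf_right = np.arange(1, n + 1) / n           # empirical CDF F_n at each sample point
ecdf_left = np.arange(0, n) / n                # value of F_n just below each sample point
true_cdf = norm.cdf(samples)
sup_gap = max(np.max(np.abs(ecdf_right - true_cdf)),
              np.max(np.abs(ecdf_left - true_cdf)))

eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))  # DKW bound at confidence 1 - delta
print(f"observed sup gap = {sup_gap:.4f}, DKW bound = {eps:.4f}")
```

Under censored feedback, the observed samples are no longer IID draws from the underlying distribution, which is why the paper derives a modified CDF error bound rather than applying DKW directly.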
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=rvoOttpqpY
Changes Since Last Submission: Dear AE and Reviewers: Enclosed please find our updated manuscript. Following the comments received on our earlier submission, we have made a number of changes to our problem setting, theoretical analysis and results, and numerical experiments.

Specifically, as in the existing literature on active learning, we assume that the algorithm starts from an initial training dataset containing exactly $n_y$ samples from each label group $y$, and accordingly determines a fixed decision threshold $\theta$. We continue to follow this literature by assuming that $n_y$ and $\theta$ are fixed and known. However, in this paper we introduce the new consideration that the threshold $\theta$ leads to censored feedback, so that the number $m_y$ of the $n_y$ starting samples falling below the decision threshold becomes of interest; this $m_y$ is random, as it depends on the realization of the training data. Therefore, rather than assuming that the number $m$ of initial samples falling below the decision threshold $\theta$ is a constant (as in the earlier version), we now treat it as a random variable (see the illustrative sketch after the references below). With this change, we have modified our theorems and corollaries and redone the numerical experiments. As a result, our bounds in Theorems 2 and 5 are now derived using the law of total probability, conditioning on the realization of $m$ in the censored region and leveraging our results for fixed $m$. Similarly, with the introduction of exploration, our bounds in Theorems 3 and 6 are also derived using the law of total probability, conditioning on the realizations of the samples in both the censored and explored regions.

Experimentally, all previous claims and findings remain valid. The differences in the results in Figures 3 and 5 are due to the randomness of the generated synthetic data; likewise, the differences in Figure 4 are caused by randomness in the sample arrival order, in exploration, and in the number of samples within the censored region. We also provide a numerical comparison of the \emph{a priori} and \emph{a posteriori} bounds in the appendix.

We again note that our analytical results are based on a fixed decision threshold $\theta$, which is a common assumption in the active learning literature. For instance, in [1, 2] the initial decision thresholds are derived from a realized/known dataset, in which case both $m$ and $\theta$ are data-dependent constants. That said, even if $m$ is treated as a random variable, the assumption of a fixed decision threshold in our classification model remains valid, since the randomness in the label 0 and 1 distributions can give rise to any assumed $\theta$.

[1] Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Region-based active learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2801–2809. PMLR, 2019.

[2] Cheolhei Lee, Kaiwen Wang, Jianguo Wu, Wenjun Cai, and Xiaowei Yue. Partitioned active learning for heterogeneous systems. Journal of Computing and Information Science in Engineering, 23(4):041009, 2023.
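The following is a minimal, purely illustrative sketch of why $m$ is random across realizations of the training data under a fixed threshold $\theta$: if the $n_y$ initial samples in a label group are drawn IID with CDF $F_y$, the count falling below $\theta$ follows $\mathrm{Binomial}(n_y, F_y(\theta))$. The standard normal distribution, threshold, and sample size below are hypothetical choices for illustration and are not taken from the manuscript.

```python
# Illustrative sketch: the number m of initial samples below a fixed threshold theta
# varies across realizations of the training data, following Binomial(n_y, F_y(theta))
# when the initial samples are IID. Distribution, theta, and n_y are hypothetical.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_y, theta, trials = 200, 0.5, 10_000   # samples per label group, threshold, repetitions

# Count how many samples land in the censored region in each simulated training set
m_realizations = np.array(
    [(rng.standard_normal(n_y) < theta).sum() for _ in range(trials)]
)

p = norm.cdf(theta)  # probability a single sample falls below theta
print(f"empirical  mean/std of m: {m_realizations.mean():.2f} / {m_realizations.std():.2f}")
print(f"Binomial   mean/std of m: {n_y * p:.2f} / {np.sqrt(n_y * p * (1 - p)):.2f}")
```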
Assigned Action Editor: ~Matthew_J._Holland1
Submission Number: 3358