Track: Main Papers Track (6 to 9 pages)
Keywords: alignment, verification, high-confidence guarantees, finite-sample uncertainty, algorithmic fairness, statistical validity
Abstract: Machine learning systems are increasingly deployed in settings where failures to satisfy safety or fairness requirements can have serious consequences. In such settings, practitioners may seek assurances that learned models satisfy safety and fairness criteria despite the statistical uncertainty induced by training on finite data. We study an existing framework for obtaining high-confidence guarantees that learned models satisfy such criteria under finite-sample uncertainty, and provide a unifying proof framework that identifies the minimal statistical conditions these guarantees require. We then consider settings in which the data available for training is a mixture of trusted data and model-inferred quantities, such as proxy labels or automated evaluations. These composite data regimes arise naturally in modern alignment pipelines, but they can violate the statistical assumptions on which high-confidence guarantees rest. As a result, a model may merely appear to satisfy safety or fairness criteria while failing to do so. We characterize how this failure arises and derive sufficient conditions under which the guarantee can be recovered.
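For concreteness, the guarantee form at issue is commonly stated as follows (a sketch assuming the standard Seldonian-style formulation; the symbols $a$, $D$, $g_i$, and $\delta_i$ are illustrative notation, not necessarily the paper's): given a learning algorithm $a$, training data $D$, and constraint functions $g_1, \dots, g_k$ encoding the safety or fairness criteria, the algorithm must satisfy

$$\Pr\big(g_i(a(D)) \le 0\big) \ge 1 - \delta_i \quad \text{for all } i \in \{1, \dots, k\},$$

where the probability is taken over the random draw of $D$ and each $\delta_i$ is a user-specified tolerance. Guarantees of this form typically rely on concentration bounds computed from held-out trusted data, which is why replacing part of that data with model-inferred quantities can invalidate them.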
Submission Number: 25