Confidence-Guided Self-Training for Domain Adaptation
Abstract: Domain adaptation addresses the challenge of distributional shift between a labeled source domain and an unlabeled target domain. In gradual domain adaptation (GDA), the shift is assumed to occur through a sequence of intermediate domains, enabling smoother adaptation. A popular approach in this setting is self-training, where a model iteratively generates pseudo-labels for unlabeled data. However, pseudo-labeling errors can accumulate across rounds, especially under large shift, undermining generalization.
We develop a theoretical framework for self-training under gradual domain shift that explicitly quantifies and controls the pseudo-labeling error incurred at each round. Our first result is a modular generalization bound that decomposes the excess target risk into coverage, pseudo-label error $(\varepsilon_k)$ on the accepted set, domain shift, sample complexity, and regularization. Unlike prior bounds, our analysis separates the coverage penalty (due to rejecting inputs) from the pseudo-label error, which is controlled by confidence calibration or margin filtering and accommodates Tsybakov-type noise via margin-decay or calibration assumptions. We also provide the first theoretical justification for the percentile (quantile) thresholding schedules used in practice: such schedules directly control coverage while tightening $\varepsilon_k$, yielding a principled coverage--noise tradeoff. Under mild conditions, both the coverage and pseudo-label error terms accumulate only logarithmically across rounds, leading to improved generalization. We validate these insights across multiple GDA benchmarks, using both observed and optimal-transport-generated (OT-generated) intermediate domains.
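To fix ideas, the following is a purely schematic rendering of the decomposition described above, not the paper's stated bound; the symbols $c_k$ (per-round coverage) and $\Delta_k$ (per-round shift) are illustrative placeholders, and the actual result may weight or combine these terms differently:
$$
\mathrm{ExcessRisk}_T(\hat h_K) \;\lesssim\; \underbrace{\sum_{k=1}^{K}(1-c_k)}_{\text{coverage}} \;+\; \underbrace{\sum_{k=1}^{K}\varepsilon_k}_{\text{pseudo-label error}} \;+\; \underbrace{\sum_{k=1}^{K}\Delta_k}_{\text{domain shift}} \;+\; \underbrace{\tilde{O}\!\left(\sqrt{\tfrac{\mathrm{complexity}}{n}}\right)}_{\text{sample complexity}} \;+\; \underbrace{\lambda}_{\text{regularization}},
$$
where, under the mild conditions referenced in the abstract, the first two sums would grow only logarithmically in the number of rounds $K$ rather than linearly.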
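As a concrete illustration of the percentile-thresholding schedule, here is a minimal sketch of confidence-guided gradual self-training, assuming scikit-learn-style classifiers. The function and parameter names (`gradual_self_train`, `keep_frac`) are ours, not the paper's, and the paper's exact acceptance rule and retraining protocol may differ:

```python
# Minimal sketch (not the paper's code) of confidence-guided gradual
# self-training with a percentile (quantile) threshold on confidence.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def gradual_self_train(X_src, y_src, intermediate_domains, keep_frac=0.8):
    """Adapt a classifier across a sequence of unlabeled domains.

    keep_frac is the target coverage: each round accepts roughly the
    keep_frac most-confident points by thresholding confidences at
    their (1 - keep_frac) quantile. Illustrative only.
    """
    model = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    for X_t in intermediate_domains:               # domains ordered by shift
        probs = model.predict_proba(X_t)
        conf = probs.max(axis=1)                   # per-point confidence
        tau = np.quantile(conf, 1.0 - keep_frac)   # percentile threshold
        accepted = conf >= tau                     # accepted set; coverage ~ keep_frac
        pseudo_y = probs.argmax(axis=1)
        # Retrain from scratch on the accepted pseudo-labeled points only,
        # as in standard gradual self-training pipelines.
        model = clone(model).fit(X_t[accepted], pseudo_y[accepted])
    return model
```

One appeal of thresholding at a quantile rather than at a fixed confidence value is that the per-round coverage stays roughly constant regardless of how the confidence scale drifts across domains, which is how such schedules directly control the coverage term while tightening $\varepsilon_k$ in the coverage--noise tradeoff above.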
Submission Number: 1696