Title: REALISTIC EVALUATION OF SEMI-SUPERVISED LEARNING ALGORITHMS IN OPEN ENVIRONMENTS

Abstract: Semi-supervised learning (SSL) is a powerful paradigm for leveraging unlabeled data and has proven successful across various tasks. Conventional SSL studies typically assume close environment scenarios where labeled and unlabeled examples are independently sampled from the same distribution. However, real-world tasks often involve open environment scenarios where the data distribution, label space, and feature space could differ between labeled and unlabeled data. This inconsistency introduces robustness challenges for SSL algorithms. In this paper, we first propose several robustness metrics for SSL based on the Robustness Analysis Curve (RAC). Second, we establish a theoretical framework for studying the generalization performance and robustness of SSL algorithms in open environments. Third, we re-implement widely adopted SSL algorithms within a unified SSL toolkit and evaluate their performance on the proposed open environment SSL benchmarks, which include image, text, and tabular datasets. By investigating the empirical and theoretical results, we present insightful discussions on enhancing the robustness of SSL algorithms in open environments. The re-implementation and benchmark datasets are all publicly available. More details can be found at https://ygzwqzd.github.io/Robust-SSL-Benchmark.

Section: INTRODUCTION
Semi-supervised learning (SSL) aims to leverage unlabeled data to improve learning performance when labels are limited or expensive to obtain (Chapelle et al., 2006). By exploiting the structure of unlabeled data, SSL algorithms have repeatedly been reported to achieve performance highly competitive with purely supervised learning while saving substantial labeling cost.
All of these positive results, however, rest on the close environment assumption, where labeled and unlabeled data are sampled independently from the same distribution. Many practical applications instead involve open environments (Zhou, 2022), where the data distribution, feature space, and label space could be inconsistent between labeled and unlabeled data. SSL methods suffer severe robustness problems in open environments and can perform even worse than a simple supervised learning model that does not exploit unlabeled data (Guo & Li, 2018; Oliver et al., 2018; Guo et al., 2020a; Li et al., 2021). Such phenomena go against the expectations of SSL and limit its effectiveness in more practical tasks.
The robustness of SSL in open environments has attracted great attention in recent years, and various robust SSL algorithms have been proposed from different perspectives, such as inconsistent label space (Guo et al., 2020a; Chen et al., 2020; Yu et al., 2020; Saito et al., 2021; Guo & Li, 2022; Wei et al., 2022) and inconsistent data distribution (Guo et al., 2020b; Zhou et al., 2021; Huang et al., 2021; Jia et al., 2023a). However, these algorithms primarily address robustness from a single perspective and overlook the use of practical metrics for robustness analysis. Consequently, it remains challenging to ascertain the suitability of SSL algorithms for real-world open environments.
In this paper, we first propose several metrics that consider different aspects of performance in open environments to achieve a fair and comprehensive evaluation of SSL algorithms. Then, we establish a theoretical framework for studying the generalization performance and robustness of SSL algorithms; the results show that the generalization error in SSL consists of five components: bias caused by the learner, variance caused by data sampling, and three types of inconsistencies caused by open environments. Finally, we re-implement widely adopted SSL algorithms within a unified SSL toolkit and evaluate their performance on the proposed open environment SSL benchmarks, which include image, text, and tabular datasets. Some interesting findings include:
• Inconsistency between the feature and label space has a more detrimental impact compared to cases where there is inconsistency in data distribution.
• Traditional statistical SSL algorithms can often outperform deep SSL algorithms in terms of both performance and robustness when applied to tabular datasets. Thus, more advanced SSL algorithms on tabular datasets should be studied.
• Certain robust SSL algorithms currently proposed do not consistently exhibit enhanced robustness and may not surpass ordinary deep SSL algorithms in most scenarios. We argue that the robustness of SSL algorithms should be evaluated under more reasonable metrics.
• Inconsistency between labeled and unlabeled data does not invariably result in negative effects. On the contrary, leveraging inconsistent unlabeled examples may improve performance in some cases. Thus, it is important to study how to exploit helpful information from inconsistent unlabeled data.

Section: ROBUST SSL IN OPEN ENVIRONMENTS
2.1 NOTATIONS
SSL algorithms leverage both labeled and unlabeled data for the learning process. In close environments, all data are generated from a consistent distribution $P(x, y)$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$, on a consistent data space $\mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^{d} \times \{1, \dots, k\}$, where $d$ and $k$ respectively represent the number of features and classes. In SSL, we are given $n_l$ labeled examples $D_L = \{(x_i, y_i) \mid (x_i, y_i) \sim P(x, y)\}_{i=1}^{n_l}$ and $n_u$ unlabeled examples $D_U = \{x_i \mid x_i \sim P(x)\}_{i=1}^{n_u}$, where $P(x)$ is the marginal distribution of $P(x, y)$. The purpose is to learn a predictor with the smallest generalization error on $P(x, y)$.
In open environments, we assume that all examples originate from a global data space $\mathcal{X}^* \times \mathcal{Y}^* \subseteq \mathbb{R}^{d^*} \times \{1, \dots, k^*\}$, where $d^*$ and $k^*$ respectively represent the number of features and classes that appear throughout the entire learning process. There exists an invariant data distribution $P(y^* \mid x^*)$ for $x^* \in \mathcal{X}^*$ and $y^* \in \mathcal{Y}^*$. We denote the degree of inconsistency of data distribution, feature space, or label space between unlabeled and labeled data as $t \in [0, 1]$. A higher $t$ indicates a greater inconsistency. For any $t$, there is an inconsistent distribution denoted as $P_t(x^*, y^*)$. However, we can only obtain a projected distribution $P_t(x_t, y_t)$ in a subspace $\mathcal{X}_t \times \mathcal{Y}_t$, where $\mathcal{X}_t \subset \mathcal{X}^*$ and $\mathcal{Y}_t \subset \mathcal{Y}^*$.
We denote by $\theta(t)$ the function that maps the inconsistency level $t$ to the ratio of inconsistent examples in the unlabeled dataset. For robust SSL with any $t$, we are given $n_l$ labeled examples from $P_0(x, y)$, $(1 - \theta(t)) \cdot n_u$ consistent unlabeled examples from $P_0(x)$, and $\theta(t) \cdot n_u$ inconsistent unlabeled examples from $P_t(x)$.
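The composition of the unlabeled set described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual data pipeline: the function names, the pools of pre-drawn examples, and the choice $\theta(t) = t$ are all assumptions made for the example.

```python
import random


def theta_linear(t):
    """Hypothetical choice of theta(t): the inconsistent ratio simply equals t."""
    return t


def build_unlabeled_set(consistent_pool, inconsistent_pool, n_u, t,
                        theta=theta_linear, seed=0):
    """Draw (1 - theta(t)) * n_u consistent and theta(t) * n_u inconsistent examples.

    consistent_pool stands in for draws from P_0(x), inconsistent_pool for
    draws from P_t(x). Consistent examples come first in the returned list.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = random.Random(seed)
    n_inconsistent = round(theta(t) * n_u)
    n_consistent = n_u - n_inconsistent
    consistent = rng.sample(consistent_pool, n_consistent)
    inconsistent = rng.sample(inconsistent_pool, n_inconsistent)
    return consistent + inconsistent
```

Placing the inconsistent examples last mirrors the indexing convention in which the final $\theta(t) \cdot n_u$ unlabeled examples are the inconsistent ones.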

Section: PERFORMANCE METRICS
To achieve a fair and comprehensive evaluation, we introduce a set of performance metrics tailored for robust SSL in open environments. These metrics start from a function $Acc(t)$, which gives the model accuracy as a function of the inconsistency level $t$. This function is used to construct the Robustness Analysis Curve (RAC), which maps the inconsistency level $t$ to the corresponding accuracy $Acc(t)$. Unlike traditional SSL evaluations that focus solely on $Acc(0)$, or robust SSL evaluations that consider only a specific $Acc(t)$, our proposed metrics are derived from the RAC and provide a more comprehensive evaluation of SSL algorithms. These metrics include Area Under the RAC Curve (AUC), which captures the overall robustness of the SSL algorithm; Expected Accuracy (EA), which describes the average performance across all inconsistency levels; Worst-Case Accuracy (WA), which identifies the lowest accuracy level, representing the worst-case scenario; Expected Variation Magnitude (EVM), which captures the average magnitude of performance variation; Variation Stability (VS), which quantifies the stability of the performance variation; and Robust Correlation Coefficient (RCC), which captures the overall trend of performance variation. The detailed formulation of these metrics is presented in Table 1. By considering these diverse metrics, we can provide a comprehensive evaluation of the robustness of SSL algorithms, capturing different aspects of their performance. Moreover, these metrics are not limited to accuracy and can be extended to other performance measures by replacing the function $Acc(t)$.
Table 1: Formulation of the proposed metrics.
Area Under the Curve (AUC): $AUC(Acc) = \int_0^1 Acc(t)\,dt$
Expected Accuracy (EA): $EA(P_T, Acc) = \langle P_T, Acc \rangle = \int_0^1 P_T(t)\,Acc(t)\,dt$
Worst-Case Accuracy (WA): $WA(Acc) = \min_{t \in [0,1]} Acc(t)$
Expected Variation Magnitude (EVM): $EVM(Acc) = \int_0^1 |Acc'(t)|\,dt$
Variation Stability (VS): $VS(Acc) = \int_0^1 \left[Acc'(t) - \int_0^1 Acc'(t)\,dt\right]^2 dt$
Robust Correlation Coefficient (RCC): $RCC(Acc) = \dfrac{\int_0^1 Acc(t) \cdot t\,dt - \int_0^1 Acc(t)\,dt \int_0^1 t\,dt}{\sqrt{\int_0^1 t^2\,dt - \left(\int_0^1 t\,dt\right)^2} \cdot \sqrt{\int_0^1 Acc^2(t)\,dt - \left(\int_0^1 Acc(t)\,dt\right)^2}}$
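The metrics above can be evaluated in closed form when the RAC is a piecewise-linear curve through a handful of sampled points. The sketch below is one possible way to do that, not the toolkit's implementation: the function name `rac_metrics`, the restriction to a uniform $P_T$ (under which EA coincides with AUC), and the value returned for a flat curve are assumptions of this example.

```python
import math


def rac_metrics(ts, accs):
    """Compute RAC-based metrics for a piecewise-linear curve through (ts[i], accs[i]).

    Assumptions of this sketch: ts is strictly increasing with ts[0] == 0 and
    ts[-1] == 1, and the inconsistency level follows a uniform P_T on [0, 1].
    """
    if len(ts) != len(accs) or len(ts) < 2:
        raise ValueError("need at least two matching (t, accuracy) samples")
    if ts[0] != 0 or ts[-1] != 1:
        raise ValueError("samples must cover the interval [0, 1]")

    int_acc = int_acc_sq = int_acc_t = evm = 0.0
    segments = []  # (length, slope) of each linear piece
    for a, b, ya, yb in zip(ts, ts[1:], accs, accs[1:]):
        h = b - a
        if h <= 0:
            raise ValueError("ts must be strictly increasing")
        int_acc += h * (ya + yb) / 2                          # integral of Acc
        int_acc_sq += h * (ya * ya + ya * yb + yb * yb) / 3   # integral of Acc^2
        int_acc_t += h * (2 * a * ya + a * yb + b * ya + 2 * b * yb) / 6  # integral of Acc * t
        evm += abs(yb - ya)                                   # integral of |Acc'| on this piece
        segments.append((h, (yb - ya) / h))

    mean_slope = accs[-1] - accs[0]                           # integral of Acc' over [0, 1]
    vs = sum(h * (s - mean_slope) ** 2 for h, s in segments)

    cov = int_acc_t - int_acc * 0.5                           # E(t) = 1/2 for uniform t
    var_acc = int_acc_sq - int_acc ** 2
    var_t = 1.0 / 3.0 - 0.25                                  # E(t^2) - E(t)^2 = 1/12
    # RCC is undefined for a flat curve; returning 0.0 is a convention of this sketch.
    rcc = cov / math.sqrt(var_acc * var_t) if var_acc > 1e-12 else 0.0

    return {"AUC": int_acc, "EA": int_acc, "WA": min(accs),
            "EVM": evm, "VS": vs, "RCC": rcc}
```

Because the curve is linear on each piece, the minimum is attained at a sampled point and each integral reduces to a sum over segments, so no numerical quadrature is needed.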

Section: DEFINITION OF ROBUSTNESS SSL IN OPEN-ENVIRONMENTS
Based on the proposed metrics, we propose a formal definition of robust SSL in open environments, including expected robustness and worst-case robustness.
Definition 1. An SSL algorithm $A$ returns a model $f_t \in \mathcal{F}$ using labeled data $D_L$ and unlabeled data $D^t_U$ for any inconsistency level $t$, where $\mathcal{F}$ is the hypothesis space of $A$. Let $Acc(t)$ denote the accuracy of $f_t$ on the test data. When the inconsistency level $t$ follows a distribution $P_T(t)$, if there exists $\sigma_E$ such that $|Acc(0) - EA(P_T, Acc)| \le \sigma_E$, we say that $A$ exhibits $\sigma_E$-expected algorithmic robustness. If there exists $\sigma_W$ such that $|Acc(0) - WA(Acc)| \le \sigma_W$, we say that $A$ exhibits $\sigma_W$-worst-case algorithmic robustness.
In open environments, the SSL algorithm is employed to generate models at different inconsistency levels $t$. If the algorithm consistently delivers satisfactory performance across a range of $t$, we deem it to exhibit expected algorithmic robustness. If the algorithm maintains acceptable performance in the worst case, we consider it to demonstrate worst-case algorithmic robustness.
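For a sampled RAC, the tightest constants in Definition 1 are simply the two gaps $|Acc(0) - EA|$ and $|Acc(0) - WA|$. A minimal sketch, assuming a piecewise-linear curve through the samples and a uniform $P_T$ on $[0, 1]$ (the function name is hypothetical):

```python
def robustness_gaps(ts, accs):
    """Return (sigma_E, sigma_W), the smallest constants satisfying Definition 1.

    Assumes ts starts at 0 and ends at 1, a piecewise-linear RAC through the
    samples, and a uniform distribution P_T over the inconsistency level.
    """
    # Expected accuracy under uniform P_T: trapezoid rule, exact for linear pieces.
    ea = sum((b - a) * (ya + yb) / 2
             for a, b, ya, yb in zip(ts, ts[1:], accs, accs[1:]))
    wa = min(accs)  # a piecewise-linear curve attains its minimum at a sample
    return abs(accs[0] - ea), abs(accs[0] - wa)
```

An algorithm is then $\sigma$-robust in the expected (or worst-case) sense for any $\sigma$ at least as large as the first (or second) returned value.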

Section: THEORETICAL STUDIES ON ROBUST SSL
To analyze how to improve the robustness of algorithms in open environments, we establish a theoretical framework for studying the generalization performance and robustness of SSL algorithms in open environments. Specifically, we first define the projection operations $\Pi_{\mathcal{X}}$ and $\Pi_{\mathcal{Y}}$, which project data distributions originating from different feature and label spaces onto the same spaces as the labeled data. Second, we formally define two types of inconsistencies: feature space inconsistency $Disc_F$ and label space inconsistency $Disc_L$, both of which represent additional generalization errors incurred during the space projection process. Combined with the distribution inconsistency within the same data space, $Disc_D$, defined in Jia et al. (2023a), these constitute three types of inconsistencies in total. Finally, we analyze the SSL process in an open environment and conclude that the generalization error in SSL consists of five components: bias caused by the learner, variance caused by data sampling, and three types of inconsistencies caused by open environments.
Theorem 3.1. For any target predictor $f \in \mathcal{F}$, pseudo-label predictor $h \in \mathcal{H}$, $0 \le \delta_1 \le 1$, $0 \le \delta_2 \le 1$ and $0 \le \delta_3 \le 1$, with probability at least $(1 - \delta_1)(1 - \delta_2)(1 - \delta_3)$:
$$
\begin{aligned}
E(f, P_0(x, y) \mid h, w, map_{X_t \to X_0}, D_L, D_{U_t}) \le{} & \frac{n_l}{n_l + n^w_{u_t}} \hat{E}(f, D_L) + \frac{n^w_{u_t}}{n_l + n^w_{u_t}} \hat{E}(f, \tilde{D}^w_{U_t}) + var(\mathcal{F}, n_l + n^w_{u_t}, k_0, \delta_1) \\
& + \frac{n^w_{u_t}}{n_l + n^w_{u_t}} \Big( \theta^w(t)\, Disc_L(P^w_t(x^*), Y_0) \\
& \qquad + \theta^w(t)\, Disc_F(\Pi_{Y_0}[P^w_t(x)], \Pi_{Y_0}[P^w_t(x^*)], map_{X_t \to X_0}, f) \\
& \qquad + \theta^w(t)\, Disc_D(\Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], P_0(x, y), f) \Big) \\
& + \frac{n^w_{u_t}}{n_l + n^w_{u_t}} \Big( \hat{E}(h, D_L) + var(\mathcal{H}, n_l, k_0, \delta_2) + var(\mathcal{H}, n^w_{u_t}, k_0, \delta_3) \\
& \qquad + \theta^w(t)\, Disc_L(P^w_t(x^*), Y_0) \\
& \qquad + \theta^w(t)\, Disc_F(\Pi_{Y_0}[P^w_t(x)], \Pi_{Y_0}[P^w_t(x^*)], map_{X_t \to X_0}, h) \\
& \qquad + \theta^w(t)\, Disc_D(\Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], P_0(x, y), h) \Big)
\end{aligned}
\tag{1}
$$
where $\hat{E}(f, \tilde{D}^w_{U_t})$ is the weighted disagreement rate between the noisy pseudo-labels and the predictions of $f$ on the weighted unlabeled dataset $\tilde{D}^w_{U_t}$.
The theoretical analysis leads to the following conclusions. Inconsistencies in data distribution, feature space, and label space can all harm the generalization performance of the model. Alleviating data distribution inconsistency primarily depends on aligning the distributions based on the existing predictor, thereby optimizing the term $Disc_D$. Alleviating feature space inconsistency primarily depends on the feature mapping function, which requires the learning algorithm to accurately infer unobserved features from the observed features, thereby optimizing the term $Disc_F$. Alleviating label space inconsistency primarily relies on sample selection and weighting functions, which require robust SSL algorithms to accurately detect unfavorable examples and mitigate their negative impact, thereby optimizing the term $Disc_L$. The detailed theoretical analysis and proof are given in Appendix A.2.

Section: EXPERIMENTS


Section: EXPERIMENTAL SETUP
In our experiments, we evaluate both statistical SSL and deep SSL algorithms. For statistical SSL, we select 6 classical algorithms including the Semi-supervised Gaussian Mixture Model (SSGMM) (Shahshahani & Landgrebe, 1994) from the generative SSL algorithms, TSVM (Joachims et al., 1999) from the semi-supervised support vector machine algorithms, Label Propagation (Zhu & Ghahramani, 2003) and Label Spreading (Zhou et al., 2003) from the graph-based SSL algorithms, Tri-Training (Zhou & Li, 2005) from the disagreement-based SSL algorithms and Assemble (Bennett et al., 2002) from the ensemble-based SSL algorithms. For deep SSL, we select 10 representative algorithms: Pseudo Label (Lee, 2013), Pi-Model (Laine & Aila, 2017), Mean Teacher (Tarvainen & Valpola, 2017), ICT (Verma et al., 2022), VAT (Miyato et al., 2018), UDA (Xie et al., 2020), FixMatch (Sohn et al., 2020), FlexMatch (Zhang et al., 2021), FreeMatch (Wang et al., 2022b) and SoftMatch (Chen et al., 2023). We also considered 4 robust SSL algorithms: UASD (Chen et al., 2020), CAFA (Huang et al., 2021), MTCF (Yu et al., 2020), and Fix-A-Step (Huang et al., 2023).
We conduct experiments on various types of datasets, including 3 tabular datasets: iris, wine, letter; 3 image datasets: Image-CLEF (Caputo et al., 2014), CIFAR-10, and CIFAR-100; 3 text datasets: Amazon reviews (McAuley & Leskovec, 2013), IMDB movie reviews (Maas et al., 2011), and agnews (Zhang et al., 2015).
For all the experiments, we use mainstream supervised learning algorithms as baselines. For tabular data, we use XGBoost (Chen & Guestrin, 2016) as the benchmark for statistical learning algorithms and adopt FT-Transformer (Wang et al., 2022a) as the baseline and backbone for deep learning algorithms. For visual data, we use ResNet50 (He et al., 2016) as the baseline and backbone. For text data, we use the RoBERTa (Liu et al., 2019) model as the benchmark and backbone. All SSL algorithms are re-implemented based on the LAMDA-SSL toolkit (Jia et al., 2023b).
We plot the RAC and perform statistical analysis on various evaluation metrics for different methods. To plot the RAC, we sample six values of $t$, namely 0, 0.2, 0.4, 0.6, 0.8, and 1, in all open environments. To ensure reliability, we conduct three experiments for each sampling point with seeds 0, 1, and 2, and use their average to plot the curve. Linear interpolation is performed between adjacent sampling points. More detailed settings of the experiments are presented in Appendix A.5. For image and text datasets, we adopt natural distribution shift to simulate the inconsistent distribution between labeled and unlabeled datasets.
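The RAC construction described in this setup (six sampled values of $t$, three seeds per point, averaging, and linear interpolation) can be sketched as below. The function names and the `runs[seed][i]` layout are assumptions of this example, not the interface of the LAMDA-SSL toolkit.

```python
from statistics import mean

# The six inconsistency levels sampled for every open environment.
T_GRID = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def average_over_seeds(runs):
    """Average accuracies across seeds.

    runs[s][i] is the accuracy obtained with seed s at T_GRID[i]
    (a hypothetical layout chosen for this sketch).
    """
    return [mean(column) for column in zip(*runs)]


def rac(t, ts, accs):
    """Evaluate the RAC at t by linear interpolation between adjacent samples."""
    for a, b, ya, yb in zip(ts, ts[1:], accs, accs[1:]):
        if a <= t <= b:
            return ya + (yb - ya) * (t - a) / (b - a)
    raise ValueError("t lies outside the sampled range")
```

With three seeds, `average_over_seeds` receives three lists of six accuracies each, and `rac(t, T_GRID, averaged)` then gives the curve value at any $t \in [0, 1]$.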

Section: SSL UNDER INCONSISTENT FEATURE SPACES
To simulate an inconsistent feature space, we randomly mask features of the tabular data and fill each masked portion with the mean value of the labeled data. For image datasets, we adopt the CIFAR-10 and CIFAR-100 datasets and convert the images to grayscale, resulting in the loss of two color channels. For text data, we adopt the agnews (Zhang et al., 2015) dataset and employ text truncation, filling the truncated portions with "<pad>". The experimental results are reported in Table 5, Table 6, and Table 7.
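The tabular masking and text truncation steps described above can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, and how the masked fraction or the kept length is tied to the inconsistency level $t$ is not specified here, so both are left as free parameters.

```python
import random


def column_means(rows):
    """Per-column mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]


def choose_masked_columns(n_features, fraction, seed=0):
    """Randomly pick which feature columns to mask (fraction is a free parameter)."""
    rng = random.Random(seed)
    return set(rng.sample(range(n_features), round(fraction * n_features)))


def mask_features(unlabeled_rows, labeled_rows, masked_columns):
    """Replace masked columns of the unlabeled rows with the labeled-data column means."""
    means = column_means(labeled_rows)
    return [
        [means[j] if j in masked_columns else value for j, value in enumerate(row)]
        for row in unlabeled_rows
    ]


def truncate_tokens(tokens, keep, pad="<pad>"):
    """Keep the first `keep` tokens and fill the truncated tail with the pad token."""
    keep = max(0, min(keep, len(tokens)))
    return tokens[:keep] + [pad] * (len(tokens) - keep)
```

Filling with labeled-data means keeps the input dimensionality unchanged, so the same backbone can consume both the original and the masked examples.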

Section: SSL UNDER INCONSISTENT LABEL SPACES
The inconsistent label space between labeled and unlabeled data is the most widely studied problem in robust SSL. Following previous studies (Guo et al., 2020a; Oliver et al., 2018), we construct an inconsistent label space by randomly selecting some classes and discarding the labeled data belonging to these classes for tabular, image, and text datasets. The experimental results are reported in Table 8, Table 9, and Table 10.
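The label-space construction described above, in which some classes are selected at random and their labeled examples are discarded, can be sketched as below. The function names are hypothetical, and the number of dropped classes is left as a parameter because its relation to $t$ is not stated in this section.

```python
import random


def choose_dropped_classes(classes, n_drop, seed=0):
    """Randomly select the classes whose labeled examples will be discarded."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(classes), n_drop))


def drop_classes_from_labeled(labeled, dropped):
    """Remove every labeled pair (x, y) whose class y was selected for dropping."""
    return [(x, y) for x, y in labeled if y not in dropped]
```

The unlabeled set is left untouched, so it still contains examples of the dropped classes, which is what makes the two label spaces inconsistent.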

Section: EXPERIMENTAL RESULTS ANALYSIS
Based on the experimental results, we further conduct a comprehensive analysis from the perspectives of environments, algorithms, and performance metrics.
Environments. We calculate the average expected robustness (under a uniform distribution $P_T$) and the average worst-case robustness of SSL algorithms under different inconsistency settings. The results are reported in Table 11. According to the definition, lower values of $\sigma_E$ and $\sigma_W$ imply stronger robustness. The results show that SSL algorithms are much less robust under feature space or label space inconsistency than under data distribution inconsistency. In fact, feature space and label space inconsistencies can both be regarded as a greater degree of data distribution inconsistency. The former can be viewed as a distribution shift where all missing features are assumed to take default values, while the latter can be seen as a distribution shift where the probability of every missing class is 0. This suggests that more attention needs to be paid to feature and label inconsistency between labeled and unlabeled data.
Algorithms. We compare the robustness of different algorithms in various environments and report the results in Table 12. We find that SSGMM (Shahshahani & Landgrebe, 1994) shows the poorest robustness among all algorithms, mainly because it relies on an assumption about the data distribution. Among the other statistical SSL algorithms, Assemble (Bennett et al., 2002) demonstrates the best performance and remarkable robustness, showcasing the advantage of ensemble learning. For deep SSL algorithms, we find that the reported SOTA methods, such as FixMatch, FlexMatch, SoftMatch, and FreeMatch, suffer severe robustness problems. One possible reason is that these methods adopt a threshold to select pseudo-labels for unlabeled data, which might overly centralize the distribution of unlabeled data. In comparison, UDA (Xie et al., 2020) sets thresholds for both labeled and unlabeled data, mitigating the inconsistency induced by sample selection to a large extent and significantly improving robustness over FixMatch. For robust deep SSL algorithms, we find that UASD and CAFA achieve good robustness, but for other methods, they achieve lower robustness compared with
Performance Metrics. We also analyze the performance of different SSL algorithms under different metrics. First, we find that $Acc(0)$ is not consistent with the other metrics: SSL algorithms that have a high $Acc(0)$ may perform even worse than supervised learning under the proposed robustness metrics. Second, we find that the EVM and VS metrics exhibit a high level of consistency, despite their different definitions. This indicates that the performance of a robust SSL algorithm is less sensitive to changes in the inconsistency level, and that the direction of its performance change is more stable and predictable. A non-robust SSL algorithm, on the other hand, not only exhibits larger variations in performance, but its performance changes are also more unstable and show greater randomness. Using such an algorithm in an open environment is extremely unsafe, as we cannot estimate its worst-case performance. Moreover, we find that non-identically distributed unlabeled data is not always harmful: in some cases, exploiting more unlabeled data from inconsistent distributions may improve performance. This inspires us to study SSL algorithms that fully exploit helpful information from inconsistent unlabeled data.

Section: CONCLUSION
Research on robust SSL is an essential step toward the practical application of SSL. This paper provides a reshaped perspective on problem definition, performance metrics, theoretical frameworks, and evaluation on benchmark datasets. Our results provide evidence that SSL is still not robust in open environments, especially when the feature and label spaces are inconsistent between labeled and unlabeled data. These problems are often overlooked in previous studies and deserve more effort. The subsequent details about this work will be continuously supplemented and improved. We hope that our work can help push the successes of SSL towards the real world.

Section: A APPENDIX
A.1 LIMITATIONS
Although this paper has modeled and constructed complex open environments, the complexity of the real world may exceed that of the datasets we have constructed, and it may be difficult to evaluate, analyze, and explain with limited evaluation metrics and theoretical frameworks. For example, when multiple inconsistencies coexist and their degrees vary simultaneously, a high-dimensional vector $\mathbf{t}$, rather than a one-dimensional variable $t$, is required to represent the combination of multiple inconsistencies. While the evaluation methods and metrics we employ can be readily extended to high-dimensional cases, this leads to an exponential increase in computational resource consumption, on the order of $\Theta(s^{\dim(\mathbf{t})})$, where $s$ represents the number of values sampled along each dimension. Finding ways to reduce the evaluation complexity in scenarios involving high-dimensional inconsistencies is an urgent and unresolved issue.
A.2 THEORETICAL FRAMEWORK ON ROBUST SSL
The Natarajan dimension (Natarajan, 1989) is an extension of the Vapnik-Chervonenkis dimension (Vapnik & Chervonenkis, 1971) to multi-class classification problems. We denote by $Ndim(\mathcal{H})$ the Natarajan dimension of a hypothesis space $\mathcal{H}$. To simplify the expression, we denote the variance term associated with the hypothesis space complexity in the generalization error, with the number of examples $n$, the number of classes $k$, and the probability $\delta$, as:
$$
var(\mathcal{H}, n, k, \delta) = \sqrt{\frac{16\, Ndim(\mathcal{H}) \ln\left(\sqrt{2n}\, k\right) + 8 \ln \frac{2}{\delta}}{n}}
\tag{2}
$$
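To give a feel for how the variance term behaves, the sketch below evaluates it numerically. It assumes that Eq. (2) is read in the square-root form $\sqrt{(16\, Ndim(\mathcal{H}) \ln(\sqrt{2n}\, k) + 8 \ln(2/\delta)) / n}$; the function name and the argument checks are choices made for this example.

```python
import math


def var_term(ndim, n, k, delta):
    """Variance term var(H, n, k, delta), assuming the square-root reading of Eq. (2).

    ndim  : Natarajan dimension of the hypothesis space
    n     : number of examples
    k     : number of classes
    delta : failure probability, 0 < delta <= 1
    """
    if n <= 0 or k < 1 or not 0 < delta <= 1:
        raise ValueError("require n > 0, k >= 1 and 0 < delta <= 1")
    numerator = 16 * ndim * math.log(math.sqrt(2 * n) * k) + 8 * math.log(2 / delta)
    return math.sqrt(numerator / n)
```

Under this reading the term shrinks roughly like $\sqrt{\ln n / n}$ as more examples are used and grows with the complexity of the hypothesis space and the number of classes.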
In an open environment, we assume that all data originate from a global space $\mathcal{X}^* \times \mathcal{Y}^* \subseteq \mathbb{R}^{d^*} \times \{0, \dots, k^* - 1\}$. There exists an invariant data distribution $P(y^* \mid x^*)$ for $x^* \in \mathcal{X}^*$ and $y^* \in \mathcal{Y}^*$. For any $t$, there is an inconsistent distribution denoted as $P_t(x^*, y^*)$. According to the law of total probability, $P_t(x^*) = \sum_{y_i \in \mathcal{Y}^*} P_t(y_i) P_t(x^* \mid y_i)$ for all $x^* \in \mathcal{X}^*$.
However, we can only obtain a projected distribution $P_t(x_t, y_t)$ in a subspace $\mathcal{X}_t \times \mathcal{Y}_t$ from the global distribution $P_t(x^*, y^*)$ in the global space $\mathcal{X}^* \times \mathcal{Y}^*$, where $\mathcal{X}_t \subset \mathcal{X}^*$ and $\mathcal{Y}_t \subset \mathcal{Y}^*$. We denote $\bar{\mathcal{X}}_t = \mathcal{X}^* \setminus \mathcal{X}_t$ as the unobserved features and $\bar{\mathcal{Y}}_t = \mathcal{Y}^* \setminus \mathcal{Y}_t$ as the unobserved classes when the inconsistency rate is $t$. In this case, the observed inconsistent data follow the distribution
$$
P_t(x) = \frac{\sum_{y_i \in \mathcal{Y}_t} P_t(y_i) P_t(x^* \mid y_i)}{P_t(\bar{x} \mid x)},
$$
where $x \in \mathcal{X}_t \subseteq \mathbb{R}^{d_t}$ and $\bar{x} \in \bar{\mathcal{X}}_t \subseteq \mathbb{R}^{d^* - d_t}$, since $P_t(y_i) = 0$ for all $y_i \in \bar{\mathcal{Y}}_t$ and $P_t(x^*) = P_t(x, \bar{x}) = P_t(x) P_t(\bar{x} \mid x)$.
In SSL, labeled data are used to train a pseudo-label predictor $h \in \mathcal{H} : \mathcal{X}_0 \to \mathcal{Y}_0$, where $\mathcal{H}$ is the hypothesis space of the pseudo-label predictor, in order to obtain an unlabeled dataset with pseudo-labels, denoted as $\tilde{D}_U = \{(x^U_1, \tilde{y}^U_1), (x^U_2, \tilde{y}^U_2), \dots, (x^U_{n_u}, \tilde{y}^U_{n_u})\}$. There is also a function $w : \mathcal{X}_0 \to \mathbb{R}$ used for sample weighting or selection. We denote the weighted unlabeled dataset without pseudo-labels as $D^w_U = w(D_U)$ and the weighted unlabeled dataset with noisy pseudo-labels as $\tilde{D}^w_U = w(\tilde{D}_U)$. We denote the sum of the weights of all unlabeled examples as $n^w_u = \sum_{x \in D_U} w(x)$. We additionally denote the proportion of inconsistent examples in the unlabeled dataset after sample weighting as
$$
\theta^w(t) = \frac{\sum_{i=(1-\theta(t)) n_u + 1}^{n_u} w(x_i)}{\sum_{i=1}^{n_u} w(x_i)}.
$$
We define a weighted distribution as the pointwise product of a distribution function and a weighting function, such as $P^w(x) = w(x) P(x)$ and $P^w(x, y) = w(x) P(x, y)$.
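The weighted inconsistency ratio $\theta^w(t)$ defined above can be computed directly from per-example weights. The sketch below assumes, as in the index range of the definition, that the inconsistent examples are the last $\theta(t) \cdot n_u$ entries of the unlabeled set; the function name is hypothetical.

```python
def weighted_inconsistent_ratio(weights, n_inconsistent):
    """Share of the total weight that falls on the inconsistent unlabeled examples.

    weights        : w(x_i) for all n_u unlabeled examples, consistent ones first
    n_inconsistent : theta(t) * n_u, the number of trailing inconsistent examples
    """
    if not 0 <= n_inconsistent <= len(weights):
        raise ValueError("n_inconsistent must be between 0 and len(weights)")
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero, so the ratio is undefined")
    first_inconsistent = len(weights) - n_inconsistent
    return sum(weights[first_inconsistent:]) / total
```

A weighting function that assigns small weights to the inconsistent examples drives this ratio toward 0, which shrinks every $\theta^w(t)$-scaled inconsistency term in the bound.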

SSL algorithms use both $D_L$ and $\tilde{D}^w_U$ for training a target predictor $f \in \mathcal{F}$, where $\mathcal{F}$ is the hypothesis space of the target predictor. Because the feature spaces differ, a feature mapping function $map_{X_t \to X_0} : \mathcal{X}_t \to \mathcal{X}_0$ is also required to map the input into the domain of definition of the model.
For the distribution $P_t(x^*)$, we define its projection onto any target label space $\mathcal{Y}'$ as $\Pi_{\mathcal{Y}'}[P_t(x^*)] = \sum_{y_i \in \mathcal{Y}_t} \mathbb{I}(y_i \in \mathcal{Y}') P_t(y_i) P_t(x^* \mid y_i)$. The joint distribution after projecting onto the label space $\mathcal{Y}'$ is $\Pi_{\mathcal{Y}'}[P_t(x^*, y)] = \left(\sum_{y_i \in \mathcal{Y}_t} \mathbb{I}(y_i \in \mathcal{Y}') P_t(y_i) P_t(x^* \mid y_i)\right) P(y \mid x^*)$.
For the distribution $P_t(x^*)$, we define its projection onto any feature space $\mathcal{X}'$ as $\Pi_{\mathcal{X}'}[P_t(x^*)] = P_t(map_{X^* \to X'}(x^*))$, $x^* \sim P_t(x^*)$. The joint distribution after projecting onto the feature space $\mathcal{X}'$ is $\Pi_{\mathcal{X}'}[P_t(x^*, y)] = P_t(map_{X^* \to X'}(x^*), y)$, $(x^*, y) \sim P_t(x^*, y)$.
1. VS: According to the definition, the metric VS measures the stability of model performance changes, that is, whether the derivative of $Acc(t)$, $Acc'(t)$, fluctuates significantly. We describe the magnitude of fluctuations using the variance, and thus VS is defined as the variance of $Acc'(t)$. We denote the expectation of a variable $X$ as $E(X)$ and its standard deviation as $\sigma(X)$.
$$
VS(Acc) = \sigma^2(Acc') = \int_0^1 \left[Acc'(t) - E(Acc')\right]^2 dt = \int_0^1 \left[Acc'(t) - \int_0^1 Acc'(t)\,dt\right]^2 dt
\tag{8}
$$
2. RCC: According to the definition, the metric RCC measures the correlation between model performance and the inconsistency factor $t$. The Pearson correlation coefficient effectively quantifies the correlation between two variables. We denote the covariance between the variables $X$ and $Y$ as $COV(X, Y)$ and the Pearson correlation coefficient between them as $\rho(X, Y)$.
$$
\rho(X, Y) = \frac{COV(X, Y)}{\sigma(X)\sigma(Y)} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}
\tag{9}
$$
We can directly apply the formula for the Pearson correlation coefficient, using $\int_0^1 t\,dt = \frac{1}{2}$ and $\int_0^1 t^2\,dt = \frac{1}{3}$ in the last step.
$$
\begin{aligned}
RCC(Acc) = \rho(Acc, t) &= \frac{E(Acc \cdot t) - E(Acc)E(t)}{\sqrt{E(Acc^2) - E^2(Acc)}\,\sqrt{E(t^2) - E^2(t)}} \\
&= \frac{\int_0^1 Acc(t) \cdot t\,dt - \int_0^1 Acc(t)\,dt \int_0^1 t\,dt}{\sqrt{\int_0^1 Acc^2(t)\,dt - \left(\int_0^1 Acc(t)\,dt\right)^2} \cdot \sqrt{\int_0^1 t^2\,dt - \left(\int_0^1 t\,dt\right)^2}} \\
&= \frac{\int_0^1 Acc(t) \cdot t\,dt - \frac{1}{2}\int_0^1 Acc(t)\,dt}{\sqrt{\int_0^1 Acc^2(t)\,dt - \left(\int_0^1 Acc(t)\,dt\right)^2} \cdot \sqrt{\frac{1}{12}}}
\end{aligned}
\tag{10}
$$
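As a sanity check on the Pearson form of RCC, consider a hypothetical linear curve $Acc(t) = a + bt$ with $b \neq 0$ (an illustrative assumption, not a result from the experiments), with $t$ uniform on $[0, 1]$ so that $E(t) = \frac{1}{2}$ and $E(t^2) = \frac{1}{3}$:

$$
\begin{aligned}
COV(Acc, t) &= E(Acc \cdot t) - E(Acc)E(t) = \left(\tfrac{a}{2} + \tfrac{b}{3}\right) - \left(a + \tfrac{b}{2}\right)\tfrac{1}{2} = \tfrac{b}{12}, \\
\sigma^2(Acc) &= E(Acc^2) - E^2(Acc) = \left(a^2 + ab + \tfrac{b^2}{3}\right) - \left(a + \tfrac{b}{2}\right)^2 = \tfrac{b^2}{12}, \\
\sigma^2(t) &= \tfrac{1}{3} - \tfrac{1}{4} = \tfrac{1}{12}, \\
RCC(Acc) &= \frac{b/12}{\sqrt{b^2/12}\,\sqrt{1/12}} = \frac{b}{|b|}.
\end{aligned}
$$

A steadily degrading curve ($b < 0$) therefore has RCC equal to $-1$, and a steadily improving one has RCC equal to $+1$. For the same curve, $Acc'(t) = b$ is constant, so $VS(Acc) = 0$ and $EVM(Acc) = |b|$.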
A.3.2 PROOF OF THEORETICAL RESULTS

Section: Proof of Theorem 1
In the case of using only $n_l$ labeled examples for supervised learning, for any $h \in \mathcal{H}$ and $0 \le \delta_1 \le 1$, with probability at least $1 - \delta_1$:
$$
E(h, P_0(x, y)) \le \hat{E}(h, D_L) + var(\mathcal{H}, n_l, k, \delta_1)
\tag{11}
$$
where $\hat{E}(h, D_L)$ is the empirical error of $h$ on the dataset $D_L$ and $E(h, P_0(x, y))$ is the generalization error of $h$ on the distribution of labeled data $P_0(x, y)$.
In SSL, when all examples are from the same distribution, for any $t$, the dataset $D_{U_t}$ with $n_u$ examples is drawn from the same distribution $P_t(x, y) = P_0(x, y)$. For any $h \in \mathcal{H}$ and $0 \le \delta_2 \le 1$, with probability at least $1 - \delta_2$:
$$
\hat{E}(h, D_{U_t}) \le E(h, P_t(x, y)) + var(\mathcal{H}, n_u, k, \delta_2) = E(h, P_0(x, y)) + var(\mathcal{H}, n_u, k, \delta_2)
\tag{12}
$$
According to eqs. (11) and (12), for any pseudo-label predictor $h \in \mathcal{H}$, $0 \le \delta_1 \le 1$ and $0 \le \delta_2 \le 1$, with probability at least $(1 - \delta_1)(1 - \delta_2)$:
$$
\hat{E}(h, D_{U_t}) \le \hat{E}(h, D_L) + var(\mathcal{H}, n_l, k, \delta_1) + var(\mathcal{H}, n_u, k, \delta_2)
\tag{13}
$$
When labeled data and unlabeled data are from different distributions, for any $h \in \mathcal{H}$:
$$
\begin{aligned}
E(h, P_t(x, y)) &\le E(h, P_0(x, y)) + \left| P_{(x, y) \sim P_0(x, y)}(h(x) \ne y) - P_{(x, y) \sim P_t(x, y)}(h(x) \ne y) \right| \\
&= E(h, P_0(x, y)) + Disc_D(P_0(x, y), P_t(x, y), h)
\end{aligned}
\tag{14}
$$
According to eqs. (11), (12) and (14), for any pseudo-label predictor $h \in \mathcal{H}$, $0 \le \delta_1 \le 1$ and $0 \le \delta_2 \le 1$, with probability at least $(1 - \delta_1)(1 - \delta_2)$:
$$
\begin{aligned}
\hat{E}(h, D_{U_t}) &\le E(h, P_0(x, y)) + Disc_D(P_0(x, y), P_t(x, y), h) \\
&\le \hat{E}(h, D_L) + var(\mathcal{H}, n_l, k, \delta_1) + var(\mathcal{H}, n_u, k, \delta_2) + Disc_D(P_0(x, y), P_t(x, y), h)
\end{aligned}
\tag{15}
$$
Taking into account that in SSL a weighting function $w$ is often used to either weight or filter unlabeled examples, it is the weighted unlabeled data that truly plays a role in the learning process:
$$
\begin{aligned}
\hat{E}(h, D_{U_t}, w) &\le E(h, P_0(x, y)) + Disc_D(P_0(x, y), w(P_t(x, y)), h) \\
&\le \hat{E}(h, D_L) + var(\mathcal{H}, n_l, k, \delta_1) + var(\mathcal{H}, n^w_{u_t}, k, \delta_2) + Disc_D(P_0(x, y), P^w_t(x, y), h)
\end{aligned}
\tag{16}
$$
Now considering that labeled and unlabeled data are not only from inconsistent data distributions but also inconsistent data spaces, we need an extra feature mapping function to complete the features and an extra weighting function to filter out examples from new classes. Both the mapping function and the weighting function aim to project unlabeled data to the same space as labeled data.
According to eqs. (3), (4) and (16), for any pseudo-label predictor $h \in \mathcal{H}$, $0 \le \delta_1 \le 1$ and $0 \le \delta_2 \le 1$, with probability at least $(1 - \delta_1)(1 - \delta_2)$:
$$
\begin{aligned}
\hat{E}(h, w, map_{X_t \to X_0}, D_{U_t}) \le{} & \hat{E}(h, D_L) + var(\mathcal{H}, n_l, k_0, \delta_1) + var(\mathcal{H}, n^w_{u_t}, k_0, \delta_2) \\
& + \theta^w(t)\, Disc_L(P^w_t(x^*), Y_0) \\
& + \theta^w(t)\, Disc_F(\Pi_{Y_0}[P^w_t(x)], \Pi_{Y_0}[P^w_t(x^*)], map_{X_t \to X_0}, h) \\
& + \theta^w(t)\, Disc_D(P_0(x, y), \Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], h)
\end{aligned}
\tag{17}
$$
where $\hat{E}(h, D_L)$ is the empirical error of $h$ on $D_L$ and $\hat{E}(h, w, map_{X_t \to X_0}, D_{U_t})$ is the empirical error of $h$ on $D_{U_t}$ with ground truth labels.

Section: Proof of Theorem 2
We denote the mixture of two distributions $D_1$ and $D_2$ with proportion $\alpha$ as:
$$
Mix_{\alpha}(D_1, D_2) = \alpha D_1 + (1 - \alpha) D_2
\tag{18}
$$
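Operationally, drawing from $Mix_{\alpha}(D_1, D_2)$ means picking $D_1$ with probability $\alpha$ and $D_2$ otherwise. The sketch below illustrates this with the mixing weight $n_l / (n_l + n_u)$; the function names and the sampler-callback interface are assumptions of this example.

```python
import random


def mixture_weight(n_l, n_u):
    """Mixing proportion alpha = n_l / (n_l + n_u) of the labeled distribution."""
    return n_l / (n_l + n_u)


def sample_mixture(sample_d1, sample_d2, alpha, n, seed=0):
    """Draw n examples from Mix_alpha(D1, D2) = alpha * D1 + (1 - alpha) * D2.

    sample_d1 and sample_d2 are callables taking a random.Random instance and
    returning one example from D1 or D2 (a hypothetical interface).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    return [sample_d1(rng) if rng.random() < alpha else sample_d2(rng)
            for _ in range(n)]
```

For instance, with 100 labeled and 300 unlabeled examples the mixing weight is 0.25, so about a quarter of the draws come from the labeled distribution.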
In SSL with inconsistent distributions, the target predictor is trained with both the labeled dataset $D_L$ and the weighted unlabeled dataset with noisy pseudo-labels $\tilde{D}_{U_t}$. $D_L$ and $\tilde{D}_{U_t}$ can be considered as a mixed dataset with $n_l + n_{u_t}$ examples from the mixed distribution $Mix_{\frac{n_l}{n_l + n_{u_t}}}(P_0(x, y), P_t(x, y))$, whose noise rate is $\frac{n_{u_t}}{n_l + n_{u_t}} \hat{E}(h, D_{U_t})$.
So, for any target predictor $f \in \mathcal{F}$, pseudo-label predictor $h \in \mathcal{H}$, $0 \le \delta_3 \le 1$, with probability at least $1 - \delta_3$:
$$
\begin{aligned}
E(f, Mix_{\frac{n_l}{n_l + n_{u_t}}}(P_0(x, y), P_t(x, y)) \mid h, D_L, D_U) \le{} & \frac{n_l}{n_l + n_{u_t}} \hat{E}(f, D_L) + \frac{n_{u_t}}{n_l + n_{u_t}} \hat{E}(f, \tilde{D}_U) \\
& + var(\mathcal{F}, n_l + n_{u_t}, k, \delta_3) + \frac{n_{u_t}}{n_l + n_{u_t}} \hat{E}(h, D_U) \qquad (19) \\
\le{} & \frac{n_l}{n_l + n^w_{u_t}} \hat{E}(f, D_L) + \frac{n^w_{u_t}}{n_l + n^w_{u_t}} \hat{E}(f, \dots
\end{aligned}
$$

1. SSGMM (Shahshahani & Landgrebe, 1994) assumes that data is generated by a Gaussian mixture model, that is, the marginal distribution of examples can be expressed as a mixture of several Gaussian distributions, each with its own weight.
2. TSVM (Joachims et al., 1999) infers labels of unlabeled examples and finds dividing hyperplanes that maximize the distance from support vectors. It repeatedly finds pairs of unlabeled examples with different labels and exchanges their labels until no more pairs can be found.
3. Label Propagation (Zhu & Ghahramani, 2003) uses examples as nodes and the relationships between examples as edges. Its purpose is to propagate the labels from labeled data to unlabeled data through the graph.
4. Label Spreading (Zhou et al., 2003) penalizes misclassified labeled examples rather than banning misclassification completely, unlike Label Propagation, which fixes the labels of labeled examples during the spreading process.
5. Tri-Training (Zhou & Li, 2005) is a representative disagreement-based SSL algorithm. It uses three learners and induces divergence among them through data sampling. The disagreement between learners is utilized for interactive optimization.
6. Assemble (Bennett et al., 2002) extends AdaBoost to the field of SSL by giving pseudo-labels to unlabeled data. In each round it pays more attention to the examples that the current ensemble learner handles poorly, and it continuously improves robustness using new base learners.
A.4.2 DEEP SSL ALGORITHMS
1. Pseudo Label (Lee, 2013) takes the label with the highest confidence as the pseudo-label and uses the cross-entropy obtained from the pseudo-label as the unsupervised loss.
2. Pi-Model (Laine & Aila, 2017) augments the data twice randomly and uses the results of the two augmentations as separate inputs to the neural network. The inconsistency of the prediction results is used as the unsupervised loss.
3. Mean Teacher (Tarvainen & Valpola, 2017) relies on the idea of knowledge distillation, where the prediction results of the teacher model are used as pseudo-labels to train the student model to ensure the consistency of the prediction results. It uses the EMA of the student model's parameters as the teacher model.
4. VAT (Miyato et al., 2018) adds adversarial noise rather than random noise to the data so that the worst-case performance of the model improves when the data is affected by noise within a certain range, which corresponds to a zero-sum game in game theory and a Min-Max problem in optimization.
5. ICT (Verma et al., 2022) linearly interpolates data and prediction results by Mixup. The unsupervised loss is obtained from the interpolation consistency.
6. UDA (Xie et al., 2020) performs data augmentation on the unlabeled examples and then compares the prediction results before and after the augmentation. Thresholds are used for sample selection for both labeled and unlabeled data.
7. FixMatch (Sohn et al., 2020) uses both strong and weak data augmentation, and the inconsistency of prediction results between them is used as the unsupervised loss. A fixed threshold is used for sample selection.
8. FlexMatch (Zhang et al., 2021) uses a dynamic threshold based on FixMatch. It sets a lower confidence threshold for the classes that are more difficult to learn.
9. FreeMatch (Wang et al., 2022b) employs a more precise dynamic threshold, where the threshold setting takes into account both the model's training phase and the disparities between categories. It also incorporates a regularization term to facilitate equitable predictions between categories.
10. SoftMatch (Chen et al., 2023)

3. IMDB-Amazon: The IMDB and Amazon datasets can be considered as source and target domains respectively. All source domain data can be regarded as being obtained by sampling from $P_{source}(x, y)$ and all target domain data can be regarded as being obtained by sampling from $P_{target}(x, y)$. We set $P_0(x, y) = P_{source}(x, y)$ and $P_t(x, y) = P_{target}(x, y)$ for all $t \neq 0$.

Section: ACKNOWLEDGEMENTS
This research was supported by the National Science Foundation of China (62176118, 62306133).

Section: Published as a conference paper at ICLR 2024
We define the discrepancy between the distribution $P_t(x^*)$ and the label space $Y'$ as:
(3)
For the data distribution $P_t(x^*)$ and the observed data $P_t(x)$, we define their discrepancy on an arbitrary feature space $X'$ with respect to the feature mapping function $map_{X_t \to X'}$ and the model function $f$ defined on the domain of $X'$ as

$$Disc_F(P_t(x), P_t(x^*), map_{X_t \to X'}, f) = |P_{(x^*, y) \in P_t(x^*, y)}(f(map_{X^* \to X'}(x)) \neq y) - P_{(x, y) \in P_t(x, y)}(f(map_{X_t \to X'}(x)) \neq y)|. \quad (4)$$

For two data distributions $P'(x, y)$ and $P''(x, y)$ defined on the same feature space and label space, their distributional difference with respect to the model function $f$ can be defined by the following discrepancy:

$$Disc_D(P'(x, y), P''(x, y), f) = |P_{(x, y) \sim P'(x, y)}(f(x) \neq y) - P_{(x, y) \sim P''(x, y)}(f(x) \neq y)| \quad (5)$$

As a result, we can obtain the error rate of pseudo-labeling in the weighted or filtered unlabeled dataset obtained through robust SSL.
Theorem A.1. For any pseudo-label predictor $h \in H$, $0 \le \delta_1 \le 1$ and $0 \le \delta_2 \le 1$, with probability at least $(1 - \delta_1)(1 - \delta_2)$:

$$\begin{aligned}
\hat{E}(h, w, map_{X_t \to X_0}, D_{U_t}) \le{} & \hat{E}(h, D_L) + var(H, n_l, k_0, \delta_1) + var(H, n^w_{u_t}, k_0, \delta_2) + \theta^w(t) Disc_L(P^w_t(x^*), Y_0) \\
& + \theta^w(t) Disc_F(\Pi_{Y_0}[P^w_t(x)], \Pi_{Y_0}[P^w_t(x^*)], map_{X_t \to X_0}, h) \\
& + \theta^w(t) Disc_D(\Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], P_0(x, y), h)
\end{aligned}$$

where $\hat{E}(h, D_L)$ is the empirical error of $h$ on $D_L$ and $\hat{E}(h, w, map_{X_t \to X_0}, D_{U_t})$ is the empirical error of $h$ on $D_U$ with ground truth labels.
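The factor $\theta^w(t)$ that scales every discrepancy term in this bound is the share of the total weight carried by the inconsistent unlabeled examples. The following is a minimal sketch of that ratio, assuming (as the index range in its definition suggests) that the unlabeled examples are ordered so that the last $\theta(t) \cdot n_u$ of them are the inconsistent ones; the function name and the example weights are hypothetical and are not taken from the released toolkit.

```python
import numpy as np

def weighted_inconsistency_rate(weights, theta):
    """Share of total weight on the inconsistent examples.

    Assumption: `weights` is ordered so that the last theta * n_u entries
    belong to the inconsistent unlabeled examples.
    """
    w = np.asarray(weights, dtype=float)
    n_u = len(w)
    # 0-based index of the first inconsistent example,
    # i.e. the 1-based index (1 - theta) * n_u + 1 in the definition.
    start = int(round((1.0 - theta) * n_u))
    total = w.sum()
    if total == 0.0:
        return 0.0
    return float(w[start:].sum() / total)

# Uniform weights recover the unweighted rate theta(t).
uniform_rate = weighted_inconsistency_rate(np.ones(10), theta=0.3)

# Down-weighting the inconsistent examples shrinks the rate,
# and with it every discrepancy term in the bound.
filtered_rate = weighted_inconsistency_rate([1.0] * 7 + [0.1] * 3, theta=0.3)
```

With uniform weights the ratio reduces to $\theta(t)$, which illustrates why a weighting function that suppresses inconsistent examples tightens the bound.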
Based on the above label noise rate bound of the unlabeled dataset, we can estimate the generalization error bound of the robust SSL algorithm trained with the labeled dataset and this unlabeled dataset.
Theorem A.2. For any target predictor $f \in F$, pseudo-label predictor $h \in H$, $0 \le \delta_1 \le 1$, $0 \le \delta_2 \le 1$ and $0 \le \delta_3 \le 1$, with probability at least (1
where $\hat{E}(f, \tilde{D}^w_{U_t})$ is the weighted disagreement rate between the noisy pseudo-labels and the prediction results of $f$ on the weighted unlabeled dataset $\tilde{D}^w_{U_t}$.

Section: A.3 FORMULA DERIVATION AND THEORETICAL PROOF


Section: A.3.1 DERIVATION OF EVALUATION METRICS
Since the formulas for the metrics AUC, EA, WA, and EVM can be directly obtained through definitions, we primarily focus on deriving the formulas for the metrics VS and RCC.
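To make the metric definitions concrete, the following is a minimal numerical sketch that evaluates AUC, EA, WA, EVM, VS, and RCC from accuracies measured on a grid of inconsistency rates. It assumes the Robustness Analysis Curve is the piecewise-linear interpolation of the measured points and approximates the integrals with the trapezoidal rule; the function name `rac_metrics` and these discretization choices are illustrative assumptions, not necessarily what the released toolkit does. RCC is computed as the Pearson correlation between Acc(t) and t under a uniform t, and is returned as NaN for a flat curve.

```python
import numpy as np

def _trapz(y, t):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)

def rac_metrics(t, acc, p_t=None):
    """Robustness metrics of a sampled accuracy curve Acc(t), t in [0, 1]."""
    t = np.asarray(t, dtype=float)
    acc = np.asarray(acc, dtype=float)
    # Density of t used by EA; uniform on [0, 1] if not given (assumption).
    p_t = np.ones_like(t) if p_t is None else np.asarray(p_t, dtype=float)

    auc = _trapz(acc, t)
    ea = _trapz(p_t * acc, t)
    wa = float(acc.min())

    # Derivative of the piecewise-linear curve: one slope per interval.
    dt = np.diff(t)
    slope = np.diff(acc) / dt
    evm = float(np.sum(np.abs(slope) * dt))
    mean_slope = float(np.sum(slope * dt))
    vs = float(np.sum((slope - mean_slope) ** 2 * dt))

    # Pearson correlation between Acc(t) and t.
    mean_t = _trapz(t, t)
    cov = _trapz(acc * t, t) - auc * mean_t
    var_t = _trapz(t * t, t) - mean_t ** 2
    var_acc = _trapz(acc * acc, t) - auc ** 2
    if var_t > 1e-12 and var_acc > 1e-12:
        rcc = float(cov / np.sqrt(var_t * var_acc))
    else:
        rcc = float("nan")  # undefined for a constant curve

    return {"AUC": auc, "EA": ea, "WA": wa, "EVM": evm, "VS": vs, "RCC": rcc}

# Example: accuracy decaying linearly from 0.8 to 0.6 as t goes from 0 to 1.
grid = np.linspace(0.0, 1.0, 6)
metrics = rac_metrics(grid, 0.8 - 0.2 * grid)
```

For a linearly decaying curve the slope is constant, so VS vanishes and RCC equals -1, which matches the intended reading of RCC as the direction and strength of the accuracy trend.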
where $E(f, Mix_{\frac{n_l}{n_l+n_{u_t}}}(P_0(x, y), P_t(x, y)) \mid h, D_L, D_U)$ is the generalization error of $f$ on the distribution $Mix_{\frac{n_l}{n_l+n_{u_t}}}(P_0(x, y), P_t(x, y))$ corresponding to the pseudo-label predictor $h$.
When labeled data and unlabeled data are from different distributions, for any $f \in F$:
According to eqs. (16), (19) and (20), for any target predictor $f \in F$, pseudo-label predictor $h \in H$,
$+ Disc_D(f, P_0(x, y), Mix_{\frac{n_l}{n_l+n_{u_t}}}(P_0(x, y), P_t(x, y)))$
$+ \frac{n_{u_t}}{n_l + n_{u_t}} (\hat{E}(h, D_L) + var(H, n_l, k, \delta_2) + var(H, n_{u_t}, k, \delta_3) + Disc_D(h, P_0(x, y), P_t(x, y)))$
where $\hat{E}(f, \tilde{D}_{U_t})$ is the weighted empirical inconsistency rate between the noisy pseudo-labels and the prediction results of $f$ on the unlabeled dataset $\tilde{D}_{U_t}$.
Taking into account inconsistent label spaces and the weighting function $w$:
$+ Disc_D(f, P_0(x, y), Mix_{\frac{n_l}{n_l+n^w_{u_t}}}(P_0(x, y), P^w_t(x, y)))$
Taking into account inconsistent feature spaces and the mapping function $map_{X_t \to X_0}$, the final error bound can be obtained.
According to eqs. (3), (4), (17), (21) and (22), for any target predictor $f \in F$, pseudo-label predictor $h \in H$, $0 \le \delta_1 \le 1$, $0 \le \delta_2 \le 1$ and $0 \le \delta_3 \le 1$, with probability at least (1
Robust Deep SSL Algorithms
1. UASD: the ratio of unsupervised loss $\lambda_u$ is set to 1.0, and the threshold is set to 0.95.
2. CAFA: the base SSL algorithm used is Pi-Model, the warmup rate of unsupervised loss $w_u$ is set to 4/15, the perturbation magnitude $\epsilon$ is set to 0.014, the Beta distribution parameter $\alpha$ is set to 0.75, the warmup rate of adversarial loss $w_a$ is set to 8/15, the ratio of unsupervised loss $\lambda_u$ is exp(-5 …) and the ratio of adversarial loss $\lambda_a$ is exp(-5 …) in the $t$-th iteration, where $T$ is the number of iterations.
3. MTCF: the ratio of unsupervised loss $\lambda_u$ is set to 75, the temperature $T$ is set to 0.5, and the parameter of the Beta distribution in Mixup is set to 0.75.
4. Fix-A-Step: the parameter of the Beta distribution in Mixup is set to 0.75, FixMatch is used as the base SSL method, and all the hyperparameters are the same as FixMatch.
Data Augmentation
1. Agnews and IMDB/Amazon: the weak and strong augmentations are synonym replacements of 1 and 5 words, respectively.
2. Wine, Iris, Letter: the weak and strong augmentations add Gaussian noise with rates of 0.1 and 0.2, respectively.
3. CIFAR10, CIFAR100, Image-CLEF: the weak augmentation is RandomHorizontalFlip, and the strong augmentation is RandAugment.
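For the tabular datasets, the weak and strong views can be sketched as additive Gaussian noise. The text does not state what the "rate" parameterizes, so this sketch assumes it is the standard deviation of zero-mean noise applied to (standardized) features; the helper names are hypothetical.

```python
import numpy as np

def gaussian_noise_augment(x, rate, rng):
    """Add zero-mean Gaussian noise; `rate` is assumed to be the noise std."""
    x = np.asarray(x, dtype=float)
    return x + rng.normal(loc=0.0, scale=rate, size=x.shape)

def weak_strong_views(x, rng, weak_rate=0.1, strong_rate=0.2):
    """Return a (weak, strong) pair of views of the same batch."""
    weak = gaussian_noise_augment(x, weak_rate, rng)
    strong = gaussian_noise_augment(x, strong_rate, rng)
    return weak, strong

rng = np.random.default_rng(0)
batch = np.zeros((2000, 4))  # placeholder batch of standardized features
weak_view, strong_view = weak_strong_views(batch, rng)
```

Consistency-based methods such as FixMatch would then take pseudo-labels from the weak view and enforce them on the strong view.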
Other Hyper-Parameters
1. batch size: 8 for the IMDB/Amazon dataset, 16 for the Agnews dataset, 32 for the Image-CLEF dataset, 64 for CIFAR10 and CIFAR100, and 64 for the tabular datasets.
2. iteration: the number of iterations is 2000 for the Image-CLEF dataset, 1000 for the tabular datasets, 5000 for Agnews and IMDB/Amazon, and 100000 for CIFAR10 and CIFAR100.
3. optimizer: the optimizer for all datasets is SGD with a learning rate of 5e-4 and a momentum of 0.9.
4. scheduler: the scheduler for all datasets is CosineWarmup with num cycles 7/16.
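The settings above can be collected into a small configuration sketch. The learning-rate rule below assumes that "CosineWarmup with num cycles 7/16" denotes the cosine schedule commonly used in FixMatch-style training, where the learning rate is multiplied by cos(pi * num_cycles * progress) after an optional linear warmup; the warmup length is not given in the text, so `warmup_steps` is a hypothetical parameter defaulting to 0, and the dictionary keys are illustrative.

```python
import math

BASE_LR = 5e-4       # SGD learning rate for all datasets
MOMENTUM = 0.9       # SGD momentum for all datasets
NUM_CYCLES = 7 / 16  # cosine schedule parameter

BATCH_SIZE = {
    "IMDB/Amazon": 8, "Agnews": 16, "Image-CLEF": 32,
    "CIFAR10": 64, "CIFAR100": 64, "tabular": 64,
}
ITERATIONS = {
    "Image-CLEF": 2000, "tabular": 1000, "Agnews": 5000,
    "IMDB/Amazon": 5000, "CIFAR10": 100000, "CIFAR100": 100000,
}

def cosine_warmup_lr(step, total_steps, warmup_steps=0,
                     base_lr=BASE_LR, num_cycles=NUM_CYCLES):
    """Learning rate at `step` under the assumed cosine-with-warmup rule."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, math.cos(math.pi * num_cycles * progress))

total = ITERATIONS["tabular"]
lr_start = cosine_warmup_lr(0, total)
lr_mid = cosine_warmup_lr(total // 2, total)
lr_end = cosine_warmup_lr(total, total)
```

Under this reading the learning rate never reaches zero: with num cycles 7/16 it ends at cos(7*pi/16), roughly 0.195 of its initial value.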

Section: A.6 RESULTS UNDER INCONSISTENT DISTRIBUTION
The experimental results can be referenced in tables 13 to 17.

Section: A.7 RESULTS UNDER INCONSISTENT FEATURE SPACE
The experimental results can be referenced in tables 18 to 20.

Section: A.8 RESULTS UNDER INCONSISTENT LABEL SPACE
The experimental results can be referenced in tables 21 to 23. 


References:
[b0] Kristin P Bennett; Ayhan Demiriz; Richard Maclin (2002). Exploiting unlabeled data in ensemble methods. 
[b1] Barbara Caputo; Henning Müller; Jesus Martinez-Gomez; Mauricio Villegas; Burak Acar; Novi Patricia; Neda Marvasti; Suzan Üsküdarlı; Roberto Paredes; Miguel Cazorla (2014). Imageclef 2014: Overview and analysis of the results. 
[b2] Olivier Chapelle; Bernhard Schölkopf; Alexander Zien (2006). Semi-Supervised Learning. The MIT Press
[b3] Hao Chen; Ran Tao; Yue Fan; Yidong Wang; Jindong Wang; Bernt Schiele; Xing Xie; Bhiksha Raj; Marios Savvides (2023). Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning. 
[b4] Tianqi Chen; Carlos Guestrin (2016). Xgboost: A scalable tree boosting system. 
[b5] Yanbei Chen; Xiatian Zhu; Wei Li; Shaogang Gong (2020). Semi-supervised learning under class distribution mismatch. 
[b6] Lan-Zhe Guo; Yu-Feng Li (2018). A general formulation for safely exploiting weakly supervised data. 
[b7] Lan-Zhe Guo; Yu-Feng Li (2022). Class-imbalanced semi-supervised learning with adaptive thresholding. 
[b8] Lan-Zhe Guo; Zhen-Yu Zhang; Yuan Jiang; Yu-Feng Li; Zhi-Hua Zhou (2020). Safe deep semisupervised learning for unseen-class unlabeled data. 
[b9] Lan-Zhe Guo; Zhi Zhou; Yu-Feng Li (2020). Record: Resource constrained semi-supervised learning under distribution shift. 
[b10] Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun (2016). Deep residual learning for image recognition. 
[b11] Zhe Huang; Mary-Joy Sidhom; Benjamin Wessler; Michael C Hughes (2023). Fix-a-step: Semisupervised learning from uncurated unlabeled data. 
[b12] Zhuo Huang; Chao Xue; Bo Han; Jian Yang; Chen Gong (2021). Universal semi-supervised learning. 
[b13] Lin-Han Jia; Lan-Zhe Guo; Zhi Zhou; Jie-Jing Shao; Yuke Xiang; Yu-Feng Li (2023). Bidirectional adaptation for robust semi-supervised learning with inconsistent data distributions. 
[b14] Lin-Han Jia; Lan-Zhe Guo; Zhi Zhou; Yu-Feng Li (2023). Lamda-ssl: Semi-supervised learning in python. 
[b15] Thorsten Joachims (1999). Transductive inference for text classification using support vector machines. 
[b16] Samuli Laine; Timo Aila (2017). Temporal ensembling for semi-supervised learning. 
[b17] Dong-Hyun Lee (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. 
[b18] Yu-Feng Li; Lan-Zhe Guo; Zhi-Hua Zhou (2021). Towards safe weakly supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence
[b19] Yinhan Liu; Myle Ott; Naman Goyal; Jingfei Du; Mandar Joshi; Danqi Chen; Omer Levy; Mike Lewis; Luke Zettlemoyer; Veselin Stoyanov (2019). Roberta: A robustly optimized bert pretraining approach. 
[b20] Andrew Maas; Raymond E Daly; Peter T Pham; Dan Huang; Andrew Y Ng; Christopher Potts (2011). Learning word vectors for sentiment analysis. 
[b21] Julian Mcauley; Jure Leskovec (2013). Hidden factors and hidden topics: understanding rating dimensions with review text. 
[b22] Takeru Miyato; Shin-Ichi Maeda; Masanori Koyama; Shin Ishii (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence
[b23]  Balas K Natarajan (1989). On learning sets and functions. Machine Learning
[b24] Avital Oliver; Augustus Odena; Colin A Raffel; Ekin Dogus Cubuk; Ian Goodfellow (2018). Realistic evaluation of deep semi-supervised learning algorithms. 
[b25] Kuniaki Saito; Donghyun Kim; Kate Saenko (2021). Openmatch: Open-set semi-supervised learning with open-set consistency regularization. 
[b26] Behzad M Shahshahani; David A Landgrebe (1994). The effect of unlabeled samples in reducing the small sample size problem and mitigating the hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing
[b27] Kihyuk Sohn; David Berthelot; Nicholas Carlini; Zizhao Zhang; Han Zhang; Colin A Raffel; Ekin Dogus Cubuk; Alexey Kurakin; Chun-Liang Li (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. 
[b28] Antti Tarvainen; Harri Valpola (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. 
[b29] V N Vapnik; A Ya Chervonenkis (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications
[b30] Vikas Verma; Kenji Kawaguchi; Alex Lamb; Juho Kannala; Arno Solin; Yoshua Bengio; David Lopez-Paz (2022). Interpolation consistency training for semi-supervised learning. Neural Networks
[b31] Feiyu Wang; Qin Wang; Wen Li; Dong Xu; Luc Van Gool (2022). Revisiting deep semi-supervised learning: An empirical distribution alignment framework and its generalization bound. 
[b32] Yidong Wang; Hao Chen; Qiang Heng; Wenxin Hou; Yue Fan; Zhen Wu; Jindong Wang; Marios Savvides; Takahiro Shinozaki; Bhiksha Raj (2022). Freematch: Self-adaptive thresholding for semi-supervised learning. 
[b33] Tong Wei; Qian-Yu Liu; Jiang-Xin Shi; Wei-Wei Tu; Lan-Zhe Guo (2022). Transfer and share: semisupervised learning from long-tailed data. 
[b34] Qizhe Xie; Zihang Dai; Eduard Hovy; Thang Luong; Quoc Le (2020). Unsupervised data augmentation for consistency training. 
[b35] Qing Yu; Daiki Ikami; Go Irie; Kiyoharu Aizawa (2020). Multi-task curriculum framework for open-set semi-supervised learning. 
[b36] Bowen Zhang; Yidong Wang; Wenxin Hou; Hao Wu; Jindong Wang; Manabu Okumura; Takahiro Shinozaki (2021). Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. 
[b37] Xiang Zhang; Junbo Zhao; Yann Lecun (2015). Character-level convolutional networks for text classification. 
[b38] Dengyong Zhou; Olivier Bousquet; Thomas Lal; Jason Weston; Bernhard Schölkopf (2003). Learning with local and global consistency. 
[b39] Zhi Zhou; Lan-Zhe Guo; Zhanzhan Cheng; Yu-Feng Li; Shiliang Pu (2021). Step: Out-of-distribution detection in the presence of limited in-distribution labeled data. 
[b40] Zhi-Hua Zhou (2022). Open-environment machine learning. National Science Review
[b41] Zhi-Hua Zhou; Ming Li (2005). Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering
[b42] Xiaojin Zhu; Zoubin Ghahramani (2003). Semi-supervised learning using gaussian fields and harmonic functions. 

Agnews: $(k + 1)/2$ classes of all examples are used as source domain data and the rest are used as target domain data. 100 examples of the source domain are used as labeled data. For inconsistency rate $t$, the unlabeled dataset combines $n_t \cdot (1 - t)$ examples from the source domain and $n_t \cdot t$ examples from the target domain, where $n_t$ is the number of target domain examples. We set $P_0(x, y) = P_{source}(x, y)$ and $P_t(x, y) = P_{target}(x, y)$.
FT-Transformer: the number of layers is set to 8, the dimension of tokens is set to 192, and the number of heads is set to 8.
ResNet50: the ResNet50 pre-trained on ImageNet from torchvision.models is directly used.
RoBERTa: the pre-trained model "roberta-base" from the transformers package is directly used.
TSVM: the parameter $C_l$ is set to 15.
Label Propagation: the default hyperparameters provided by scikit-learn are used.
Label Spreading: the default hyperparameters provided by scikit-learn are used.
Tri-Training: all the base learners are set to the XGBoost classifier, consistent with the baseline.
Assemble: the number of iterations $T$ is set to 30, and all the base learners are set to the XGBoost classifier.
Pi-Model: the ratio of unsupervised loss $\lambda_u$ is set to 1.0, the warmup rate of unsupervised loss $w_u$ is set to 0.4, and the ratio of unsupervised loss $\lambda_u$ is set to $\max(\frac{t}{T \cdot w}, 1.0)$.
Mean Teacher: the EMA decay is set to 0.999.
VAT: the ratio of unsupervised loss $\lambda_u$ is set to 0.3, and the ratio of entropy minimization loss $\lambda_{entmin}$ is set to 0.06.
ICT: the ratio of unsupervised loss $\lambda_u$ is set to 100, and the parameter of the Beta distribution in Mixup is set to 0.
UDA: the ratio of unsupervised loss $\lambda_u$ is set to 1.0, and the threshold is set to 0.8.
FixMatch: the ratio of unsupervised loss $\lambda_u$ is set to 1.0, and the threshold is set to 0.95.
FlexMatch: the ratio of unsupervised loss $\lambda_u$ is set to 1.0.
FreeMatch: the ratio of unsupervised loss $\lambda_u$ is set to 1.0, and the EMA decay is set to 0.999.
SoftMatch: the ratio of unsupervised loss $\lambda_u$ is set to 1.0.

Figures:
Figure tab_0: 1
Type: table
Caption: Performance Metrics for Robust Semi-Supervised Learning in Open Environments. $Acc(t)$ describes how model accuracy changes with the inconsistency extent $t$, $P_T(t)$ is the distribution of $t$, and $Acc'(\cdot)$ denotes the first derivative.
Data: Area Under the Curve (AUC): $AUC(Acc) = \int_0^1 Acc(t)\,dt$
Expected Accuracy (EA): EA(P_T

Figure tab_1: 2
Type: table
Caption: Evaluation of SSL algorithms using letter dataset under inconsistent data distributions
Data:
| Model | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|
| XGBoost | 0.643 | 0.643 | 0.643 | - | - | - |
| TSVM | 0.624 | 0.650 | 0.607 | 0.012 | 0.012 | -0.733 |
| SSGMM | 0.276 | 0.334 | 0.245 | 0.022 | 0.029 | -0.740 |
| Label Propagation | 0.524 | 0.629 | 0.486 | 0.029 | 0.037 | -0.833 |
| Label Spreading | 0.588 | 0.653 | 0.563 | 0.020 | 0.025 | -0.780 |
| Tri-Training | 0.625 | 0.689 | 0.600 | 0.018 | 0.024 | -0.851 |
| Assemble | 0.644 | 0.646 | 0.641 | 0.003 | 0.003 | 0.037 |
| FT-Transformer | 0.660 | 0.660 | 0.660 | - | - | - |
| Pseudo Label | 0.658 | 0.667 | 0.652 | 0.003 | 0.002 | -0.977 |
| Pi-Model | 0.683 | 0.698 | 0.673 | 0.005 | 0.003 | -0.961 |
| Mean Teacher | 0.687 | 0.687 | 0.687 | 0.000 | 0.000 | - |
| VAT | 0.702 | 0.725 | 0.686 | 0.008 | 0.005 | -0.969 |
| ICT | 0.677 | 0.677 | 0.677 | 0.000 | 0.000 | - |
| UDA | 0.693 | 0.693 | 0.693 | 0.000 | 0.000 | - |
| FixMatch | 0.644 | 0.739 | 0.588 | 0.032 | 0.022 | -0.947 |
| FlexMatch | 0.649 | 0.687 | 0.618 | 0.014 | 0.010 | -0.973 |
| FreeMatch | 0.474 | 0.629 | 0.406 | 0.050 | 0.053 | -0.813 |
| SoftMatch | 0.584 | 0.654 | 0.564 | 0.020 | 0.032 | -0.664 |
| UASD | 0.701 | 0.702 | 0.700 | 0.001 | 0.002 | 0.008 |
| CAFA | 0.658 | 0.659 | 0.657 | 0.001 | 0.001 | -0.266 |
| MTCF | 0.365 | 0.612 | 0.270 | 0.081 | 0.102 | -0.702 |
| Fix-A-Step | 0.682 | 0.739 | 0.642 | 0.019 | 0.013 | -0.976 |

Figure tab_2: 3
Type: table
Caption: Evaluation of deep SSL methods using ImageNet/Caltech dataset.
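Data:
| Dataset | Model | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|---|
| ImageNet/Caltech | Supervised | 0.909 | 0.909 | 0.909 | - | - | - |
| | Pseudo Label | 0.907 | 0.908 | 0.907 | 0.001 | 0.001 | -0.621 |
| | Pi-Model | 0.909 | 0.907 | 0.907 | 0.001 | 0.001 | 0.655 |
| | Mean Teacher | 0.903 | 0.904 | 0.900 | 0.003 | 0.003 | 0.169 |
| | VAT | 0.888 | 0.881 | 0.881 | 0.002 | 0.002 | 0.928 |
| | ICT | 0.907 | 0.909 | 0.903 | 0.003 | 0.004 | -0.359 |
| | UDA | 0.896 | 0.904 | 0.891 | 0.006 | 0.007 | -0.512 |
| | FixMatch | 0.902 | 0.905 | 0.887 | 0.005 | 0.007 | -0.726 |
| | FlexMatch | 0.906 | 0.921 | 0.893 | 0.008 | 0.010 | -0.861 |
| | FreeMatch | 0.864 | 0.916 | 0.832 | 0.031 | 0.028 | -0.786 |
| | SoftMatch | 0.904 | 0.908 | 0.891 | 0.007 | 0.007 | -0.805 |
| | UASD | 0.897 | 0.897 | 0.897 | 0.000 | 0.000 | - |
| | CAFA | 0.893 | 0.892 | 0.889 | 0.002 | 0.002 | 0.820 |
| | MTCF | 0.880 | 0.904 | 0.855 | 0.016 | 0.015 | -0.841 |
| | Fix-A-Step | 0.869 | 0.876 | 0.856 | 0.007 | 0.011 | -0.347 |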

Figure tab_3: 4
Type: table
Caption: Evaluation on IMDB/Amazon dataset with 100 labels under inconsistent data distributions
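Data:
| Dataset | Model | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|---|
| IMDB/Amazon | Supervised | 0.571 | 0.571 | 0.571 | - | - | - |
| | Pseudo Label | 0.634 | 0.545 | 0.545 | 0.084 | 0.092 | 0.296 |
| | Pi-Model | 0.597 | 0.615 | 0.535 | 0.051 | 0.056 | 0.504 |
| | Mean Teacher | 0.601 | 0.570 | 0.538 | 0.096 | 0.101 | -0.012 |
| | UDA | 0.599 | 0.523 | 0.523 | 0.071 | 0.080 | 0.484 |
| | FixMatch | 0.530 | 0.540 | 0.500 | 0.027 | 0.031 | -0.604 |
| | FlexMatch | 0.545 | 0.502 | 0.502 | 0.064 | 0.071 | 0.169 |
| | FreeMatch | 0.537 | 0.591 | 0.502 | 0.036 | 0.035 | -0.615 |
| | SoftMatch | 0.532 | 0.553 | 0.513 | 0.019 | 0.020 | -0.347 |
| | UASD | 0.593 | 0.541 | 0.541 | 0.046 | 0.055 | 0.580 |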

Figure tab_4: 5
Type: table
Caption: Evaluation of SSL algorithms using letter dataset under inconsistent feature space
Data:
| Model | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|
| XGBoost | 0.694 | 0.694 | 0.694 | - | - | - |
| TSVM | 0.683 | 0.721 | 0.635 | 0.017 | 0.005 | -0.991 |
| SSGMM | 0.412 | 0.503 | 0.309 | 0.039 | 0.008 | -0.996 |
| Label Propagation | 0.589 | 0.642 | 0.542 | 0.020 | 0.012 | -0.978 |
| Label Spreading | 0.668 | 0.695 | 0.598 | 0.019 | 0.022 | -0.857 |
| Tri-Training | 0.696 | 0.716 | 0.664 | 0.010 | 0.008 | -0.945 |
| Assemble | 0.675 | 0.675 | 0.671 | 0.004 | 0.005 | 0.341 |
| FT-Transformer | 0.490 | 0.490 | 0.490 | - | - | - |
| Pseudo Label | 0.534 | 0.538 | 0.532 | 0.002 | 0.002 | -0.401 |
| Pi-Model | 0.541 | 0.552 | 0.537 | 0.003 | 0.004 | -0.761 |
| Mean Teacher | 0.517 | 0.517 | 0.517 | 0.000 | 0.000 | - |
| VAT | 0.541 | 0.561 | 0.535 | 0.007 | 0.008 | -0.720 |
| ICT | 0.540 | 0.540 | 0.540 | 0.000 | 0.000 | - |
| UDA | 0.537 | 0.537 | 0.537 | 0.000 | 0.000 | - |
| FixMatch | 0.499 | 0.548 | 0.470 | 0.022 | 0.035 | -0.237 |
| FlexMatch | 0.435 | 0.470 | 0.406 | 0.020 | 0.029 | -0.015 |
| FreeMatch | 0.447 | 0.409 | 0.409 | 0.014 | 0.008 | 0.977 |
| SoftMatch | 0.501 | 0.536 | 0.475 | 0.020 | 0.022 | -0.231 |
| UASD | 0.552 | 0.553 | 0.549 | 0.001 | 0.002 | -0.530 |
| CAFA | 0.511 | 0.511 | 0.510 | 0.001 | 0.001 | -0.358 |
| MTCF | 0.415 | 0.278 | 0.278 | 0.042 | 0.034 | 0.932 |
| Fix-A-Step | 0.511 | 0.561 | 0.490 | 0.020 | 0.032 | -0.311 |

Figure tab_5: 6
Type: table
Caption: Evaluation on CIFAR10 dataset under inconsistent feature spaces
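Data:
| Dataset | Method | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|---|
| CIFAR10 | Supervised | 0.473 | 0.473 | 0.473 | - | - | - |
| | Pseudo Label | 0.519 | 0.524 | 0.515 | 0.002 | 0.003 | -0.874 |
| | Pi-Model | 0.500 | 0.511 | 0.485 | 0.007 | 0.007 | -0.882 |
| | Mean Teacher | 0.470 | 0.486 | 0.457 | 0.006 | 0.005 | -0.962 |
| | VAT | 0.501 | 0.550 | 0.466 | 0.020 | 0.018 | -0.880 |
| | ICT | 0.468 | 0.476 | 0.456 | 0.005 | 0.005 | -0.929 |
| | UDA | 0.498 | 0.505 | 0.438 | 0.019 | 0.025 | -0.707 |
| | FixMatch | 0.517 | 0.551 | 0.430 | 0.037 | 0.042 | -0.661 |
| | FlexMatch | 0.552 | 0.607 | 0.431 | 0.041 | 0.039 | -0.921 |
| | FreeMatch | 0.555 | 0.645 | 0.423 | 0.045 | 0.029 | -0.962 |
| | SoftMatch | 0.559 | 0.661 | 0.453 | 0.042 | 0.009 | -0.998 |
| | UASD | 0.481 | 0.486 | 0.479 | 0.003 | 0.003 | -0.625 |
| | CAFA | 0.484 | 0.502 | 0.469 | 0.007 | 0.003 | -0.988 |
| | MTCF | 0.496 | 0.625 | 0.316 | 0.107 | 0.130 | -0.604 |
| | Fix-A-Step | 0.516 | 0.551 | 0.424 | 0.025 | 0.032 | -0.832 |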

Figure tab_6: 7
Type: table
Caption: Evaluation on Agnews under inconsistent feature spaces
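Data:
| Dataset | Method | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|---|
| Agnews | Supervised | 0.844 | 0.844 | 0.844 | - | - | - |
| | Pseudo Label | 0.849 | 0.847 | 0.844 | 0.005 | 0.006 | 0.480 |
| | Pi-Model | 0.865 | 0.870 | 0.859 | 0.003 | 0.003 | -0.874 |
| | Mean Teacher | 0.851 | 0.856 | 0.841 | 0.004 | 0.004 | -0.890 |
| | UDA | 0.844 | 0.862 | 0.802 | 0.022 | 0.029 | -0.686 |
| | FixMatch | 0.870 | 0.880 | 0.858 | 0.005 | 0.005 | -0.944 |
| | FlexMatch | 0.848 | 0.877 | 0.810 | 0.021 | 0.021 | -0.829 |
| | FreeMatch | 0.876 | 0.872 | 0.868 | 0.008 | 0.009 | -0.131 |
| | SoftMatch | 0.875 | 0.880 | 0.865 | 0.005 | 0.005 | -0.815 |
| | UASD | 0.849 | 0.854 | 0.837 | 0.010 | 0.012 | -0.007 |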

Figure tab_7: 8
Type: table
Caption: Evaluation of SSL algorithms using letter dataset under inconsistent label space
Data:
| Model | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|
| XGBoost | 0.694 | 0.694 | 0.694 | - | - | - |
| TSVM | 0.683 | 0.721 | 0.635 | 0.017 | 0.005 | -0.991 |
| SSGMM | 0.412 | 0.503 | 0.309 | 0.039 | 0.008 | -0.996 |
| Label Propagation | 0.589 | 0.642 | 0.542 | 0.02 | 0.012 | -0.978 |
| Label Spreading | 0.668 | 0.695 | 0.598 | 0.019 | 0.022 | -0.857 |
| Tri-Training | 0.696 | 0.716 | 0.664 | 0.010 | 0.008 | -0.945 |
| Assemble | 0.675 | 0.675 | 0.671 | 0.004 | 0.005 | 0.341 |
| FT-Transformer | 0.628 | 0.628 | 0.628 | - | - | - |
| Pseudo Label | 0.628 | 0.634 | 0.620 | 0.003 | 0.002 | -0.970 |
| Pi-Model | 0.649 | 0.653 | 0.639 | 0.005 | 0.007 | -0.673 |
| Mean Teacher | 0.635 | 0.635 | 0.635 | 0.000 | 0.000 | - |
| VAT | 0.640 | 0.656 | 0.622 | 0.007 | 0.004 | -0.984 |
| ICT | 0.607 | 0.607 | 0.607 | 0.000 | 0.000 | - |
| UDA | 0.606 | 0.605 | 0.605 | 0.001 | 0.001 | 0.657 |
| FixMatch | 0.602 | 0.663 | 0.556 | 0.021 | 0.012 | -0.983 |
| FlexMatch | 0.644 | 0.662 | 0.621 | 0.008 | 0.006 | -0.975 |
| FreeMatch | 0.528 | 0.634 | 0.447 | 0.042 | 0.041 | -0.937 |
| SoftMatch | 0.638 | 0.657 | 0.613 | 0.009 | 0.008 | -0.946 |
| UASD | 0.638 | 0.640 | 0.628 | 0.005 | 0.007 | 0.138 |
| CAFA | 0.620 | 0.622 | 0.615 | 0.002 | 0.002 | -0.918 |
| MTCF | 0.547 | 0.668 | 0.417 | 0.050 | 0.027 | -0.984 |
| Fix-A-Step | 0.648 | 0.668 | 0.614 | 0.013 | 0.008 | -0.966 |

Figure tab_8: 9
Type: table
Caption: Evaluation on CIFAR10 under inconsistent label spaces
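Data:
| Dataset | Method | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|---|
| CIFAR10 | Supervised | 0.643 | 0.643 | 0.643 | - | - | - |
| | Pseudo Label | 0.692 | 0.708 | 0.676 | 0.006 | 0.004 | -0.973 |
| | Pi-Model | 0.672 | 0.703 | 0.654 | 0.01 | 0.009 | -0.937 |
| | Mean Teacher | 0.639 | 0.647 | 0.634 | 0.003 | 0.005 | -0.333 |
| | VAT | 0.697 | 0.734 | 0.661 | 0.015 | 0.011 | -0.974 |
| | ICT | 0.643 | 0.647 | 0.642 | 0.002 | 0.002 | -0.819 |
| | UDA | 0.676 | 0.73 | 0.594 | 0.027 | 0.015 | -0.963 |
| | FixMatch | 0.608 | 0.705 | 0.479 | 0.047 | 0.036 | -0.933 |
| | FlexMatch | 0.731 | 0.806 | 0.614 | 0.038 | 0.02 | -0.965 |
| | FreeMatch | 0.733 | 0.815 | 0.640 | 0.035 | 0.012 | -0.994 |
| | SoftMatch | 0.723 | 0.806 | 0.601 | 0.041 | 0.021 | -0.968 |
| | UASD | 0.644 | 0.641 | 0.641 | 0.002 | 0.002 | 0.404 |
| | CAFA | 0.675 | 0.674 | 0.672 | 0.005 | 0.006 | 0.093 |
| | MTCF | 0.747 | 0.798 | 0.681 | 0.024 | 0.008 | -0.989 |
| | Fix-A-Step | 0.681 | 0.757 | 0.517 | 0.048 | 0.048 | -0.908 |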

Figure tab_9: 10
Type: table
Caption: Evaluation on Agnews under inconsistent label spaces
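Data:
| Dataset | Method | AUC | Acc(0) | WA | EVM | VS | RCC |
|---|---|---|---|---|---|---|---|
| Agnews | Supervised | 0.961 | 0.961 | 0.961 | - | - | - |
| | Pseudo Label | 0.960 | 0.956 | 0.956 | 0.007 | 0.006 | 0.307 |
| | Pi-Model | 0.962 | 0.968 | 0.950 | 0.006 | 0.006 | -0.785 |
| | Mean Teacher | 0.965 | 0.964 | 0.961 | 0.004 | 0.004 | 0.261 |
| | UDA | 0.956 | 0.965 | 0.938 | 0.010 | 0.009 | -0.816 |
| | FixMatch | 0.957 | 0.974 | 0.927 | 0.012 | 0.009 | -0.902 |
| | FlexMatch | 0.937 | 0.973 | 0.889 | 0.011 | 0.017 | -0.975 |
| | FreeMatch | 0.936 | 0.972 | 0.811 | 0.036 | 0.056 | -0.752 |
| | SoftMatch | 0.961 | 0.974 | 0.939 | 0.012 | 0.012 | -0.862 |
| | UASD | 0.954 | 0.948 | 0.944 | 0.013 | 0.014 | 0.112 |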

Figure tab_10: 11
Type: table
Caption: Average Robustness of SSL Algorithms in Different Environments
Data:
| Environments | Expected Robustness ($\sigma_E$) | Worst-case Robustness ($\sigma_W$) |
|---|---|---|
| Inconsistent Data Distributions | 0.015 | 0.028 |
| Inconsistent Feature Spaces | 0.019 | 0.039 |
| Inconsistent Label Spaces | 0.020 | 0.044 |

Table 12: Average Robustness of SSL Algorithms
| Algorithms | Expected Robustness ($\sigma_E$) | Worst-case Robustness ($\sigma_W$) |
|---|---|---|
| SSGMM | 0.062 | 0.120 |
| TSVM | 0.017 | 0.040 |
| Label Propagation | 0.030 | 0.053 |
| Label Spreading | 0.021 | 0.045 |
| Tri-Training | 0.024 | 0.041 |
| Assemble | 0.009 | 0.017 |
| Pseudo Label | 0.008 | 0.014 |
| Pi-Model | 0.012 | 0.021 |
| Mean Teacher | 0.014 | 0.027 |
| VAT | 0.034 | 0.065 |
| ICT | 0.009 | 0.022 |
| UDA | 0.020 | 0.066 |
| FixMatch | 0.065 | 0.164 |
| FlexMatch | 0.055 | 0.143 |
| FreeMatch | 0.066 | 0.157 |
| SoftMatch | 0.067 | 0.154 |
| UASD | 0.003 | 0.002 |
| CAFA | 0.010 | 0.022 |
| MTCF | 0.077 | 0.118 |
| Fix-A-Step | 0.062 | 0.197 |

ordinary SSL algorithms. Therefore, when designing a robust SSL algorithm, we need to consider more comprehensive environments and evaluation metrics.

Figure tab_11: 
Type: table
Caption: $, \tilde{D}^w_{U_t}) + var(F, n_l + n^w_{u_t}, k_0, \delta_1)$
$+ \frac{n^w_{u_t}}{n_l + n^w_{u_t}} (\theta^w(t) Disc_L(P^w_t(x^*), Y_0) + \theta^w(t) Disc_F(\Pi_{Y_0}[P^w_t(x)], \Pi_{Y_0}[P^w_t(x^*)], map_{X_t \to X_0}, f) + \theta^w(t) Disc_D(\Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], P_0(x, y), f))$
$+ \frac{n^w_{u_t}}{n_l + n^w_{u_t}} (\hat{E}(h, D_L) + var(H, n_l, k, \delta_2) + var(H, n^w_{u_t}, k_0, \delta_3) + \theta^w(t) Disc_L(P^w_t(x^*), Y_0) + \theta^w(t) Disc_F(\Pi_{Y_0}[P_t(x)], \Pi_{Y_0}[P^w_t(x^*)], map_{X_t \to X_0}, h) + \theta^w(t) Disc_D(\Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], P_0(x, y), h)) \quad (23)$
Data: where $\hat{E}(f, \tilde{D}^w_{U_t})$ is the weighted disagreement rate between the noisy pseudo-labels and the prediction results of $f$ on the weighted unlabeled dataset $\tilde{D}^w_{U_t}$.
A.4 SSL ALGORITHMS EVALUATED IN THE BENCHMARK
A.4.1 STATISTICAL SSL ALGORITHMS
1. SSGMM (Shahshahani & Landgrebe

Figure tab_12: 
Type: table
Caption: no longer adheres to the paradigm of filtering examples through a confidence threshold, and instead replaces sample selection with sample weighting. The sample weights are utilized to achieve a better balance between the quantity and quality of pseudo-labeled data.
Data: A.4.3 ROBUST DEEP SSL ALGORITHMS
1. UASD (Chen et al., 2020) ensembles model predictions to produce probability predictions for unlabeled examples, and uses a confidence-based threshold to filter out OOD examples.
2. CAFA (Huang et al., 2021) takes into account both the inconsistency in label spaces and data distributions. It employs a scoring mechanism to filter out examples from new classes and then utilizes unsupervised domain adaptation to alleviate distribution inconsistency, thus obtaining higher-quality pseudo-labels.
3. MTCF (Yu et al., 2020) leverages the concept of curriculum learning. It uses a joint optimization framework, which updates the network parameters and the OOD score alternately to detect the OOD examples and achieve high classification performance simultaneously.
4. Fix-A-Step (Huang et al., 2023) views all OOD unlabeled examples as potentially helpful. It modifies gradient descent updates to prevent optimizing a multi-task SSL loss from hurting labeled-set accuracy.
A.5 EXPERIMENTS
A.5.1 DATASETS PREPARATION
Inconsistent Data Distribution
1. Wine, Iris, Letter: To construct datasets with inconsistent distributions, in each class, we calculate the center of all examples and sort these examples according to the distance between them and the center in ascending order. The first $n_c \cdot 0.5$ examples are used as labeled data which can be regarded as being obtained by sampling from $P_0(x, y)$, and the remaining $n_c \cdot 0.5$ examples are used as inconsistent unlabeled data. For each $t$, the $n_c \cdot 0.5 \cdot (t - \frac{1}{s})$ to $n_c \cdot 0.5 \cdot t$ examples are used as labeled data which can be regarded as being obtained by sampling from $P_t(x)$. $\theta(t) = 1$ for every $t$. 5 examples per class from source domain data are used as labeled data and the rest are used as test data.
2. Image-CLEF: The Image-CLEF dataset consists of 3 domains, which can be combined into 6 source-domain to target-domain pairs. All source domain data can be regarded as being obtained by sampling from $P_{source}(x, y)$ and all target domain data can be regarded as being obtained by sampling from $P_{target}(x, y)$. We set $P_0(x, y) = P_{source}(x, y)$ and $P_t(x, y) = P_{target}(x, y)$ for all $t \neq 0$. From the source-domain data, 100 examples are taken as labeled data. Half of the remaining source-domain examples are used as test data, while the other half is combined with the target-domain data to form an unlabeled dataset.
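The distance-based construction described for Wine, Iris, and Letter can be sketched as follows. This is one possible reading of the description, not the benchmark's actual code: within each class, the half of the examples closest to the class center is treated as the consistent (source) pool, and for a given $t$ a slice of the farther half is taken as the shifted unlabeled data. The function name, the meaning of `s` (assumed to be the number of inconsistency levels), and the rounding of slice boundaries are all assumptions.

```python
import numpy as np

def split_by_center_distance(X, y, t, s=5):
    """Return (source_idx, shifted_idx) under the assumed reading.

    Per class: sort by distance to the class center (ascending); the closer
    half forms the source pool, and the slice [(t - 1/s), t] of the farther
    half forms the shifted unlabeled examples for inconsistency level t.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    source, shifted = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        center = X[idx].mean(axis=0)
        order = idx[np.argsort(np.linalg.norm(X[idx] - center, axis=1))]
        half = len(order) // 2
        source.extend(order[:half])
        far = order[half:]
        lo = int(round(len(far) * max(t - 1.0 / s, 0.0)))
        hi = int(round(len(far) * t))
        shifted.extend(far[lo:hi])
    return np.array(source, dtype=int), np.array(shifted, dtype=int)

# Toy data: two classes with 20 examples each.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = np.repeat([0, 1], 20)
src_idx, shift_idx = split_by_center_distance(X_toy, y_toy, t=1.0, s=5)
```

Larger values of $t$ select examples that lie farther from the class center, so the shifted pool drifts away from the labeled distribution as $t$ grows.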

Figure tab_13: 
Type: table
Caption: From the source-domain data, 100 examples are taken as labeled data. Half of the remaining source-domain examples are used as test data, while the other half is combined with the target-domain data to form the unlabeled dataset. The total number of unlabeled examples is $n_u = \min(0.5 \cdot (n_s - 100), n_t)$, where $n_s$ is the number of examples in the source domain and $n_t$ is the number of examples in the target domain. $\theta(t) = t$ for every $t$. For inconsistency rate $t$, the unlabeled dataset combines $n_u \cdot (1 - t)$ examples from the source domain with $n_u \cdot t$ examples from the target domain.

Inconsistent Feature Space

1. Wine, Iris, Letter: 50% of all examples are used as source-domain data, and the rest are used as target-domain data. 5 examples per class of the source-domain data are used as labeled data, which can be regarded as sampled from $P_0(x, y)$, and the rest are used as test data. For every $t$, all target-domain data, with $t \cdot d$ randomly chosen features dropped, are used as unlabeled data, which can be regarded as sampled from $P_t(x, y)$. $\theta(t) = 1$ for every $t$.

2. CIFAR10, CIFAR100: 50% of all examples are used as source-domain data, which can be regarded as sampled from $P_{source}(x, y)$, and the rest are used as target-domain data. 20 examples per class of the source-domain data are used as labeled data. All target-domain images are converted to grey images by dropping 2 channels, which can be regarded as sampled from $P_{target}(x, y)$. We set $P_0(x, y) = P_{source}(x, y)$ and $P_t(x, y) = P_{target}(x, y)$ for all $t \neq 0$. $\theta(t) = t$ for every $t$. For inconsistency rate $t$, the unlabeled dataset combines $n_u \cdot (1 - t)$ examples from the source domain with $n_u \cdot t$ examples from the target domain.

3. Agnews: 50% of all examples are used as source-domain data, which can be regarded as sampled from $P_{source}(x, y)$, and the rest are used as target-domain data. 100 examples of the source domain are used as labeled data and the rest are used as test data. 50% of the target-domain sentences are used as IID examples, and the other 50%, with 50% of their tokens dropped, are used as OOD examples, which can be regarded as sampled from $P_{target}(x, y)$. We set $P_0(x, y) = P_{source}(x, y)$ and $P_t(x, y) = P_{target}(x, y)$ for all $t \neq 0$. The number of unlabeled examples is set to $n_u = \min(n_I / (1 - t), n_O / t)$, where $n_I$ and $n_O$ are the numbers of IID and OOD examples respectively. The unlabeled dataset combines $n_u \cdot (1 - t)$ IID examples with $n_u \cdot t$ OOD examples. $\theta(t) = t$ for every $t$.

Inconsistent Label Space

1. Wine, Iris, Letter: 50% of all examples are used as source-domain data, and the rest are used as target-domain data. $(k + 1) // 2$ classes of the source data are saved and the remaining examples are dropped; the saved data can be regarded as sampled from $P_{source}(x, y)$. 5 examples per class of the saved source-domain data are used as labeled data and the rest are used as test data. The target domain examples with saved classes are used as OOD examples, which can be regarded as sampled from $P_{target}(x, y)$, and the target examples with dropped classes are used as IID examples. The number of unlabeled examples is set to $n_u = \min(n_I / (1 - t), n_O / t)$, where $n_I$ and $n_O$ are the numbers of IID and OOD examples respectively. The unlabeled dataset combines $n_u \cdot (1 - t)$ IID examples with $n_u \cdot t$ OOD examples. We set $P_0(x, y) = P_{source}(x, y)$ and $P_t(x, y) = P_{target}(x, y)$ for all $t \neq 0$. $\theta(t) = t$ for every $t$.

2. CIFAR10, CIFAR100: $(k + 1) // 2$ classes of all examples are used as source-domain data, and the rest are used as target-domain data. 20 examples per class of the source domain are used as labeled data. For inconsistency rate $t$, the unlabeled dataset combines $n_t \cdot (1 - t)$ examples from the source domain with $n_t \cdot t$ examples from the target domain, where $n_t$ is the number of target-domain examples. We set $P_0(x, y) = P_{source}(x, y)$ and P
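The source/target mixing rule used in the first caption (100 labeled source examples, half of the remainder as test data, and $n_u = \min(0.5 \cdot (n_s - 100), n_t)$ unlabeled examples mixed at rate $t$) can be sketched as follows. This is a minimal illustration rather than the benchmark's actual code: the function name `split_with_inconsistency`, the shuffling, and the rounding of $n_u \cdot t$ to an integer are assumptions.

```python
import random


def split_with_inconsistency(source, target, t, n_labeled=100, seed=0):
    """Hypothetical helper: build labeled / test / unlabeled sets for rate t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = random.Random(seed)
    source = list(source)
    target = list(target)
    rng.shuffle(source)  # assumption: examples are drawn at random
    rng.shuffle(target)

    labeled = source[:n_labeled]
    rest = source[n_labeled:]
    half = len(rest) // 2
    test, source_pool = rest[:half], rest[half:]

    # n_u = min(0.5 * (n_s - 100), n_t)
    n_u = min(half, len(target))
    n_target = round(n_u * t)   # assumption: round to the nearest integer
    n_source = n_u - n_target   # n_u * (1 - t) source examples
    unlabeled = source_pool[:n_source] + target[:n_target]
    return labeled, test, unlabeled
```

With 300 source examples, 80 target examples, and $t = 0.25$, this yields 100 labeled, 100 test, and 80 unlabeled examples, 20 of which come from the target domain.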


Formulas:
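Formula formula_0: , Acc) $= \langle P_T, \mathrm{Acc} \rangle = \int_0^1 P_T(t)\,\mathrm{Acc}(t)\,dt$
Worst-Case Accuracy (WA): $\mathrm{WA}(\mathrm{Acc}) = \min_{t \in [0,1]} \mathrm{Acc}(t)$
Expected Variation Magnitude (EVM): $\mathrm{EVM}(\mathrm{Acc}) = \int_0^1 |\mathrm{Acc}'(t)|\,dt$
Variation Stability (VS): $\mathrm{VS}(\mathrm{Acc}) = \int_0^1 \left[\mathrm{Acc}'(t) - \left(\int_0^1 \mathrm{Acc}'(t)\,dt\right)\right]^2 dt$
Robust Correlation Coefficient (RCC): $\mathrm{RCC}(\mathrm{Acc}) = \dfrac{\int_0^1 \mathrm{Acc}(t) \cdot t\,dt - \frac{1}{2}\int_0^1 \mathrm{Acc}(t)\,dt}{\sqrt{\int_0^1 t^2\,dt - \frac{1}{4}} \cdot \sqrt{\int_0^1 \mathrm{Acc}^2(t)\,dt - \left(\int_0^1 \mathrm{Acc}(t)\,dt\right)^2}}$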
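In practice Acc(t) is only observed at a finite set of inconsistency rates, so the integrals in these metrics have to be approximated. The sketch below is one possible discretization, not the toolkit's implementation: the function names, the trapezoidal rule, and the use of `numpy.gradient` for Acc'(t) are assumptions. RCC is computed as the Pearson correlation between Acc and t with t uniform on [0, 1], and a uniform $P_T$ is used when none is given.

```python
import numpy as np


def _integrate(y, t):
    """Trapezoidal approximation of the integral of y over the grid t."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)


def rac_metrics(t, acc, p_t=None):
    """Hypothetical helper: RAC-based metrics from samples (t_i, Acc(t_i))."""
    t = np.asarray(t, dtype=float)
    acc = np.asarray(acc, dtype=float)
    if p_t is None:
        p_t = np.ones_like(t)          # assumption: uniform density on [0, 1]
    d_acc = np.gradient(acc, t)        # finite-difference estimate of Acc'(t)

    ea = _integrate(p_t * acc, t)      # <P_T, Acc>
    wa = float(acc.min())              # worst-case accuracy
    evm = _integrate(np.abs(d_acc), t)
    vs = _integrate((d_acc - _integrate(d_acc, t)) ** 2, t)

    mean_acc = _integrate(acc, t)
    mean_t = _integrate(t, t)
    cov = _integrate(acc * t, t) - mean_acc * mean_t
    var_acc = _integrate(acc ** 2, t) - mean_acc ** 2
    var_t = _integrate(t ** 2, t) - mean_t ** 2
    prod = var_acc * var_t
    # RCC is undefined for a perfectly flat curve (zero variance).
    rcc = float(cov / np.sqrt(prod)) if prod > 0 else float("nan")
    return {"EA": ea, "WA": wa, "EVM": evm, "VS": vs, "RCC": rcc}
```

For a curve that degrades linearly, for example Acc(t) = 0.9 - 0.4t, the accuracy falls steadily with t, so EVM equals the total drop (0.4), VS is zero, and RCC is -1.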

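Formula formula_1:
$$
\begin{aligned}
E(f, P_0(x, y) \mid h, w, \mathrm{map}_{X_t \to X_0}, D_L, D_{U_t}) \le{}& \frac{n_l}{n_l + n^w_{u_t}} \hat{E}(f, D_L) + \frac{n^w_{u_t}}{n_l + n^w_{u_t}} \hat{E}(f, D^w_{U_t}) + \mathrm{var}(F, n_l + n^w_{u_t}, k_0, \delta_1) \\
&+ \frac{n^w_{u_t}}{n_l + n^w_{u_t}} \Big( \theta^w(t)\,\mathrm{Disc}_L(P^w_t(x^*), Y_0) \\
&\qquad + \theta^w(t)\,\mathrm{Disc}_F(\Pi_{Y_0}[P^w_t(x)], \Pi_{Y_0}[P^w_t(x^*)], \mathrm{map}_{X_t \to X_0}, f) \\
&\qquad + \theta^w(t)\,\mathrm{Disc}_D(\Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], P_0(x, y), f) \Big) \\
&+ \frac{n^w_{u_t}}{n_l + n^w_{u_t}} \Big( \hat{E}(h, D_L) + \mathrm{var}(H, n_l, k_0, \delta_2) + \mathrm{var}(H, n^w_{u_t}, k_0, \delta_3) \\
&\qquad + \theta^w(t)\,\mathrm{Disc}_L(P^w_t(x^*), Y_0) \\
&\qquad + \theta^w(t)\,\mathrm{Disc}_F(\Pi_{Y_0}[P^w_t(x)], \Pi_{Y_0}[P^w_t(x^*)], \mathrm{map}_{X_t \to X_0}, h) \\
&\qquad + \theta^w(t)\,\mathrm{Disc}_D(\Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], P_0(x, y), h) \Big)
\end{aligned}
\tag{1}
$$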

Formula formula_2: $\mathrm{var}(H, n, k, \delta) = \dfrac{16\,\mathrm{Ndim}(H) \ln(\sqrt{2}\,nk) + 8 \ln\frac{2}{\delta}}{n}$ (2)

Formula formula_3: $X^* \times Y^* \subseteq \mathbb{R}^{d^*} \times \{0, \dots, k^* - 1\}$.

Formula formula_4: $P_t(x) = \sum_{y_i \in Y_t} \big(P_t(y_i)\,P_t(x^* \mid y_i)\big)\,p_t(x \mid x^*)$

Formula formula_5: $\theta^w(t) = \dfrac{\sum_{i=(1-\theta(t))n_u+1}^{n_u} w(x_i)}{\sum_{i=1}^{n_u} w(x_i)}$
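The weighted inconsistency rate $\theta^w(t)$ is the share of the total weight $w(x_i)$ that falls on the inconsistent unlabeled examples, which the formula indexes as the last $\theta(t)\,n_u$ of the $n_u$ examples. A minimal sketch under that ordering assumption follows; the function name and the rounding of $(1 - \theta(t))\,n_u$ to an integer are assumptions.

```python
def weighted_inconsistency_rate(weights, theta):
    """Hypothetical helper: theta^w(t) for weights ordered consistent-first.

    weights: w(x_1), ..., w(x_{n_u}); the last theta * n_u entries are
             assumed to belong to the inconsistent examples.
    theta:   the unweighted inconsistency rate theta(t) in [0, 1].
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    n_u = len(weights)
    n_consistent = round((1.0 - theta) * n_u)  # assumption: nearest integer
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # 1-indexed range (1 - theta) * n_u + 1 .. n_u is the slice [n_consistent:]
    return sum(weights[n_consistent:]) / total
```

With uniform weights the function returns $\theta(t)$ itself; down-weighting the inconsistent examples lowers $\theta^w(t)$, for example six weights of 1.0 followed by four weights of 0.5 give 2/8 = 0.25 instead of 0.4.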

Formula formula_6: $= \sigma^2(\mathrm{Acc}') = \int_0^1 \left[\mathrm{Acc}'(t) - E(\mathrm{Acc}')\right]^2 dt = \int_0^1 \left[\mathrm{Acc}'(t) - \left(\int_0^1 \mathrm{Acc}'(t)\,dt\right)\right]^2 dt$ (8)

Formula formula_7: $\rho(X, Y) = \dfrac{\mathrm{COV}(X, Y)}{\sigma(X)\,\sigma(Y)} = \dfrac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}$ (9)

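Formula formula_8:
$$
\begin{aligned}
\mathrm{RCC}(\mathrm{Acc}) = \rho(\mathrm{Acc}, t) &= \frac{E(\mathrm{Acc} \cdot t) - E(\mathrm{Acc})E(t)}{\sqrt{E(\mathrm{Acc}^2) - E^2(\mathrm{Acc})}\,\sqrt{E(t^2) - E^2(t)}} \\
&= \frac{\int_0^1 \mathrm{Acc}(t) \cdot t\,dt - \int_0^1 \mathrm{Acc}(t)\,dt \int_0^1 t\,dt}{\sqrt{\int_0^1 \mathrm{Acc}^2(t)\,dt - \left(\int_0^1 \mathrm{Acc}(t)\,dt\right)^2} \cdot \sqrt{\int_0^1 t^2\,dt - \left(\int_0^1 t\,dt\right)^2}} \\
&= \frac{\int_0^1 \mathrm{Acc}(t) \cdot t\,dt - \frac{1}{2}\int_0^1 \mathrm{Acc}(t)\,dt}{\sqrt{\int_0^1 \mathrm{Acc}^2(t)\,dt - \left(\int_0^1 \mathrm{Acc}(t)\,dt\right)^2} \cdot \sqrt{\int_0^1 t^2\,dt - \frac{1}{4}}}
\end{aligned}
\tag{10}
$$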

Formula formula_9: $E(h, P_0(x, y)) \le \hat{E}(h, D_L) + \mathrm{var}(H, n_l, k, \delta_1)$ (11)

Formula formula_10: $\hat{E}(h, D_{U_t}) \le E(h, P_t(x, y)) + \mathrm{var}(H, n_u, k, \delta_2) = E(h, P_0(x, y)) + \mathrm{var}(H, n_u, k, \delta_2)$ (12)

Formula formula_11: $(1 - \delta_1)(1 - \delta_2)$: $\hat{E}(h, D_{U_t}) \le \hat{E}(h, D_L) + \mathrm{var}(H, n_l, k, \delta_1) + \mathrm{var}(H, n_u, k, \delta_2)$ (13)

Formula formula_12: $E(h, P_t(x, y)) \le E(h, P_0(x, y)) + \left| P_{x, y \sim P_0(x, y)}(h(x) \neq y) - P_{x, y \sim P_t(x, y)}(h(x) \neq y) \right| = E(h, P_0(x, y)) + \mathrm{Disc}(h, P_0(x, y), P_t(x, y))$ (14)

Formula formula_13: $\le \hat{E}(h, D_L) + \mathrm{var}(H, n_l, k, \delta_1) + \mathrm{var}(H, n_u, k, \delta_2) + \mathrm{Disc}_D(P_0(x, y), P_t(x, y), h)$ (15)

Formula formula_14: $\le \hat{E}(h, D_L) + \mathrm{var}(H, n_l, k, \delta_1) + \mathrm{var}(H, n^w_{u_t}, k, \delta_2) + \mathrm{Disc}_D(P_0(x, y), P^w_t(x, y), h)$ (16)

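Formula formula_15: $(1 - \delta_1)(1 - \delta_2)$:
$$
\begin{aligned}
\hat{E}(h, w, \mathrm{map}_{X_t \to X_0}, D_{U_t}) \le{}& \hat{E}(h, D_L) + \mathrm{var}(H, n_l, k_0, \delta_1) + \mathrm{var}(H, n^w_{u_t}, k_0, \delta_2) \\
&+ \theta^w(t)\,\mathrm{Disc}_L(P^w_t(x^*), Y_0) \\
&+ \theta^w(t)\,\mathrm{Disc}_F(\Pi_{Y_0}[P^w_t(x)], \Pi_{Y_0}[P^w_t(x^*)], \mathrm{map}_{X_t \to X_0}, h) \\
&+ \theta^w(t)\,\mathrm{Disc}_D(P_0(x, y), \Pi_{X_0}[\Pi_{Y_0}[P^w_t(x^*, y)]], h)
\end{aligned}
\tag{17}
$$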

Formula formula_16: $\mathrm{Mix}_\alpha(D_1, D_2) = \alpha D_1 + (1 - \alpha) D_2$ (18)
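One way to read $\mathrm{Mix}_\alpha(D_1, D_2)$ is as a mixture distribution: each draw comes from $D_1$ with probability $\alpha$ and from $D_2$ otherwise. The sketch below illustrates that reading on finite samples. It is an interpretation for illustration only; the function name `sample_mix` and the representation of each distribution as a list of examples are assumptions.

```python
import random


def sample_mix(d1, d2, alpha, n, seed=0):
    """Hypothetical helper: draw n examples from alpha * D1 + (1 - alpha) * D2.

    d1, d2: non-empty lists standing in for samples from D1 and D2.
    alpha:  mixing coefficient in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    # Each draw picks the component first, then an example from it.
    return [
        rng.choice(d1) if rng.random() < alpha else rng.choice(d2)
        for _ in range(n)
    ]
```

Setting `alpha = n_l / (n_l + n_u)` gives a mixture in which labeled and unlabeled data are represented in proportion to their sample sizes.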

Formula formula_17: $\dfrac{n_{u_t}}{n_l + n_{u_t}} \hat{E}(h, D_{U_t})$.

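Formula formula_18:
$$
\begin{aligned}
E\big(f, \mathrm{Mix}_{\frac{n_l}{n_l + n_{u_t}}}(P_0(x, y), P_t(x, y)) \mid h, D_L, D_U\big) \le{}& \frac{n_l}{n_l + n_{u_t}} \hat{E}(f, D_L) + \frac{n_{u_t}}{n_l + n_{u_t}} \hat{E}(f, D_U) + \mathrm{var}(F, n_l + n_{u_t}, k, \delta_3) \\
&+ \frac{n_{u_t}}{n_l + n_{u_t}} \hat{E}(h, D_U)
\end{aligned}
\tag{19}
$$
$\le \dfrac{n_l}{n_l + n^w_{u_t}} \hat{E}(f, D_L) + \dfrac{n^w_{u_t}}{n_l + n^w_{u_t}} \hat{E}(f$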

