Title: Leveraging sparse and shared feature activations for disentangled representation learning

Abstract: Research on recovering the latent factors of variation of high dimensional data has so far focused on simple synthetic settings. Mostly building on unsupervised and weakly-supervised objectives, prior work missed out on the positive implications for representation learning on real world data. In this work, we propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation. Assuming that each supervised task only depends on an unknown subset of the factors of variation, we disentangle the feature space of a supervised multi-task model, with features activating sparsely across different tasks and information being shared as appropriate. Importantly, we never directly observe the factors of variations, but establish that access to multiple tasks is sufficient for identifiability under sufficiency and minimality assumptions. We validate our approach on six real world distribution shift benchmarks, and different data modalities (images, text), demonstrating how disentangled representations can be transferred to real settings.

Section: Introduction
A fundamental question in deep learning is how to learn meaningful and reusable representations from high dimensional data observations [8,75,78,77]. A core area of research pursuing this question is disentangled representation learning (DRL) [56,8,33], where the aim is to learn a representation which recovers the factors of variation (FOVs) underlying the data distribution. Disentangled representations are expected to contain all the information present in the data in a compact and interpretable structure [46,16] while being independent from a particular task [29]. It has been argued that separating information into interventionally independent factors [78] can enable robust downstream predictions, which was partially validated in synthetic settings [19,58]. Unfortunately, these benefits did not materialize in real world representation learning problems, largely due to the limited scalability of existing approaches.
In this work we focus on leveraging knowledge from different task objectives to learn better representations of high dimensional data, and explore the link with disentanglement and out-of-distribution (OOD) generalization on real data distributions. Representations learned from a large diversity of tasks are indeed expected to be richer and generalize better to new, possibly out-of-distribution, tasks. However, this is not always the case, as different tasks can compete with each other and lead to weaker models. This phenomenon, known as negative transfer [61,91] in the context of transfer learning or task competition [83] in multitask learning, happens when a limited capacity model is used to learn two different tasks that require expressing high feature variability and/or coverage. Aiming to use the same features for different objectives makes them noisy and often increases the sensitivity to spurious correlations [35,27,7], as features can be both predictive and detrimental for different tasks. Instead, we leverage a diverse set of tasks and assume that each task only depends on an unknown subset of the factors of variation. We show that disentangled representations naturally emerge without any annotation of the factors of variations under the following two representation constraints:
• Sparse sufficiency: Features should activate sparsely with respect to tasks. The representation is sparsely sufficient in the sense that any given task can be solved using few features.
• Minimality: Features are maximally shared across tasks whenever possible. The representation is minimal in the sense that features are encouraged to be reused, i.e., duplicated or split features are avoided.
These properties are intuitively desirable to obtain features that (i) are disentangled w.r.t. the factors of variation underlying the task data distribution (which we also theoretically argue in Proposition 2.1), (ii) generalize better in settings where test data undergo distribution shifts with respect to the training distributions, and (iii) suffer less from problems related to negative transfer phenomena. To learn such representations in practice, we implement a meta learning approach, enforcing feature sufficiency and sharing with a sparsity regularizer and an entropy based feature sharing regularizer, respectively, incorporated in the base learner. Experimentally, we show that our model learns meaningful disentangled representations that enable strong generalization on real world data sets. Our contributions can be summarized as follows:
• We demonstrate that it is possible to learn disentangled representations leveraging knowledge from a distribution of tasks. For this, we propose a meta learning approach to learn a feature space from a collection of tasks while incorporating our sparse sufficiency and minimality principles, which allow task specific features to coexist with general features.
• Following previous literature, we test our approach on synthetic data, validating in an idealized controlled setting that our sufficiency and minimality principles lead to disentangled features w.r.t. the ground truth factors of variation, as expected from our identifiability result in Proposition 2.1.
• We extend our empirical evaluation to non-synthetic data where factors of variations are not known, and show that our approach generalizes well out-of-distribution on different domain generalization and distribution shift benchmarks.

Section: Method
Given a distribution of tasks t ∼ T and data (x t , y t ) ∼ P t for each task t, we aim to learn a disentangled representation g(x) = ẑ ∈ Ẑ ⊆ R M , which generalizes well to unseen tasks. We learn this representation g by imposing the sparse sufficiency and minimality inductive biases.

Section: Learning sparse and shared features
Our architecture (see Figure 1) is composed of a backbone module g θ that is shared across all tasks and a separate linear classification head f ϕt , which is specific to each task t. The backbone is responsible for computing and learning a general feature representation for all classification tasks. The linear head solves a specific classification problem for the task-specific data (x t , y t ) ∼ P t in the feature space Ẑ while enforcing the feature sufficiency and minimality principles. Adopting the typical meta-learning setting [34], the backbone module g θ can be viewed as the meta learner while the task-specific classification heads f ϕt can be viewed as the base learners. In the meta-learning setting we assume access to samples for a new task given by a support set U , with elements (x U , y U ) ∈ U . These samples are used to fit the linear head f ϕ * , leading to the optimal feature weights for the given task. For a query x Q ∈ Q, the prediction is obtained by computing the forward pass ŷ = f ϕ * (g θ (x Q )).
[Figure 1: Architecture. Support samples x U are mapped by g θ to ẑ U and by f ϕ to ŷ U , which enter L inner ; query samples x Q are mapped by g θ to ẑ Q and by the fitted head f ϕ * to ŷ Q , which enter L outer .]
Enforcing feature minimality and sufficiency. To solve a task in the feature space Ẑ of the backbone module we impose the following regularizer Reg(ϕ) on the classification heads f ϕ with parameters ϕ ∈ R T ×M ×C , where T is the number of tasks, M the number of features, and C the number of classes. The regularizer is responsible for enforcing feature minimality and sufficiency:
\mathrm{Reg}(\phi) = \alpha\, \mathrm{Reg}_{L_1}(\phi) + \beta\, \mathrm{Reg}_{\mathrm{sharing}}(\phi) \qquad (1)
with scalar weights α and β. The penalty terms are defined by:
\mathrm{Reg}_{L_1}(\phi) = \frac{1}{TC} \sum_{t,c,m} |\phi_{t,m,c}| \qquad (2)
\mathrm{Reg}_{\mathrm{sharing}}(\phi) = H(\tilde{\phi}_m) = -\sum_{m} \tilde{\phi}_m \log(\tilde{\phi}_m) \qquad (3)
where
\tilde{\phi}_m = \frac{\frac{1}{TC} \sum_{t,c} |\phi_{t,m,c}|}{\sum_{t,c,m} |\phi_{t,m,c}|}
are the normalized classifier parameters. Sufficiency is enforced by a sparsity regularizer given by the L 1 -norm, which constrains each classification head to use only a sparse subset of the features. Minimality is enforced by the feature sharing term: minimizing the entropy of the distribution of feature importances (i.e. normalized |ϕ t |) averaged across a mini batch of T tasks leads to a more peaked distribution of activations across tasks. This forces features to cluster across tasks and therefore to be reused by different tasks when useful. We remark that other choices of regularizers from the linear multitask learning literature (e.g. [59,39,38]) are possible for enforcing sparse sufficiency and minimality. We leave their exploration as a future direction.
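As a rough illustration of Eqs. (1)-(3), the following NumPy sketch evaluates both penalty terms for a batch of linear heads. The function names and the small constant `eps` are assumptions introduced here for numerical safety; they are not taken from the paper. Note that, as defined above, the importances sum to 1/(TC) rather than 1, so the sharing term is an affine rescaling of the entropy of the properly normalized importances and is still smaller for more peaked distributions.

```python
import numpy as np

def reg_l1(phi):
    """Eq. (2): sum of absolute head weights divided by T * C.
    phi has shape (T, M, C): tasks x features x classes."""
    T, M, C = phi.shape
    return np.abs(phi).sum() / (T * C)

def feature_importance(phi, eps=1e-12):
    """Per-feature importance as defined after Eq. (3):
    (1 / (T*C)) * sum_{t,c} |phi[t, m, c]|, divided by sum_{t,m,c} |phi[t, m, c]|."""
    T, M, C = phi.shape
    per_feature = np.abs(phi).sum(axis=(0, 2)) / (T * C)
    return per_feature / (np.abs(phi).sum() + eps)  # eps is an assumption (avoids 0/0)

def reg_sharing(phi, eps=1e-12):
    """Eq. (3): -sum_m p_m * log(p_m) over the feature importances."""
    p = feature_importance(phi, eps)
    return -np.sum(p * np.log(p + eps))

def reg(phi, alpha, beta):
    """Eq. (1): weighted sum of the sparsity and sharing penalties."""
    return alpha * reg_l1(phi) + beta * reg_sharing(phi)
```

For example, heads that all rely on the same single feature yield a lower sharing penalty than heads that spread their weight uniformly over all features, which is the behavior the minimality principle asks for.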

Section: Training method
We train the model in meta-learning fashion by minimizing the test error over the expectation of the task distribution t ∼ T . This can be formalized as a bi-level optimization problem. The optimal backbone model g θ * is given by the outer optimization problem:
\min_{\theta}\ \mathbb{E}_{t}\left[ L_{\mathrm{outer}}\big( f_{\phi^*}(g_{\theta}(x^{Q}_{t})),\ y^{Q}_{t} \big) \right], \qquad (4)
where f ϕ * are the optimal classifiers obtained from solving the inner optimization problem, and (x Q t , y Q t ) ∈ Q t is a test (or query) datum from the query set Q t for task t. Let U t be the support set with samples (x U t , y U t ) ∈ U t for task t, where typically the support set is distinct from the query set, i.e., U t ∩ Q t = ∅. The optimal classifiers f ϕ * are given by the inner optimization problem:
\min_{\phi}\ \frac{1}{T} \sum_{t} L_{\mathrm{inner}}(\hat{y}^{U}_{t}, y^{U}_{t}) + \mathrm{Reg}(\phi), \qquad (5)
where ŷ U t = f ϕ (g θ (x U t )). For both the inner loss L inner and the outer loss L outer we use the cross entropy loss.
Task generation. Our method can be applied in a standard supervised classification setting where we construct the tasks on the fly as follows. We define a task t as a C-way classification problem. We first select a random subset of C classes from a training domain D train which contains K train classes. For each class we consider the corresponding data points and select a random support set U t with elements (x U t , y U t ) ∈ U t and a disjoint random query set Q t with elements (x Q t , y Q t ) ∈ Q t .
Algorithm. In practice we solve the bi-level optimization problem (4) and (5) as follows. In each iteration we sample a batch of T tasks with the associated support and query sets as described above. First, we use the samples from the support set U t to fit the linear heads f ϕ by solving the inner optimization problem (5) using stochastic gradient descent for a fixed number of steps. Second, we use the samples from the query set Q t to update the backbone g θ by solving the outer optimization problem (4) using implicit differentiation [11,31]. Since the optimal solution of the linear heads ϕ * depends on the backbone g θ , a straightforward differentiation w.r.t. θ is not possible. We remedy this issue by using the approximation strategy of [28] to compute the implicit gradients. The algorithm is summarized in Section B.1 of the Appendix.
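The task generation and inner-loop steps can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: `sample_task` and `fit_head` are hypothetical names, the head is fitted for a single task with full-batch subgradient descent on cross entropy plus the L 1 term only (the sharing term needs a batch of tasks), and the outer update of the backbone via implicit differentiation is omitted entirely.

```python
import numpy as np

def sample_task(labels, n_way, n_support, n_query, rng):
    """Sample one C-way task: pick n_way classes, then disjoint support and
    query sets. Returns two arrays of (data index, task-local label) rows."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for task_label, cls in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        support += [(i, task_label) for i in idx[:n_support]]
        query += [(i, task_label) for i in idx[n_support:n_support + n_query]]
    return np.array(support), np.array(query)

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def fit_head(z, y, n_way, alpha, steps=200, lr=0.1):
    """Fit a linear head on fixed features z (n, M) for a fixed number of
    steps, minimizing cross entropy + alpha * (1/C) * sum |phi| (T = 1)."""
    n, m = z.shape
    phi = np.zeros((m, n_way))
    onehot = np.eye(n_way)[y]
    for _ in range(steps):
        probs = softmax(z @ phi)
        grad_ce = z.T @ (probs - onehot) / n       # cross-entropy gradient
        grad_l1 = alpha / n_way * np.sign(phi)     # L1 subgradient
        phi -= lr * (grad_ce + grad_l1)
    return phi
```

In a full training loop, the features `z` would be the backbone outputs g θ (x U t ) on the support set, and the fitted heads would then be evaluated on the query set to form the outer loss.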

Section: Theoretical analysis
We analyze the implications of the proposed minimality and sparse sufficiency principles and show in a controlled setting that they indeed lead to identifiability. As outlined in Figure 2, we assume that there exists a set of independent latent factors z ∼ ∏_{i=1}^{d} p(z i ) that generate the observations via an unknown mixing function x = g * (z). Additionally, we assume that the labels y t for a task t only depend on a subset of the factors indexed by S t ∼ p(S), where S is an index set on z ∈ Z, via some unknown mixing function y t = f * t (z) (potentially different for different tasks). We formalize the two principles that are imposed on f * by:
1. sufficiency: f^{*}_{t} = f^{*}_{t}|_{S_t} \ \text{for}\ S_t \sim p(S)
2. minimality: \nexists\, S' \neq S_t \subset S \ \text{s.t.}\ f^{*}_{t}|_{S'} = f^{*}_{t},
where f | St denotes that the input to a function f is restricted to the index set given by S t (all remaining entries are set to zero). (1) states that f * t only uses a subset of the features, and (2) states that there are no duplicate features.
Proposition 2.1. Assume that g * is a diffeomorphism (smooth with smooth inverse), f * satisfies the sufficiency and minimality properties stated above, and p(S) satisfies:
p(S \cap S' = \{i\}) > 0 \quad \text{or} \quad p(\{i\} \in (S \cup S') - (S' \cap S)) > 0.
Observing unlimited data from p(X, Y ), it is possible to recover a representation ẑ that is an axis aligned, component wise transformation of z.
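To build intuition for the condition on p(S), the sketch below checks an empirical analogue on a finite list of task supports. It is only an illustration under assumptions: it reads the condition literally (for a factor i, some pair of supports either intersects exactly in {i}, or contains i in its union but not in its intersection), it replaces "positive probability" by "occurs at least once in the list", and the function names are hypothetical.

```python
from itertools import combinations

def covered_factors(supports):
    """Factors i for which some pair (S, S') in `supports` satisfies
    S ∩ S' == {i}, or i ∈ (S ∪ S') − (S ∩ S')."""
    covered = set()
    for s, s_prime in combinations([frozenset(s) for s in supports], 2):
        inter = s & s_prime
        if len(inter) == 1:
            covered |= inter               # first clause: S ∩ S' = {i}
        covered |= (s | s_prime) - inter   # second clause: symmetric difference
    return covered

def satisfies_condition(supports, num_factors):
    """True if every factor index 0..num_factors-1 is covered by some pair."""
    return covered_factors(supports) >= set(range(num_factors))
```

For instance, two supports {0, 1} and {1, 2} over three factors cover all factors under this reading, whereas repeating the same support {0, 1} twice covers none, since the pair can never tell its two factors apart.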
Remarks: Overall, we see this proposition as validation that in an idealized setting our inductive biases are sufficient to recover the factors of variation. Note that the proof is non-constructive and does not entail a specific method. In practice, we rely on the same constraints as inductive biases that lead to this theoretical identifiability and experimentally show that disentangled representations emerge in controlled synthetic settings. On real data, (1) we cannot directly measure disentanglement, (2) a notion of global ground-truth factors may even be ill-posed, and (3) the assumptions of Proposition 2.1 are likely violated. Still, sparse sufficiency and minimality yield some meaningful factorization of the representation for the considered tasks.
Relation to [47] and [58]: Our theoretical result can be connected with concurrent work [47] and can be seen as a corollary with a different proof technique and slightly relaxed assumptions. The main difference is that our feature minimality allows us to also cover the case where the number of factors of variation is unknown, which we found critical in real world data sets (the main focus of our paper). Instead, they only assume sparse sufficiency, which is enough for identifiability if the ground-truth number of factors is known, but is not enough to recover high disentanglement when this is not the case (see Figure 3) and does not translate well to real data (see Table 16 with the empirical comparison in Appendix D.8). Interestingly, their analysis also hints at the fact that our approach benefits in terms of sample complexity on transfer learning downstream tasks. Our proof technique follows the general construction developed for multi-view data in [58], adapted to our different setting. Instead of observing multiple views with shared factors of variation, we observe a single task that only depends on a subset of the factors.

Section: Related work
Learning from multiple tasks and domains. Our method addresses the problem of learning a general representation across multiple and possibly unseen tasks [15,103] and environments [105,32,44,97,63,94,64] that may be competing with each other during training [61,91,83]. Prior research tackled task competition by introducing task specific modules that do not interact during training [67,101,80]. While successfully learning specialized modules, these approaches can not leverage synergistic information between tasks, when present. On the other hand, our approach is closer to multi-task methods that aim at learning a generalist model, leveraging multi-task interactions [106,5]. Other approaches that leverage a meta-learning objective for multi-task learning have been formulated [18,81,50,9]. In particular, [50] proposes to learn a generalist model in a few-shot learning setting without explicitly favoring feature sharing, nor sparsity. Instead, we rephrase the multi-task objective function encoding both feature sharing and sparsity to avoid task competition.
Similar to prior work in domain generalization, we assume the existence of stable features for a given task [64,4,86,40,90] and amortize the learning over the multiple environments. Unlike prior work, we do not aim to learn an invariant representation a priori. Instead, we learn sufficient and minimal features for each task, which are selected at test time by fitting the linear head on them. In light of [32], one can interpret our approach as learning the final classifier using empirical risk minimization but over features learned with information from the multiple domains.
Disentangled representations. Disentangled representation learning [8,33] aims at recovering the factors of variation underlying a given data distribution. [56] proved that without any form of supervision (whether direct or indirect) on the factors of variation (FOVs) it is not possible to recover them. Much work has then focused on identifiable settings [58,25] from non-i.i.d. data, even allowing for latent causal relations between the factors. Different approaches can be largely grouped in two categories. First, data may be non-independently sampled, for example assuming sparse interventions or sparse latent dynamics [30,55,13,100,2,79,48]. Second, data may be non-identically distributed, for example being clustered in annotated groups [37,41,82,95,60]. Our method follows the latter, but we do not make assumptions on the factor distribution across tasks (only on their relevance in terms of sufficiency and minimality). This is also reflected in our method, as we train for supervised classification as opposed to the contrastive or unsupervised learning common in the disentanglement literature. The only exception is the work of [47] discussed in Section 2.3.

Section: Experiments
We start by highlighting the experimental setup of this paper along with its motivation.
Synthetic experiments. We first evaluate our method on benchmarks from the disentanglement literature [62,14,71,49] where we have access to ground-truth annotations and can assess quantitatively how well we learn disentangled representations. We further investigate how minimality and feature sharing are correlated with disentanglement measures (Section 4.1) and how well our representations, which are learned from a limited set of tasks, generalize to compositions of those tasks. The purpose of these experiments is to validate our theoretical statement, showing that if the assumptions of Proposition 2.1 hold, our method quantitatively recovers the factors of variation.
Domain generalization. On real data sets, we can neither quantitatively measure disentanglement nor are we guaranteed identifiability (as assumptions may be violated). Ultimately, the goal of disentangled representations is to learn features that lend themselves to be easily and robustly transferred to downstream tasks. Therefore, we first evaluate the usefulness of our representations with respect to downstream tasks subject to distribution shifts, where isolating spurious features was found to improve generalization in synthetic settings [19,58]. To assess how robust our representations are to distribution shifts, we evaluate our method on domain generalization and domain shift tasks on six different benchmarks (Section 4.2). In a domain generalization setting, we do not have access to samples coming from the testing domain, which is considered to be OOD w.r.t. the training domains. However, in order to solve a new task, our method relies on a set of labeled data at test time to fit the linear head on top of the feature space. Our strategy is to sample data points from the training distribution, balanced by class, assuming that the label set Y does not change in the testing domain, although its distribution may undergo subpopulation shifts.
Few-shot transfer learning. Lastly, we test the adaptability of the feature space to new domains with limited labeled samples. For transfer learning tasks, we fit a linear head using the available limited supervised data. The sparsity penalty α is set to the value used in training; the feature sharing parameter β is defaulted to zero unless specified.
Experimental setting. To have a fair comparison with other methods in the literature, we adopt the standard experimental setting of prior work [32,44]. Hyperparameters α and β are tuned by performing model selection on a validation set, unless specified otherwise. For comparison with baselines, we substitute our backbone with that of the baseline (e.g. for ERM models, we detach the classification head) and then fit a new linear head on the same data. The linear head module trained at test time on top of the features is the same for our method and the compared methods. Despite its simplicity, we report the ERM baseline for comparison in our experiments in the main paper, since it has been shown to perform best on average on domain generalization benchmarks [32,44]. We further compare with other consolidated approaches in the literature such as IRM [4], CORAL [85] and GroupDRO [73] and include a large and comprehensive comparison with [99,10,51,53,26,54,65,102,36,45] in Appendix D.4. Experimental details are fully described in Appendix C.

Section: Synthetic experiments
We start by demonstrating that our approach is able to recover the factors of variation underlying a synthetic data distribution like [62]. For these experiments, we assume to have partial information on a subset of factors of variation Z, and we aim to learn a representation ẑ that aligns with them while ignoring any spurious factors that may be present. We sample random tasks from a distribution T (see Appendix C.3 for details) and focus on binary tasks, with Y = {0, 1}. For the DSprites dataset an example of a valid task is "There is a big object on the left of the image". In this case, the partially observed factors (quantized to only two values) are the x position and size. In Table 1, we show how the feature sufficiency and minimality properties enable disentanglement in the learned representations. We train two identical models, with and without the regularizers, on a random distribution of sparse tasks defined on FOVs, showing that, for different datasets [62,14,49,71], the model without regularizers achieves a similar in-distribution (ID) accuracy, but a much lower disentanglement.
Figure 3: Role of minimality. We plot the DCI metric of a set of models (red dots) trained on fixed tasks from DSprites. Training without regularizers leads to no disentanglement (green). Enforcing sparsity alone (yellow, akin to [47]) achieves good disentanglement (DCI = 71.9), but some features may be split or duplicated. Enforcing both minimality and sparse sufficiency (magenta) attains the best DCI (98.8). When β is too high (> 0.25), activated features collapse into few clusters with respect to tasks. For complete results and experiments on additional datasets see Table 8 and Figures 6 and 7 in the Appendix.
We then randomly draw and fix 2 groups of tasks with supports S 1 , S 2 (18 tasks in total), which all have support on two FOVs, |S 1 | = |S 2 | = 2. The groups share one factor of variation and differ in the other one, i.e. S 1 ∩ S 2 = {i} for some index i. The data in these tasks are subject to spurious correlations, i.e. FOVs not in the task support may be spuriously correlated with the task label. We start from an overestimate of 6 for the dimension of z, trying to recover z of size 3. We train our network to solve these tasks, enforcing sufficiency and minimality on the representation with different regularization strengths. In Figure 3, we show how the alignment of the learned features with the ground truth factors of variation depends on the choice of α and β, going from no disentanglement (DCI = 27.8) to good alignment as we enforce more sufficiency and minimality. The model that attains the best alignment (DCI = 98.8) uses both sparsity and feature sharing. Sufficiency alone (akin to the method of [47]) is able to select the right support for each task, but features are split or duplicated, attaining lower disentanglement (DCI = 71.9). The feature sharing penalty induces clustering in the feature space w.r.t. tasks, which yields high disentanglement, although it leads to failure cases when β is too high (β > 0.25).
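To make the task construction concrete, the following sketch builds binary tasks whose labels depend only on the factors in a given support, in the spirit of the "big object on the left" example. The labeling rule (a conjunction of thresholded factors) and the name `binary_task_labels` are assumptions for illustration; the actual task distribution is the one specified in Appendix C.3.

```python
import numpy as np

def binary_task_labels(factors, support, thresholds):
    """Label is 1 iff every factor in `support` exceeds its threshold.
    factors: array of shape (n_samples, n_factors), quantized to two values
    by the thresholds. Factors outside `support` never affect the label."""
    active = [factors[:, i] > thresholds[i] for i in support]
    return np.logical_and.reduce(active).astype(int)

# Two task supports that share exactly one factor: S1 ∩ S2 = {1}.
S1, S2 = (0, 1), (1, 2)
```

With three ground-truth factors, a task on S1 ignores factor 2 and a task on S2 ignores factor 0, so only the shared factor 1 is used by both groups.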
Table 1: Enforcing disentanglement: DCI [22] disentanglement scores and ID accuracy on test samples for a model trained without enforcing sufficiency and minimality (top row), and a model with the regularizers activated (bottom row). While attaining similar accuracy, the model with the activated regularizers always shows higher disentanglement. See Table 7 for additional scores.
Disentanglement and minimality are correlated. In the synthetic setting, we also show the role of the feature sharing penalty. Minimizing the entropy of feature activations across mini-batches of tasks results in clusters in the feature space. We investigate how the strength of this penalty correlates with disentanglement metrics [22] by training different models on DSprites which differ by the value of β. For 15 models trained while increasing β linearly from 0 to 0.2, we observe a correlation coefficient of 94.7 between β and the DCI metric of the representations computed by each model, showing that the feature sharing property strongly encourages disentanglement. This confirms again that sufficiency alone (i.e. enforcing sparsity) is not enough to attain good disentanglement.
Task compositional generalization. Finally, we evaluate the generalization capabilities of the features learned by our method by testing our model on a set of unseen tasks obtained by combining tasks seen during training. To do this, we first train two models on the AbstractDSprites dataset using a random distribution of tasks, where we limit the support of each task to two factors (i.e. |S| = 2). The models differ in activating/deactivating the regularizers on the linear heads. Then, we test on 100 tasks drawn from a distribution with increasing support on the factors of variation (|S| = 3, |S| = 4, |S| = 5), which correspond to compositions of tasks in the training distribution; see Figure 4, with the accompanying Table 9 in Appendix D.

Section: Domain Generalization
In this section we evaluate our method on benchmarks coming from the domain generalization field [32,93,70] and subpopulation distribution shifts [73,44], to show that a feature space learned with our inductive biases performs well out-of-distribution on real world data.
Subpopulation shifts. Subpopulation shifts occur when the distribution of minority groups changes across domains. Our claim is that a feature space that satisfies sparse sufficiency and minimality is more robust to spurious correlations which may affect minority groups, and should transfer better to new distributions. To validate this, we test on two benchmarks, Waterbirds [73] and CivilComments [44] (see Appendix C.1). For both, we use the train and test split of the original dataset. In Table 4, last row, we report the results on the test set of Waterbirds for the different groups in the dataset (landbirds on land, landbirds on water, waterbirds on land, and waterbirds on water, respectively). We fit the linear head on a random subset of the training domain, balanced by class, repeat 10 times and report accuracy and standard deviation on the test set. For CivilComments we report the average and worst group accuracy in Figure 5, where we compare with ERM and GroupDRO [73]. While performing almost on par with ERM on average, our method is more robust to spurious correlations in the dataset, showing the highest worst group accuracy. Importantly, we outperform GroupDRO, which uses information on the subdomain statistics, while we do not assume any prior knowledge about them. Results per group are reported in the Appendix (Table 11).
Camelyon17. The model is trained according to the original splits in the dataset. In Table 3 we report the accuracy of our model on in-distribution and OOD splits, compared with different baselines [84,4]. Our method shows the best performance on the OOD test domains. The intuition for why this happens is that, due to minimality, we retain more features which are shared across the three training domains, giving less importance to the ones that are domain-specific (which contain the spurious correlations with the hospital environment information). This can be further enforced at test time, as we show in the ablation in Appendix D.9, trading off in-distribution performance for OOD accuracy.
We finally show the ability of features learned with our method to adapt to a new domain with a small number of samples in a few-shot setting. We compare the results with ERM in Table 2, averaged by domains in each benchmark dataset. The full scores for each domain are in Appendix D.5 for the 1-shot, 5-shot, and 10-shot settings, reporting the mean accuracy and standard deviations over 100 draws. Our approach achieves consistently higher accuracy than ERM, showing the better adaptation capabilities of our minimal and sufficiently sparse feature space.

Section: Additional results
In Appendix D we report a large collection of additional results, including a comparison with 14 baseline methods on the domain shift benchmarks (D.4), a qualitative and quantitative analysis of the minimality and sparse sufficiency properties in the real setting (D.2), a favorable additional comparison on meta learning benchmarks with 6 other baselines including [47] (D.8), an ablation study on the effect of clustering features at test time (D.9), and a demonstration of the possibility to obtain a task similarity measure as a consequence of our approach (D.7).

Section: Conclusions
In this paper, we demonstrated how to learn disentangled representations from a distribution of tasks by enforcing feature sparsity and sharing. We have shown this setting is identifiable and have validated it experimentally in a synthetic and controlled setting. Additionally, we have empirically shown that these representations are beneficial for generalizing out-of-distribution in real-world settings, isolating spurious and domain specific factors that should not be used under distribution shift.
Limitations and future work: The main limitation of our work is the global assumption on the strength of the sparsity and feature sharing regularizers α and β across all tasks. In real settings these properties of the representations might need to change for different tasks. We have already observed this in the synthetic setting in Figure 3, where for β > 0.25 features cluster excessively, fail to achieve clear disentanglement and do not generalize well. Future work may exploit some level of knowledge on the task distribution (e.g. some measure of distance between tasks) in order to tune α and β adaptively during training, or to train conditioning on a distribution of regularization parameters as in [21], enabling more generalization at test time. Another limitation is in the sampling procedure to fit the linear head at test time: sampling randomly from the training set (balanced by class) may not be enough to achieve the best performance under distribution shifts. Alternative sampling procedures, e.g. ones that incorporate knowledge on the distribution shift if available (as in [43]), may lead to better performance at test time.

Section: Acknowledgments and Disclosure of Funding
Marco Fumero and Emanuele Rodolà were supported by the ERC grant no. 802554 (SPECGEO), PRIN 2020 project no. 2020TA3K9N (LEGO.AI), and PNRR MUR project PE0000013-FAIR. Marco Fumero and Francesco Locatello were partially at Amazon while working on this project. We thank Julius von Kügelgen, Sebastian Lachapelle and the anonymous reviewers for their feedback and suggestions.


References:
[b0] Julius Adebayo; Justin Gilmer; Michael Muelly; Ian J Goodfellow; Moritz Hardt; Been Kim (2018-12-03). Sanity checks for saliency maps. 
[b1] Kartik Ahuja; Karthikeyan Shanmugam; Kush R Varshney; Amit Dhurandhar (2020-07). Invariant risk minimization games. PMLR
[b2] Isabela Albuquerque; João Monteiro; Mohammad Darvishi; Tiago H Falk; Ioannis Mitliagkas (2019). Generalizing to unseen domains via distribution matching. 
[b3] Martin Arjovsky; Léon Bottou; Ishaan Gulrajani; David Lopez-Paz (2019). Invariant risk minimization. 
[b4] Jinze Bai; Rui Men; Hao Yang; Xuancheng Ren; Kai Dang; Yichang Zhang; Xiaohuan Zhou; Peng Wang; Sinan Tan; An Yang (2022). Ofasys: A multi-modal multi-task learning system for building generalist models. 
[b5] Peter Bandi (). Camelyon17 dataset. 
[b6] Sara Beery; Grant Van Horn; Pietro Perona (2018). Recognition in terra incognita. 
[b7] Yoshua Bengio; Aaron Courville; Pascal Vincent (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence
[b8] Luca Bertinetto; João F Henriques; Philip H S Torr; Andrea Vedaldi (2019). Meta-learning with differentiable closed-form solvers. 
[b9] Gilles Blanchard; Aniket Anand Deshmukh; Ürun Dogan; Gyemin Lee; Clayton Scott (2021). Domain generalization by marginal transfer learning. The Journal of Machine Learning Research
[b10] Mathieu Blondel; Quentin Berthet; Marco Cuturi; Roy Frostig; Stephan Hoyer; Felipe Llinares-López; Fabian Pedregosa; Jean-Philippe Vert (2021). Efficient and modular implicit differentiation. 
[b11] Daniel Borkan; Lucas Dixon; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman (2019). Nuanced metrics for measuring unintended bias with real data for text classification. 
[b12] Johann Brehmer; Pim De Haan; Phillip Lippe; Taco Cohen (2022). Weakly supervised causal representation learning. 
[b13] Chris Burgess; Hyunjik Kim (2018). 3d shapes dataset. 
[b14] Rich Caruana (1997). Multitask learning. Machine learning
[b15] Xi Chen; Yan Duan; Rein Houthooft; John Schulman; Ilya Sutskever; Pieter Abbeel (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. 
[b16] Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li (2009-06-25). Imagenet: A large-scale hierarchical image database. IEEE Computer Society
[b17] Guneet Singh Dhillon; Pratik Chaudhari; Avinash Ravichandran; Stefano Soatto (2020). A baseline for few-shot image classification. 
[b18] Andrea Dittadi; Frederik Träuble; Francesco Locatello; Manuel Wuthrich; Vaibhav Agrawal; Ole Winther; Stefan Bauer; Bernhard Schölkopf (2021). On the transfer of disentangled representations in realistic settings. 
[b19] Lucas Dixon; John Li; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman (2018). Measuring and mitigating unintended bias in text classification. 
[b20] Alexey Dosovitskiy; Josip Djolonga (2020). You only train once: Loss-conditional training of deep networks. 
[b21] Cian Eastwood; Christopher K I Williams (2018-05-03). A framework for the quantitative evaluation of disentangled representations. 
[b22] M Everingham; L Van Gool; C K I Williams; J Winn; A Zisserman (2007). The PASCAL Visual Object Classes Challenge. 
[b23] Li Fei-Fei; Rob Fergus; Pietro Perona (2004). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. IEEE
[b24] Marco Fumero; Luca Cosmo; Simone Melzi; Emanuele Rodolà (2021-07). Learning disentangled representations via product manifold projection. PMLR
[b25] Yaroslav Ganin; Evgeniya Ustinova; Hana Ajakan; Pascal Germain; Hugo Larochelle; François Laviolette; Mario Marchand; Victor Lempitsky (2016). Domain-adversarial training of neural networks. The journal of machine learning research
[b26] Robert Geirhos; Jörn-Henrik Jacobsen; Claudio Michaelis; Richard Zemel; Wieland Brendel; Matthias Bethge; Felix A Wichmann (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence
[b27] Zhengyang Geng; Xin-Yu Zhang; Shaojie Bai; Yisen Wang; Zhouchen Lin (2021-12-06). On training implicit models. 
[b28] Ian J Goodfellow; Quoc V Le; Andrew M Saxe; Honglak Lee; Andrew Y Ng (2009-12-10). Measuring invariances in deep networks. 
[b30] Anirudh Goyal; Alex Lamb; Jordan Hoffmann; Shagun Sodhani; Sergey Levine; Yoshua Bengio; Bernhard Schölkopf (2020). Recurrent independent mechanisms. 
[b31] Andreas Griewank; Andrea Walther (2008). Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM
[b32] Ishaan Gulrajani; David Lopez-Paz (2021). In search of lost domain generalization. 
[b33] Irina Higgins; Loïc Matthey; Arka Pal; Christopher Burgess; Xavier Glorot; Matthew Botvinick; Shakir Mohamed; Alexander Lerchner (2017). beta-vae: Learning basic visual concepts with a constrained variational framework. 
[b34] Timothy Hospedales; Antreas Antoniou; Paul Micaelli; Amos Storkey (2020). Meta-learning in neural networks: A survey. 
[b35] Ziniu Hu; Zhe Zhao; Xinyang Yi; Tiansheng Yao; Lichan Hong; Yizhou Sun; Ed H Chi (2022). Improving multi-task generalization via regularizing spurious correlation. 
[b36] Zeyi Huang; Haohan Wang; Eric P Xing; Dong Huang (2020). Self-challenging improves cross-domain generalization. Springer
[b37] Aapo Hyvärinen; Hiroaki Sasaki; Richard E Turner (2019-04). Nonlinear ICA using auxiliary variables and generalized contrastive learning. PMLR
[b38] Ali Jalali; Sujay Sanghavi; Chao Ruan; Pradeep Ravikumar (). A dirty model for multi-task learning. 
[b40] Hicham Janati; Marco Cuturi; Alexandre Gramfort (2019-04). Wasserstein regularization for sparse multi-task regression. PMLR
[b41] Yibo Jiang; Victor Veitch (2022). Invariant and transportable representations for anti-causal domain shifts. 
[b42] Ilyes Khemakhem; Diederik P Kingma; Ricardo Pio Monti; Aapo Hyvärinen (2020-08). Variational autoencoders and nonlinear ICA: A unifying framework. PMLR
[b43] Diederik P Kingma; Jimmy Ba (2015). Adam: A method for stochastic optimization. 
[b44] Polina Kirichenko; Pavel Izmailov; Andrew Gordon Wilson (2022). Last layer re-training is sufficient for robustness to spurious correlations. 
[b45] Pang Wei Koh; Shiori Sagawa; Henrik Marklund; Sang Michael Xie; Marvin Zhang; Akshay Balsubramani; Weihua Hu; Michihiro Yasunaga; Richard Lanas Phillips; Irena Gao; Tony Lee; Etienne David; Ian Stavness; Wei Guo; Berton Earnshaw; Imran S Haque; Sara M Beery; Jure Leskovec; Anshul Kundaje; Emma Pierson; Sergey Levine; Chelsea Finn; Percy Liang (2021-07). WILDS: A benchmark of in-the-wild distribution shifts. PMLR
[b46] David Krueger; Ethan Caballero; Joern-Henrik Jacobsen; Amy Zhang; Jonathan Binas; Dinghuai Zhang; Remi Le Priol; Aaron Courville (2021). Out-of-distribution generalization via risk extrapolation (rex). PMLR
[b47] Tejas D Kulkarni; William F Whitney; Pushmeet Kohli; Joshua B Tenenbaum (2015). Deep convolutional inverse graphics network. 
[b48] Sébastien Lachapelle; Tristan Deleu; Divyat Mahajan; Ioannis Mitliagkas; Yoshua Bengio; Simon Lacoste-Julien; Quentin Bertrand (2022). Synergies between disentanglement and sparsity: a multi-task learning perspective. 
[b49] Sébastien Lachapelle; Pau Rodriguez; Yash Sharma; Katie E Everett; Rémi Le Priol; Alexandre Lacoste; Simon Lacoste-Julien (2022). Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica. PMLR
[b50] Yann LeCun; Fu Jie Huang; Léon Bottou (2004). Learning methods for generic object recognition with invariance to pose and lighting. IEEE
[b51] Kwonjoon Lee; Subhransu Maji; Avinash Ravichandran; Stefano Soatto (2019). Meta-learning with differentiable convex optimization. 
[b52] Da Li; Yongxin Yang; Yi-Zhe Song; Timothy Hospedales (2018). Learning to generalize: Meta-learning for domain generalization. 
[b53] Da Li; Yongxin Yang; Yi-Zhe Song; Timothy M Hospedales (2017). Deeper, broader and artier domain generalization. IEEE Computer Society
[b54] Haoliang Li; Sinno Jialin Pan; Shiqi Wang; Alex C Kot (2018). Domain generalization with adversarial feature learning. 
[b55] Ya Li; Xinmei Tian; Mingming Gong; Yajing Liu; Tongliang Liu; Kun Zhang; Dacheng Tao (2018). Deep domain generalization via conditional invariant adversarial networks. 
[b56] Phillip Lippe; Sara Magliacane; Sindy Löwe; Yuki M Asano; Taco Cohen; Stratis Gavves (2022-07-23). CITRIS: causal identifiability from temporal intervened sequences. PMLR
[b57] Francesco Locatello; Stefan Bauer; Mario Lucic; Gunnar Rätsch; Sylvain Gelly; Bernhard Schölkopf; Olivier Bachem (2019-06-15). Challenging common assumptions in the unsupervised learning of disentangled representations. PMLR
[b58] Francesco Locatello; Stefan Bauer; Mario Lucic; Gunnar Rätsch; Sylvain Gelly; Bernhard Schölkopf; Olivier Bachem (2020). A sober look at the unsupervised learning of disentangled representations and their evaluation. J. Mach. Learn. Res
[b59] Francesco Locatello; Ben Poole; Gunnar Rätsch; Bernhard Schölkopf; Olivier Bachem; Michael Tschannen (2020-07). Weakly-supervised disentanglement without compromises. PMLR
[b60] Aurelie C Lozano; Grzegorz Swirszcz (2012-07-01). Multi-level lasso for sparse multi-task regression. 
[b61] Chaochao Lu; Yuhuai Wu; José Miguel Hernández-Lobato; Bernhard Schölkopf (2022). Invariant causal representation learning for out-of-distribution generalization. 
[b62] Zvika Marx; Michael T Rosenstein; Leslie Pack Kaelbling; Thomas G Dietterich (2005). Transfer learning with an ensemble of background tasks. Inductive Transfer
[b63] Loic Matthey; Irina Higgins; Demis Hassabis; Alexander Lerchner (2017). dsprites: Disentanglement testing sprites dataset. 
[b64] John Miller; Rohan Taori; Aditi Raghunathan; Shiori Sagawa; Pang Wei Koh; Vaishaal Shankar; Percy Liang; Yair Carmon; Ludwig Schmidt (2021-07). Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. PMLR
[b65] Krikamol Muandet; David Balduzzi; Bernhard Schölkopf (2013-06-21). Domain generalization via invariant feature representation. 
[b66] Hyeonseob Nam; Hyunjae Lee; Jongchan Park; Wonjun Yoon; Donggeun Yoo (2021). Reducing domain gap by reducing style bias. 
[b67] Boris N Oreshkin; Pau Rodríguez López; Alexandre Lacoste (2018-12-03). TADAM: task dependent adaptive metric for improved few-shot learning. 
[b68] G Parascandolo; N Kilbertus; M Rojas-Carulla; B Schölkopf (2018). Learning independent causal mechanisms. 
[b69] Ji Ho Park; Jamin Shin; Pascale Fung (2018). Reducing gender bias in abusive language detection. Association for Computational Linguistics
[b70] Adam Paszke; Sam Gross; Francisco Massa; Adam Lerer; James Bradbury; Gregory Chanan; Trevor Killeen; Zeming Lin; Natalia Gimelshein; Luca Antiga; Alban Desmaison; Andreas Kopf; Edward Yang; Zachary Devito; Martin Raison; Alykhan Tejani; Sasank Chilamkurthy; Benoit Steiner; Lu Fang; Junjie Bai; Soumith Chintala (). Pytorch: An imperative style, high-performance deep learning library. 
[b72] Jielin Qiu; Yi Zhu; Xingjian Shi; Florian Wenzel; Zhiqiang Tang; Ding Zhao; Bo Li; Mu Li (2022). Are multimodal models robust to image and text perturbations?. 
[b73] Scott E Reed; Yi Zhang; Yuting Zhang; Honglak Lee (2015). Deep visual analogy-making. 
[b74] Bryan C Russell; Antonio Torralba; Kevin P Murphy; William T Freeman (2008). Labelme: a database and web-based tool for image annotation. International journal of computer vision
[b75] Shiori Sagawa; Pang Wei Koh; Tatsunori B Hashimoto; Percy Liang (2019). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. 
[b76] Shiori Sagawa; Aditi Raghunathan; Pang Wei Koh; Percy Liang (2020-07). An investigation of why overparameterization exacerbates spurious correlations. PMLR
[b77] Ruslan Salakhutdinov (1973). Deep learning. ACM
[b78] Victor Sanh; Lysandre Debut; Julien Chaumond; Thomas Wolf (2019). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. 
[b79] Jürgen Schmidhuber (1992). Learning factorial codes by predictability minimization. Neural computation
[b80] Bernhard Schölkopf; Francesco Locatello; Stefan Bauer; Nan Rosemary Ke; Nal Kalchbrenner; Anirudh Goyal; Yoshua Bengio (2021). Toward causal representation learning. 
[b81] Anna Seigal; Chandler Squires; Caroline Uhler (2022). Linear causal disentanglement via interventions. 
[b82] Amanpreet Singh; Ronghang Hu; Vedanuj Goswami; Guillaume Couairon; Wojciech Galuba; Marcus Rohrbach; Douwe Kiela (2021). FLAVA: A foundational language and vision alignment model. 
[b83] Jake Snell; Kevin Swersky; Richard S Zemel (2017). Prototypical networks for few-shot learning. 
[b84] Peter Sorrenson; Carsten Rother; Ullrich Köthe (2020). Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). 
[b85] Trevor Standley; Amir Roshan Zamir; Dawn Chen; Leonidas J Guibas; Jitendra Malik; Silvio Savarese (2020-07). Which tasks should be learned together in multi-task learning. PMLR
[b86] Baochen Sun; Jiashi Feng; Kate Saenko (2017). Correlation alignment for unsupervised domain adaptation. Springer
[b87] Baochen Sun; Kate Saenko (2016). Deep coral: Correlation alignment for deep domain adaptation. Springer
[b88] Victor Veitch; Alexander D'Amour; Steve Yadlowsky; Jacob Eisenstein (2021). Counterfactual invariance to spurious correlations: Why and how to pass stress tests. 
[b89] Hemanth Venkateswara; Jose Eusebio; Shayok Chakraborty; Sethuraman Panchanathan (2017). Deep hashing network for unsupervised domain adaptation. IEEE Computer Society
[b90] Oriol Vinyals; Charles Blundell; Tim Lillicrap; Koray Kavukcuoglu; Daan Wierstra (2016). Matching networks for one shot learning. 
[b91] Catherine Wah; Steve Branson; Peter Welinder; Pietro Perona; Serge Belongie (2011). The caltech-ucsd birds-200-2011 dataset. 
[b92] Zihao Wang; Victor Veitch (2022). A unified causal view of domain invariant representation learning. 
[b93] Zirui Wang; Zihang Dai; Barnabás Póczos; Jaime G Carbonell (2019). Characterizing and avoiding negative transfer. Computer Vision Foundation / IEEE
[b94] Martin Wattenberg; Fernanda Viégas; Ian Johnson (2016). How to use t-sne effectively. Distill
[b95] Florian Wenzel; Andrea Dittadi; Peter V Gehler; Carl-Johann Simon-Gabriel; Max Horn; Dominik Zietlow; David Kernert; Chris Russell; Thomas Brox; Bernt Schiele; Bernhard Schölkopf; Francesco Locatello (2022). Assaying out-of-distribution generalization in transfer learning. 
[b96] Olivia Wiles; Sven Gowal; Florian Stimberg; Sylvestre-Alvise Rebuffi; Ira Ktena; Krishnamurthy Dvijotham; Ali Taylan Cemgil (2022). A fine-grained analysis on distribution shift. 
[b97] Matthew Willetts; Brooks Paige (2021). I don't need u: Identifiable non-linear ica without side information. 
[b98] Thomas Wolf; Lysandre Debut; Victor Sanh; Julien Chaumond; Clement Delangue; Anthony Moi; Pierric Cistac; Tim Rault; Rémi Louf; Morgan Funtowicz (2019). Huggingface's transformers: State-of-the-art natural language processing. 
[b99] Mitchell Wortsman; Gabriel Ilharco; Samir Yitzhak Gadre; Rebecca Roelofs; Raphael Gontijo-Lopes; Ari S Morcos; Hongseok Namkoong; Ali Farhadi; Yair Carmon; Simon Kornblith; Ludwig Schmidt (2022-07-23). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. PMLR
[b100] Jianxiong Xiao; James Hays; Krista A Ehinger; Aude Oliva; Antonio Torralba (2010-06-18). SUN database: Large-scale scene recognition from abbey to zoo. IEEE Computer Society
[b101] Shen Yan; Huan Song; Nanxiang Li; Lincan Zou; Liu Ren (2020). Improve unsupervised domain adaptation with mixup training. 
[b102] Weiran Yao; Yuewen Sun; Alex Ho; Changyin Sun; Kun Zhang (2022). Learning temporally causal latent processes from general temporal data. 
[b103] Lu Yuan; Dongdong Chen; Yi-Ling Chen; Noel Codella; Xiyang Dai; Jianfeng Gao; Houdong Hu; Xuedong Huang; Boxin Li; Chunyuan Li; Ce Liu; Mengchen Liu; Zicheng Liu; Yumao Lu; Yu Shi; Lijuan Wang; Jianfeng Wang; Bin Xiao; Zhen Xiao; Jianwei Yang; Michael Zeng; Luowei Zhou; Pengchuan Zhang (2021). Florence: A new foundation model for computer vision. 
[b104] Marvin Zhang; Henrik Marklund; Nikita Dhawan; Abhishek Gupta; Sergey Levine; Chelsea Finn (2021). Adaptive risk minimization: Learning to adapt to domain shift. Advances in Neural Information Processing Systems
[b105] Yu Zhang; Qiang Yang (2018). An overview of multi-task learning. National Science Review
[b106] Bolei Zhou; Agata Lapedriza; Aditya Khosla; Aude Oliva; Antonio Torralba (2017). Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence
[b107] Kaiyang Zhou; Ziwei Liu; Yu Qiao; Tao Xiang; Chen Change Loy (2021). Domain generalization: A survey. 
[b108] Jinguo Zhu; Xizhou Zhu; Wenhai Wang; Xiaohua Wang; Hongsheng Li; Xiaogang Wang; Jifeng Dai (2022). Uni-perceiver-moe: Learning sparse generalist models with conditional moes. 

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: Model scheme: illustration of the inner loop stage (top) and the outer loop stage, following the steps of the algorithmic procedure described in Section B.1 in the Appendix. [Spilled body text:] The regularizer is composed of the weighted sum of a sparsity penalty $\mathrm{Reg}_{L1}$ and an entropy-based feature sharing penalty $\mathrm{Reg}_{\mathrm{sharing}}$: $\mathrm{Reg}(\phi) = \alpha\,\mathrm{Reg}_{L1}(\phi) + \beta\,\mathrm{Reg}_{\mathrm{sharing}}(\phi)$ (1)
Data: 

Figure fig_1: 2
Type: figure
Caption: Figure 2: Assumed causal generative model: the gray variables are unobserved. Observations x are generated by some unknown mixing of a set of factors of variation z. Additionally, we observe a distribution of supervised tasks, each depending only on a subset of the factors of variation, indexed by S.
Data: 

Figure fig_2: 4
Type: figure
Caption: Figure 4: Task compositional generalization: mean accuracy over 100 random test tasks, reported for groups of tasks of growing support (second, third, fourth column), for a model trained without the inductive biases (blue, attaining DCI = 29.4) and one enforcing them (orange, DCI = 59.4). The latter shows better compositional generalization, resulting from the properties enforced on the representation. Exact values are reported in Table 9 in the Appendix.
Data: 

Figure fig_3: 5
Type: figure
Caption: Figure 5: Quantitative results on CivilComments: we report the test accuracy averaged across all demographic groups (left group) and the worst-group accuracy (right group). Our method (green) performs similarly in terms of average accuracy and outperforms the compared methods in terms of worst-group accuracy, without using any knowledge of the group composition in the training data. For exact values and error estimates, see Table 10 in the Appendix.
Data: 

Figure tab_0: 2
Type: table
Caption: Quantitative results for few-shot transfer learning, with our method consistently outperforming ERM across all sample sizes and data sets.
Data: OOD accuracy (averaged by domains)
N-shot    Algorithm   PACS   VLCS   OfficeHome   Waterbirds
1-shot    ERM         80.5   59.7   56.4         79.8
1-shot    Ours        81.5   68.2   58.4         88.4
5-shot    ERM         87.1   71.7   75.7         79.8
5-shot    Ours        88.3   74.5   77.0         87.6
10-shot   ERM         87.9   74.0   81.0         84.2
10-shot   Ours        90.4   77.3   82.0         89.2

Figure tab_1: 3
Type: table
Caption: Quantitative evaluation on Camelyon17: we report accuracy both on ID and OOD splits. Our approach achieves significantly higher validation and test OOD accuracy.
Data:
Algorithm   Validation (ID)   Validation (OOD)   Test (OOD)
ERM         93.2              84                 70.3
CORAL       95.4              86.2               59.5
IRM         91.6              86.2               64.2
Ours        93.2 ± 0.3        89.9 ± 0.6         74.1 ± 0.2

Figure tab_2: 
Type: table
Caption: Table 10 in the Appendix.
Data: DomainBed. We evaluate the domain generalization performance on the PACS, VLCS and OfficeHome datasets from the DomainBed [32] test suite (see Appendix C.1 for more details). On these datasets, we train on N - 1 domains and leave one out for testing. The regularization parameters α and β are tuned on the validation sets of PACS and then reused on the other datasets. For these experiments we use a ResNet50 pretrained on ImageNet [17] as a backbone, as done in [32]. To fit the linear head, we sample 10 times with different sample sizes from the training domains and report the mean score and standard deviation. Results are reported in Table 4, showing how enforcing sparse sufficiency and minimality consistently leads to better OOD performance. A comparison with 13 additional baselines is given in Appendix D.4.
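The leave-one-domain-out protocol and the mean/standard-deviation reporting described above can be sketched as follows. This is only an illustrative outline: the helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported.

```python
import statistics


def leave_one_domain_out(domains):
    """Yield (training_domains, held_out_domain) pairs, one per domain."""
    for held_out in domains:
        train = [d for d in domains if d != held_out]
        yield train, held_out


def mean_and_std(scores):
    """Summarize repeated runs; sample standard deviation is an assumption."""
    return statistics.mean(scores), statistics.stdev(scores)


# PACS domain initials, as used in the column headers of Table 4.
splits = list(leave_one_domain_out(["P", "A", "C", "S"]))
summary = mean_and_std([1.0, 2.0, 3.0])
```

Each split would be evaluated by fitting the linear head on samples from the training domains and scoring on the held-out one, repeated over the 10 random draws.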

Figure tab_3: 4
Type: table
Caption: Results for domain generalization on DomainBed. Our approach achieves consistently higher average OOD generalization, outperforming ERM in all cases except one.
Data: OOD accuracy (by domain)
PACS         S            A            P            C            Average
ERM          77.9 ± 0.4   88.1 ± 0.1   97.8 ± 0.0   79.1 ± 0.9   85.7
Ours         83.1 ± 0.1   86.7 ± 0.8   97.8 ± 0.1   83.5 ± 0.1   87.5
VLCS         C            L            V            S            Average
ERM          97.6 ± 1.0   63.3 ± 0.9   76.4 ± 1.5   72.2 ± 0.5   77.4
Ours         98.1 ± 0.2   63.4 ± 0.5   78.2 ± 0.7   73.9 ± 0.8   78.4
OfficeHome   C            A            P            R            Average
ERM          53.4 ± 0.6   62.7 ± 1.1   76.5 ± 0.4   77.3 ± 0.6   67.5
Ours         56.3 ± 0.1   66.7 ± 0.7   79.2 ± 0.5   81.3 ± 0.4   70.9
Waterbirds   LL           LW           WL           WW           Average
ERM          98.6 ± 0.3   52.05 ± 3    68.5 ± 3     93 ± 0.3     81.3
Ours         99.5 ± 0.1   73.0 ± 2.5   85.0 ± 2     95.5 ± 0.4   90.5
4.3 Few-shot transfer learning.


Formulas:
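Formula formula_0: [Labels from the model scheme diagram] Inner loop: $x^U$, $g_\theta$, $f_\phi$, $\hat{z}^U$, $\hat{y}^U$, $\mathcal{L}_{\mathrm{inner}}$. Outer loop: $x^Q$, $g_\theta$, $f_{\phi^*}$, $\phi^*$, $\hat{z}^Q$, $\hat{y}^Q$, $\mathcal{L}_{\mathrm{outer}}$.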

Formula formula_2: $\mathrm{Reg}_{L1}(\phi) = \frac{1}{TC} \sum_{t,c,m} |\phi_{t,m,c}|$ (2)

Formula formula_3: $\mathrm{Reg}_{\mathrm{sharing}}(\phi) = H(\varphi_m) = -\sum_m \varphi_m \log(\varphi_m)$ (3)
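To make the two penalties concrete, here is a minimal sketch in plain Python of Eqs. (1)-(3). The head weights are stored as nested lists indexed `phi[t][m][c]` (task, feature, class). How the per-feature quantity inside the entropy is computed is not recoverable from this text; the normalized absolute mass used in `feature_usage` is an assumption made for illustration, as are all function names.

```python
import math


def reg_l1(phi):
    """Eq. (2): sum of |phi[t][m][c]| divided by T * C."""
    T = len(phi)
    C = len(phi[0][0])
    total = sum(abs(w) for task in phi for feat in task for w in feat)
    return total / (T * C)


def feature_usage(phi):
    """ASSUMPTION: per-feature absolute mass over tasks and classes, normalized to sum to 1."""
    M = len(phi[0])
    mass = [sum(abs(w) for task in phi for w in task[m]) for m in range(M)]
    total = sum(mass)
    if total == 0:
        return [1.0 / M] * M
    return [x / total for x in mass]


def reg_sharing(phi):
    """Eq. (3): entropy of the feature usage distribution (0 * log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in feature_usage(phi) if p > 0)


def reg(phi, alpha, beta):
    """Eq. (1): weighted sum of the sparsity and sharing penalties."""
    return alpha * reg_l1(phi) + beta * reg_sharing(phi)


# Two tasks, two features, one class each.
shared = [[[1.0], [0.0]], [[1.0], [0.0]]]    # both tasks use feature 0
disjoint = [[[1.0], [0.0]], [[0.0], [1.0]]]  # each task uses its own feature
```

Under this assumed normalization, `shared` has zero entropy while `disjoint` has entropy log 2, so minimizing the penalty favors tasks reusing the same features.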

Formula formula_4: $\min_\theta \mathbb{E}_t \left[ \mathcal{L}_{\mathrm{outer}}\left( f_{\phi^*}(g_\theta(x^Q_t)),\, y^Q_t \right) \right]$, (4)

Formula formula_5: $\min_\phi \frac{1}{T} \sum_t \mathcal{L}_{\mathrm{inner}}(\hat{y}^U_t, y^U_t) + \mathrm{Reg}(\phi)$, (5)
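As a rough illustration of how an inner problem like Eq. (5) yields sparse heads, the sketch below fits a single linear head with an L1 penalty using proximal gradient descent (ISTA). Several things here are assumptions and not the authors' procedure: the solver itself, the squared-error inner loss, the single-task scalar-output setting, and the omission of the sharing term. All names are hypothetical.

```python
def soft_threshold(w, tau):
    """Proximal operator of tau * |w|."""
    if w > tau:
        return w - tau
    if w < -tau:
        return w + tau
    return 0.0


def fit_sparse_head(Z, y, alpha, lr=0.1, steps=500):
    """Minimize (1 / 2n) * ||Z phi - y||^2 + alpha * ||phi||_1 with ISTA.

    Z: list of n feature vectors (the encoder outputs), y: list of n targets.
    lr must stay below 2 / L, where L is the largest eigenvalue of Z^T Z / n.
    """
    n, M = len(Z), len(Z[0])
    phi = [0.0] * M
    for _ in range(steps):
        residual = [sum(p * z for p, z in zip(phi, row)) - t for row, t in zip(Z, y)]
        grad = [sum(r * row[m] for r, row in zip(residual, Z)) / n for m in range(M)]
        phi = [soft_threshold(p - lr * g, lr * alpha) for p, g in zip(phi, grad)]
    return phi


# Feature 0 equals the target; feature 1 is irrelevant to it.
Z = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [1.0, 2.0, 3.0, 4.0]
phi = fit_sparse_head(Z, y, alpha=0.1)
```

In this toy case the weight on the irrelevant feature is driven to zero while the relevant one is only slightly shrunk, which is the behavior the sparsity term is meant to induce in each task head.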

Formula formula_6: $Q_t$ with elements $(x^Q_t, y^Q_t) \in Q_t$. Algorithm.

Formula formula_7: $f^*_t = f^*_t|_{S_t}$ for $S_t \sim p(S)$; 2. minimality: $\nexists\, S' \neq S_t \subset S$ s.t. $f^*_t|_{S'} = f^*_t$,

Formula formula_8: $p(S \cap S' = \{i\}) > 0$ or $p(\{i\} \in (S \cup S') \setminus (S' \cap S)) > 0$.

