Title: Disentangling Representations via Sparse and Shared Feature Activations in Multi-Task Learning

Abstract: Prior research on disentangled representation learning (DRL) for high-dimensional data has predominantly focused on synthetic environments, limiting its applicability to real-world scenarios. This reliance on unsupervised or weakly-supervised objectives often overlooks the practical benefits of DRL for complex, real-world data. We introduce a novel approach that leverages knowledge from a diversified set of supervised tasks to learn a common, robustly disentangled representation. Our method operates within a supervised multi-task learning framework, where each task is assumed to depend on an unknown, sparse subset of the underlying factors of variation. We achieve disentanglement by enforcing sparse feature activations across tasks and promoting maximal information sharing where appropriate. Crucially, our framework achieves identifiability without direct observation of the factors of variation, relying instead on the inherent structure provided by multiple supervised tasks under proposed sufficiency and minimality assumptions. We rigorously validate our approach across six real-world distribution shift benchmarks and diverse data modalities (images, text), showcasing the effective transferability of disentangled representations to practical scenarios.

Section: Introduction
Learning meaningful and reusable representations from high-dimensional data is a central challenge in deep learning [8,75,78,77]. Disentangled Representation Learning (DRL) [56,8,33] addresses this by aiming to recover the underlying factors of variation (FOVs) of the data distribution. Ideally, disentangled representations capture all relevant information in a compact, interpretable structure [46,16], independent of any single task [29]. The separation of information into interventionally independent factors [78] is posited to enable robust downstream predictions, a benefit partially validated in synthetic settings [19,58]. However, these theoretical advantages have yet to fully materialize in real-world representation learning problems, primarily due to the scalability limitations of current DRL approaches.
Herein, we address these limitations by leveraging knowledge from diverse supervised task objectives to learn superior representations for high-dimensional data. We specifically investigate the intrinsic link between disentanglement and out-of-distribution (OOD) generalization in real-world data settings. While a broad spectrum of tasks is expected to yield richer, more generalizable representations, this is often hindered by task competition, leading to suboptimal models. This phenomenon, termed negative transfer [61,91] in transfer learning or task competition [83] in multi-task learning, arises when a model with limited capacity attempts to learn disparate tasks requiring high feature variability or coverage. Using shared features for conflicting objectives can introduce noise and amplify sensitivity to spurious correlations [35,27,7], as features may be simultaneously predictive and detrimental across tasks. To mitigate this, we propose a framework that utilizes a diverse set of tasks, each assumed to depend on an unknown, sparse subset of the factors of variation. We demonstrate that disentangled representations naturally emerge without explicit annotation of these factors, guided by two fundamental representation constraints:
• Sparse Sufficiency: Features are activated sparsely for each task, ensuring that any given task can be solved using a minimal subset of features. This promotes task-specific relevance and reduces irrelevant feature activations.
• Minimality: Features are maximally shared across tasks, preventing duplication and encouraging the reuse of features. This leads to a compact and efficient representation by avoiding redundant information.
These properties are inherently desirable, yielding features that (i) are disentangled with respect to the underlying factors of variation of the task data distribution (as theoretically supported by Proposition 2.1), (ii) exhibit enhanced generalization capabilities in the presence of distribution shifts, and (iii) effectively mitigate negative transfer phenomena. Practically, we implement these principles within a meta-learning framework, employing a sparsity regularizer for feature sufficiency and an entropy-based regularizer for feature sharing, both integrated into the base learner. Our experiments demonstrate that this model learns meaningful disentangled representations, leading to strong generalization performance on real-world datasets. Our key contributions are:
• We propose a novel meta-learning framework that successfully learns disentangled representations by leveraging knowledge from a distribution of diverse tasks. This framework uniquely integrates sparse sufficiency and minimality principles, enabling the coexistence of task-specific and generalizable features.
• We rigorously validate our theoretical identifiability result (Proposition 2.1) on synthetic datasets, demonstrating that our sufficiency and minimality principles effectively recover ground-truth factors of variation in idealized, controlled settings.
• We significantly extend our empirical evaluation to complex, non-synthetic datasets lacking explicit factor annotations, showcasing our approach's strong out-of-distribution generalization capabilities across various domain generalization and distribution shift benchmarks.

Section: Method
Given a distribution of tasks $t \sim \mathcal{T}$, each with associated data $(x_t, y_t) \sim P_t$, our objective is to learn a disentangled representation $g(x) = \hat{z} \in \hat{\mathcal{Z}} \subseteq \mathbb{R}^M$. This representation is designed to generalize effectively to unseen tasks, and its learning is guided by the explicit imposition of sparse sufficiency and minimality inductive biases.

Section: Learning sparse and shared features
Our architecture (see Figure 1) comprises a shared backbone module $g_\theta$ and distinct linear classification heads $f_{\phi_t}$, each specific to a task $t$. The backbone learns a general feature representation common to all classification tasks. Each linear head then solves its respective classification problem within the feature space $\hat{\mathcal{Z}}$, while simultaneously enforcing the feature sufficiency and minimality principles. Following the standard meta-learning paradigm [34], $g_\theta$ acts as the meta-learner, and the task-specific $f_{\phi_t}$ serve as base learners. For a new task, we assume access to a support set $U$ containing samples $(x^U, y^U) \in U$. These samples are used to fit the optimal linear head $f_{\phi^*}$ for that task. Predictions for a query $x^Q \in Q$ are then obtained via the forward pass $\hat{y} = f_{\phi^*}(g_\theta(x^Q))$.
Enforcing feature minimality and sufficiency. To effectively solve tasks within the feature space $\hat{\mathcal{Z}}$ of the backbone module, we introduce a composite regularizer, $\mathrm{Reg}(\phi)$, applied to the classification heads $f_\phi$. The parameters of these heads are denoted by $\phi \in \mathbb{R}^{T \times M \times C}$, where $T$ is the number of tasks, $M$ is the number of features, and $C$ is the number of classes. This regularizer is designed to enforce both feature minimality and sufficiency. With scalar weights $\alpha$ and $\beta$, the overall regularizer is given by:
$$\mathrm{Reg}(\phi) = \alpha\,\mathrm{Reg}_{L_1}(\phi) + \beta\,\mathrm{Reg}_{\mathrm{sharing}}(\phi) \qquad (1)$$
The individual penalty terms are defined as:
$$\mathrm{Reg}_{L_1}(\phi) = \frac{1}{TC} \sum_{t,c,m} |\phi_{t,m,c}| \qquad (2)$$
$$\mathrm{Reg}_{\mathrm{sharing}}(\phi) = H(\bar{\phi}_m) = -\sum_{m} \bar{\phi}_m \log(\bar{\phi}_m) \qquad (3)$$
where $\bar{\phi}_m = \frac{1}{T} \frac{\sum_{t,c} |\phi_{t,m,c}|}{\sum_{t,c,m'} |\phi_{t,m',c}|}$ represents the normalized importance of feature $m$, averaged across tasks. Sparse sufficiency is enforced by the L1-norm regularizer, which compels the classification head to utilize only a sparse subset of features. Minimality is enforced by the feature sharing term: minimizing the entropy of the distribution of feature importances (i.e., normalized $|\phi_t|$) averaged across a mini-batch of $T$ tasks, which results in a more peaked distribution of activations across tasks. This mechanism promotes feature clustering across tasks, encouraging their reuse when beneficial. We note that alternative regularizers from the linear multi-task learning literature (e.g., [59,39,38]) could also enforce sparse sufficiency and minimality; exploring these remains a promising avenue for future work.
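As a minimal sketch of how the two penalty terms behave, the following pure-Python snippet evaluates Eqs. (1)-(3) on a nested list `phi[t][m][c]`. The function name `head_regularizer` and the toy weights are illustrative assumptions, not part of the paper's implementation. The sketch follows the definition of the feature importance as written, including the 1/T factor; with that factor the sharing term is an affine function of the entropy of the normalized importances, so both have the same minimizers.

```python
import math

def head_regularizer(phi, alpha, beta):
    """Return (Reg, Reg_L1, Reg_sharing) for head weights phi[t][m][c]."""
    T, M, C = len(phi), len(phi[0]), len(phi[0][0])
    # Total absolute weight over all tasks, features, and classes.
    total = sum(abs(w) for task in phi for row in task for w in row)
    reg_l1 = total / (T * C)  # Eq. (2)
    if total == 0.0:
        return alpha * reg_l1, reg_l1, 0.0
    reg_sharing = 0.0
    for m in range(M):
        # Absolute weight assigned to feature m by all tasks and classes.
        mass_m = sum(abs(phi[t][m][c]) for t in range(T) for c in range(C))
        phi_bar_m = mass_m / (T * total)  # importance of feature m, as written
        if phi_bar_m > 0.0:  # the term 0 * log(0) is taken to be 0
            reg_sharing -= phi_bar_m * math.log(phi_bar_m)  # Eq. (3)
    return alpha * reg_l1 + beta * reg_sharing, reg_l1, reg_sharing

# Toy heads (T=2 tasks, M=2 features, C=1 class), hypothetical values.
shared = [[[1.0], [0.0]], [[1.0], [0.0]]]  # both tasks use feature 0
split = [[[1.0], [0.0]], [[0.0], [1.0]]]   # each task uses its own feature
_, l1_shared, h_shared = head_regularizer(shared, alpha=0.1, beta=0.1)
_, l1_split, h_split = head_regularizer(split, alpha=0.1, beta=0.1)
```

Both toy heads are equally sparse, so the L1 term cannot tell them apart; only the sharing term prefers the configuration in which the two tasks reuse the same feature.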

Section: Training method
We train our model using a meta-learning approach, minimizing the expected test error across the task distribution t ∼ T . This process is formulated as a bi-level optimization problem. The optimal backbone model g θ * is determined by the outer optimization problem:
$$\min_{\theta}\; \mathbb{E}_{t}\Big[\mathcal{L}_{\mathrm{outer}}\big(f_{\phi^*}(g_\theta(x^Q_t)),\, y^Q_t\big)\Big], \qquad (4)$$
where $f_{\phi^*}$ denotes the optimal classifiers derived from the inner optimization problem. Here, $(x^Q_t, y^Q_t) \in Q_t$ represents the test (or query) data from the query set $Q_t$ for task $t$. Conversely, the optimal classifiers $f_{\phi^*}$ are obtained by solving the inner optimization problem, given a support set $U_t$ with samples $(x^U_t, y^U_t) \in U_t$ for task $t$. It is typical for the support set to be distinct from the query set (i.e., $U_t \cap Q_t = \emptyset$). The inner problem is defined as:
$$\min_{\phi}\; \frac{1}{T} \sum_{t} \mathcal{L}_{\mathrm{inner}}(\hat{y}^U_t, y^U_t) + \mathrm{Reg}(\phi), \qquad (5)$$
where $\hat{y}^U_t = f_\phi(g_\theta(x^U_t))$. Both the inner loss $\mathcal{L}_{\mathrm{inner}}$ and the outer loss $\mathcal{L}_{\mathrm{outer}}$ employ the cross-entropy loss function.
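To make the inner problem (5) concrete, here is a minimal sketch for a single task (T = 1) with fixed backbone features, using plain full-batch gradient steps with a subgradient for the L1 term. The names `fit_linear_head` and `predict`, the step size, the number of steps, and the toy features are all assumptions for illustration. The sharing term is omitted because it only becomes informative across several tasks.

```python
import math

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]

def fit_linear_head(features, labels, num_classes, alpha, lr=0.1, steps=200):
    """Minimize mean cross-entropy + alpha * (1/C) * sum |phi| over phi[m][c]."""
    n, M = len(features), len(features[0])
    phi = [[0.0] * num_classes for _ in range(M)]
    for _ in range(steps):
        grad = [[0.0] * num_classes for _ in range(M)]
        for z, y in zip(features, labels):
            logits = [sum(z[m] * phi[m][c] for m in range(M))
                      for c in range(num_classes)]
            probs = softmax(logits)
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)  # d(CE)/d(logit_c)
                for m in range(M):
                    grad[m][c] += err * z[m] / n
        for m in range(M):
            for c in range(num_classes):
                w = phi[m][c]
                sign = (w > 0) - (w < 0)  # subgradient of |w|, zero at w = 0
                phi[m][c] = w - lr * (grad[m][c] + alpha * sign / num_classes)
    return phi

def predict(phi, z):
    logits = [sum(z[m] * phi[m][c] for m in range(len(z)))
              for c in range(len(phi[0]))]
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical support set: feature 0 separates the classes, feature 1 is weak.
support_z = [[1.0, 0.3], [0.9, -0.2], [-1.0, 0.1], [-0.8, -0.3]]
support_y = [0, 0, 1, 1]
phi_star = fit_linear_head(support_z, support_y, num_classes=2, alpha=0.01)
support_pred = [predict(phi_star, z) for z in support_z]
```

In the full method the features would come from the backbone and the head would be refit for every sampled task; this sketch only isolates the head-fitting step.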
Task generation. Our method is applicable to standard supervised classification settings, where tasks are constructed dynamically. We define each task t as a C-way classification problem. Initially, a random subset of C classes is selected from a training domain D train , which encompasses K train classes. For each chosen class, we sample corresponding data points to form a random support set U t , with elements (x U t , y U ) ∈ U , and a disjoint random query set Q t , with elements (x Q t , y Q ) ∈ Q t .
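The task construction above can be sketched as follows. This is an illustrative sketch only: the function name `sample_task`, the dictionary layout `data_by_class`, and the relabeling of the chosen classes to 0..C-1 are assumptions not specified in the text.

```python
import random

def sample_task(data_by_class, num_ways, support_per_class, query_per_class, rng):
    """Sample a C-way task with disjoint support and query sets.

    data_by_class maps a class id to a list of examples from the training domain.
    """
    # Pick C of the K_train classes at random.
    classes = rng.sample(sorted(data_by_class), num_ways)
    support, query = [], []
    for label, cls in enumerate(classes):  # relabel chosen classes as 0..C-1
        # Sample without replacement, so support and query cannot overlap.
        items = rng.sample(data_by_class[cls], support_per_class + query_per_class)
        support += [(x, label) for x in items[:support_per_class]]
        query += [(x, label) for x in items[support_per_class:]]
    return classes, support, query

# Hypothetical training domain with K_train = 5 classes and 10 examples each.
toy_domain = {k: [f"example_{k}_{i}" for i in range(10)] for k in range(5)}
rng = random.Random(0)
task_classes, support_set, query_set = sample_task(
    toy_domain, num_ways=3, support_per_class=2, query_per_class=3, rng=rng)
```

Drawing the support and query examples from one sample without replacement is a simple way to guarantee that the two sets are disjoint.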
In practice, we solve the bi-level optimization problem defined by (4) and (5) iteratively. In each iteration, a batch of $T$ tasks is sampled, along with their respective support and query sets. First, the linear heads $f_\phi$ are fitted using samples from the support set $U_t$ by solving the inner optimization problem (5) via stochastic gradient descent for a fixed number of steps. Second, the backbone $g_\theta$ is updated using samples from the query set $Q_t$ by solving the outer optimization problem (4). This outer optimization employs implicit differentiation [11,31], as the optimal solution of the linear heads $\phi^*$ depends on the backbone $g_\theta$, precluding direct differentiation with respect to $\theta$. We address this dependency by utilizing the approximation strategy from [28] to compute the implicit gradients. A comprehensive summary of the algorithm is provided in Section B.1 of the Appendix.
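For orientation, the textbook form of the implicit gradient is sketched below. It is not the approximation of [28], and it assumes that the inner objective, written here as $J(\phi, \theta)$ (a symbol introduced only for this sketch), is twice differentiable at its minimizer $\phi^*(\theta)$ with an invertible Hessian. The L1 term violates this assumption at zero entries, which is one reason an approximation is needed in practice.

```latex
% J(\phi, \theta): inner objective (5); \phi^*(\theta): its minimizer (assumed smooth here).
\nabla_{\phi} J\big(\phi^*(\theta), \theta\big) = 0
\quad\Longrightarrow\quad
\frac{d\phi^*}{d\theta}
  = -\big[\nabla^2_{\phi\phi} J\big]^{-1} \nabla^2_{\phi\theta} J ,
\qquad
\frac{d\mathcal{L}_{\mathrm{outer}}}{d\theta}
  = \frac{\partial \mathcal{L}_{\mathrm{outer}}}{\partial \theta}
  + \Big(\frac{d\phi^*}{d\theta}\Big)^{\!\top}
    \frac{\partial \mathcal{L}_{\mathrm{outer}}}{\partial \phi}\Big|_{\phi = \phi^*}
```

The first identity is the stationarity condition of the inner problem; differentiating it with respect to $\theta$ gives the Jacobian of the optimal head, which the chain rule then combines with the outer loss.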

Section: Theoretical analysis
We present a theoretical analysis of the proposed minimality and sparse sufficiency principles, demonstrating their role in achieving identifiability within a controlled setting. As depicted in Figure 2, we assume the existence of a set of independent latent factors $z \sim \prod_{i=1}^{d} p(z_i)$ that generate observations $x$ through an unknown mixing function $x = g^*(z)$. Furthermore, we assume that the labels $y_t$ for a task $t$ depend solely on a subset of these factors, indexed by $S_t \sim p(S)$ (where $S$ is an index set on $z \in \mathcal{Z}$), via some unknown mixing function $y_t = f^*_t(z)$ (which may vary across tasks). We formalize the two principles imposed on $f^*$ as follows:
1. Sufficiency: $f^*_t = f^*_t|_{S_t}$ for $S_t \sim p(S)$
2. Minimality: $\nexists\, S' \neq S_t \subset S$ s.t. $f^*_t|_{S'} = f^*_t$,
where $f|_{S_t}$ indicates that the input to function $f$ is restricted to the index set $S_t$ (with all other entries set to zero). Principle (1) asserts that $f^*_t$ exclusively utilizes a subset of features, while (2) ensures the absence of duplicate features.
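The two principles can be illustrated on a toy labeling function. The sketch below is hypothetical: `f_star`, the binary factor grid, and the reading of minimality as "no proper subset of the support is itself sufficient" are assumptions made for the example, and both properties are only checked on a finite set of samples.

```python
from itertools import combinations, product

def restrict(z, support):
    """Keep the entries of z indexed by support and set all others to zero."""
    return [v if i in support else 0.0 for i, v in enumerate(z)]

def f_star(z):
    """Hypothetical task that depends on factors 0 and 2 only."""
    return int(z[0] > 0.5 and z[2] > 0.5)

def is_sufficient(f, support, samples):
    # Sufficiency: restricting the input to the support leaves f unchanged.
    return all(f(restrict(z, support)) == f(z) for z in samples)

def is_minimal(f, support, samples):
    # Minimality (as read here): the support is sufficient and no proper
    # subset of it is sufficient.
    if not is_sufficient(f, support, samples):
        return False
    items = sorted(support)
    for size in range(len(items)):
        for subset in combinations(items, size):
            if is_sufficient(f, set(subset), samples):
                return False
    return True

# All binary configurations of three factors.
samples = [[float(b) for b in bits] for bits in product([0, 1], repeat=3)]
sufficient_02 = is_sufficient(f_star, {0, 2}, samples)
minimal_02 = is_minimal(f_star, {0, 2}, samples)
minimal_012 = is_minimal(f_star, {0, 1, 2}, samples)
```

Here the support {0, 2} is both sufficient and minimal, while {0, 1, 2} is sufficient but not minimal because factor 1 can be dropped without changing the labels.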
Proposition 2.1. Assume that $g^*$ is a diffeomorphism (i.e., smooth with a smooth inverse), $f^*$ satisfies the aforementioned sufficiency and minimality properties, and $p(S)$ satisfies:
$$p(S \cap S' = \{i\}) > 0 \quad \text{or} \quad p\big(\{i\} \in (S \cup S') \setminus (S' \cap S)\big) > 0.$$
Under these conditions, by observing unlimited data from $p(X, Y)$, it is possible to recover a representation $\hat{z}$ that is an axis-aligned, component-wise transformation of $z$.
Remarks: This proposition serves as a crucial theoretical validation, demonstrating that in an idealized setting, our inductive biases are sufficient for recovering the factors of variation. It is important to note that the proof is non-constructive and does not prescribe a specific method. In practice, we leverage these same constraints as inductive biases, experimentally showing the emergence of disentangled representations in controlled synthetic environments. For real-world data, direct disentanglement measurement is challenging, a global notion of ground-truth factors might be ill-posed, and the assumptions of Proposition 2.1 are likely to be violated. Nevertheless, sparse sufficiency and minimality consistently yield meaningful factorizations of the representation for the tasks under consideration.
Relation to [47] and [58]: Our theoretical result aligns with concurrent work [47] and can be viewed as a corollary derived using a distinct proof technique and slightly relaxed assumptions. A key differentiator is our incorporation of feature minimality, which uniquely enables our framework to handle scenarios where the number of factors of variation is unknown—a critical aspect for real-world datasets, which form the primary focus of this paper. In contrast, [47] relies solely on sparse sufficiency, which, while sufficient for identifiability when the ground-truth number of factors is known, proves inadequate for achieving high disentanglement in its absence (see Figure 3) and translates poorly to real-world data (as shown in Table 16 and the empirical comparison in Appendix D.8). Intriguingly, their analysis also suggests that our approach offers benefits in terms of sample complexity for downstream transfer learning tasks. Our proof technique extends the general construction developed for multi-view data in [58], adapting it to our unique setting where we observe a single task dependent on a subset of factors, rather than multiple views with shared factors of variation.

Section: Related work
Learning from multiple tasks and domains. Our method addresses the critical challenge of learning a generalizable representation across diverse and potentially unseen tasks [15,103] and environments [105,32,44,97,63,94,64], particularly when these tasks exhibit competition during training [61,91,83]. Previous research has attempted to resolve task competition by employing task-specific modules that operate independently during training [67,101,80]. While these methods effectively learn specialized modules, they often fail to leverage synergistic information that might exist between tasks. In contrast, our approach aligns more closely with multi-task methods that aim to learn a generalist model by explicitly exploiting multi-task interactions [106,5]. While other meta-learning objectives have been proposed for multi-task learning [18,81,50,9], notably [50] learns a generalist model in a few-shot setting without explicitly promoting feature sharing or sparsity. Our work distinguishes itself by rephrasing the multi-task objective function to intrinsically encode both feature sharing and sparsity, thereby directly mitigating task competition.
Similar to prior work in domain generalization, we posit the existence of stable features for a given task [64,4,86,40,90] and amortize learning across multiple environments. However, unlike conventional approaches, we do not aim to learn an invariant representation a priori. Instead, our method learns sufficient and minimal features for each task, which are adaptively selected at test time by fitting a linear head. Following the perspective of [32], our approach can be interpreted as learning the final classifier through empirical risk minimization, but operating on features enriched with information from multiple domains.
Disentangled representations. Disentangled Representation Learning (DRL) [8,33] is fundamentally concerned with recovering the latent factors of variation that govern a given data distribution. A foundational result by [56] established that without some form of supervision (direct or indirect) on these Factors of Variation (FOVs), their recovery is generally impossible. Consequently, much subsequent work has shifted towards identifiable settings [58,25], often leveraging non-i.i.d. data and even accommodating latent causal relations between factors. These approaches typically fall into two broad categories: (1) methods where data is non-independently sampled, often assuming sparse interventions or sparse latent dynamics [30,55,13,100,2,79,48]; and (2) methods where data is non-identically distributed, such as being clustered into annotated groups [37,41,82,95,60]. Our method aligns with the second category, but crucially, we make no explicit assumptions about the factor distribution across tasks; instead, we focus solely on their relevance in terms of sparse sufficiency and minimality. This design choice is further reflected in our supervised classification training objective, which contrasts with the more common contrastive or unsupervised learning paradigms in the disentanglement literature. The work of [47], discussed in Section 2.3, represents a notable exception.

Section: Experiments
This section details our experimental setup and its underlying motivations.
Synthetic experiments. We begin by evaluating our method on established disentanglement benchmarks [62,14,71,49], which provide ground-truth annotations. This allows for a quantitative assessment of our ability to learn disentangled representations. We further investigate the correlation between minimality, feature sharing, and disentanglement measures (Section 4.1), as well as the compositional generalization capabilities of representations learned from a limited set of tasks. These experiments serve to validate our theoretical claims, demonstrating that under the assumptions of Proposition 2.1, our methods quantitatively recover the true factors of variation.
Domain generalization. For real-world datasets, where ground-truth disentanglement cannot be quantitatively measured and identifiability is not guaranteed (due to potential assumption violations), we shift our focus to the practical utility of disentangled representations. The ultimate goal is to learn features that transfer easily and robustly to downstream tasks. Thus, we first evaluate the efficacy of our representations on downstream tasks subject to distribution shifts, a context where isolating spurious features has been shown to enhance generalization in synthetic settings [19,58]. To assess the robustness of our representations to such shifts, we evaluate our method on domain generalization and domain shift tasks across six distinct benchmarks (Section 4.2). In a domain generalization scenario, we operate without access to samples from the testing domain, which is considered out-of-distribution (OOD) relative to the training domains. However, to solve a new task, our method relies on a small set of labeled data at test time to fit a linear head atop the learned feature space. Our strategy involves sampling class-balanced data points from the training distribution, assuming the label set Y remains constant in the testing domain, even if its distribution experiences subpopulation shifts.
Few-shot transfer learning. Finally, we assess the adaptability of our learned feature space to novel domains with limited labeled samples. For these transfer learning tasks, a linear head is fitted using the available sparse supervised data. The sparsity penalty α is kept consistent with its training value, while the feature sharing parameter β defaults to zero unless explicitly stated.
Experimental setting. To ensure a fair comparison with existing literature, we adhere to the standard experimental setup outlined in prior work [32,44]. Hyperparameters α and β are optimized via model selection on a validation set, unless otherwise specified. For baseline comparisons, we replace our backbone with that of the baseline method (e.g., detaching the classification head for ERM models) and then fit a new linear head on the same data. Critically, the linear head module trained at test time on top of the features remains identical for both our method and all comparative baselines. Despite its simplicity, the ERM baseline is included in our main paper experiments due to its demonstrated strong average performance on domain generalization benchmarks [32,44]. We also provide comparisons with other established approaches such as IRM [4], CORAL [85], and GroupDRO [73], with a more extensive comparison against [99,10,51,53,26,54,65,102,36,45] detailed in Appendix D.4. Comprehensive experimental details are provided in Appendix C.

Section: Synthetic experiments
We begin by demonstrating our approach's capability to recover the underlying factors of variation in synthetic data distributions, consistent with [62]. In these experiments, we assume partial information about a subset of factors of variation Z, and our objective is to learn a representation ẑ that aligns with these factors while effectively disregarding any spurious ones. We sample random tasks from a distribution T (detailed in Appendix C.3), focusing on binary classification problems where Y = {0, 1}. For instance, on the DSprites dataset, a valid task could be "Is there a big object on the left of the image?". In this specific case, the partially observed factors (quantized to two values) are the x-position and size. Table 1 illustrates how our proposed feature sufficiency and minimality properties facilitate disentanglement in the learned representations. We trained two identical models on a random distribution of sparse tasks defined on FOVs. For various datasets [62,14,49,71], the model trained without regularizers achieved comparable in-distribution (ID) accuracy but significantly lower disentanglement. Figure 3, a visual representation, further highlights the critical role of minimality. It plots the DCI metric for models trained on fixed DSprites tasks: training without regularizers results in no disentanglement (green); enforcing sparsity alone (yellow, akin to [47]) yields good disentanglement (DCI = 71.9), though features may still be split or duplicated. Crucially, enforcing both minimality and sparse sufficiency (magenta) achieves the best DCI (98.8). However, when β is excessively high (> 0.25), activated features tend to collapse into a few clusters across tasks. Full results and experiments on additional datasets are provided in Table 8 and Figures 6,7 in the Appendix.
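A task such as "Is there a big object on the left of the image?" can be sketched as a binary function of two quantized factors. Everything specific in this sketch is an assumption made for illustration: the factor names, factors normalized to [0, 1], the midpoint threshold, "left" meaning a low x-position, and the conjunction used to combine the quantized factors. The actual task distribution is the one described in Appendix C.3.

```python
import random

# Hypothetical factor names, with each factor normalized to [0, 1].
FACTORS = ("shape", "scale", "orientation", "pos_x", "pos_y")

def quantize(value, threshold=0.5):
    """Quantize a normalized factor to two values."""
    return int(value >= threshold)

def make_task(support, target_bits):
    """Binary task: label is 1 iff every supported factor hits its target bit."""
    def label(factors):
        return int(all(quantize(factors[name]) == bit
                       for name, bit in zip(support, target_bits)))
    return label

def sample_random_task(rng, support_size=2):
    """Draw a random sparse support and random target bits."""
    support = tuple(rng.sample(FACTORS, support_size))
    target_bits = tuple(rng.randint(0, 1) for _ in support)
    return support, make_task(support, target_bits)

# "Is there a big object on the left?": large scale and low x-position.
big_on_left = make_task(support=("scale", "pos_x"), target_bits=(1, 0))
image_factors = {"shape": 0.0, "scale": 0.9, "orientation": 0.3,
                 "pos_x": 0.1, "pos_y": 0.7}
label_left = big_on_left(image_factors)
label_right = big_on_left({**image_factors, "pos_x": 0.8})
```

Factors outside the support (here shape, orientation, and y-position) never affect the label, which is the sparsity structure the regularizers are meant to exploit.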
Subsequently, we randomly select and fix two groups of tasks, $S_1$ and $S_2$, totaling 18 tasks. Each task in these groups is supported by two FOVs, i.e., $|S_1| = |S_2| = 2$. These groups share one factor of variation but differ in another, such that $S_1 \cap S_2 = \{i\}$ for some factor index $i$. The data within these tasks are intentionally designed to contain spurious correlations, where FOVs outside the task support are spuriously correlated with the task label. We start from an overestimated latent dimension of 6 for $\hat{z}$, aiming to recover the true $z$ of size 3. Our network is trained to solve these tasks by enforcing sufficiency and minimality on the representation, varying the degrees of regularization through $\alpha$ and $\beta$. Figure 3 shows how the alignment of learned features with ground-truth factors of variation depends on the choice of $\alpha$ and $\beta$. Performance ranges from negligible disentanglement (DCI = 27.8) to strong alignment as sufficiency and minimality are increasingly enforced. The model achieving the highest alignment (DCI = 98.8) uses both sparsity and feature sharing. While sufficiency alone (similar to [47]) can correctly identify the task-relevant support, it often leads to split or duplicated features, resulting in lower disentanglement (DCI = 71.9). The feature sharing penalty promotes clustering in the feature space with respect to tasks, which is what yields high disentanglement. However, an excessively high $\beta$ ($\beta > 0.25$) can lead to failure cases where features over-cluster.
Table 1: Enforcing disentanglement. Table 1 presents DCI [22] disentanglement scores and in-distribution (ID) accuracy on test samples. It compares a model trained without enforcing sufficiency and minimality (top row) against a model with the regularizers activated (bottom row). While both models achieve similar accuracy, the model with activated regularizers consistently demonstrates significantly higher disentanglement. Refer to Table 7 for additional scores.
Disentanglement and minimality are correlated. In the synthetic setting, we further illustrate the role of the feature sharing penalty. Minimizing the entropy of feature activations across mini-batches of tasks induces clustering within the feature space. We measured the correlation between the strength of this penalty and disentanglement metrics [22] by training 15 distinct models on the DSprites dataset, each with a linearly increasing β from 0 to 0.2. The correlation coefficient between β and the DCI metric of the representations computed by each model is 0.947, indicating that the feature sharing property strongly promotes disentanglement. This finding further supports the claim that sufficiency alone (i.e., solely enforcing sparsity) does not achieve optimal disentanglement.
Task compositional generalization. We conclude our synthetic experiments by evaluating the compositional generalization capabilities of the features learned by our method. This involves testing our model on a set of unseen tasks formed by novel combinations of tasks encountered during training. To achieve this, we first trained two models on the AbstractDSprites dataset using a random distribution of tasks, with each task's support limited to two factors (|S| = 2). The models differed only in whether their linear heads had regularizers activated or deactivated. Subsequently, we evaluated these models on 100 tasks sampled from a distribution with progressively increasing support on the factors of variation (|S| = 3, |S| = 4, |S| = 5). These tasks directly correspond to compositions of the tasks seen during training. The results are visualized in Figure 4, with accompanying detailed values presented in Table 9 in Appendix D.

Section: Domain Generalization
In this section, we evaluate our method on benchmarks from the domain generalization field [32,93,70] and those involving subpopulation distribution shifts [73,44]. Our objective is to demonstrate that a feature space learned with our proposed inductive biases exhibits strong performance on out-of-distribution real-world data.
Subpopulation shifts. Subpopulation shifts manifest when the distribution of minority groups varies across different domains. We hypothesize that a feature space satisfying sparse sufficiency and minimality is inherently more robust to spurious correlations that disproportionately affect minority groups, thereby facilitating superior transferability to novel distributions. To validate this claim, we conducted experiments on two established benchmarks: Waterbirds [73] and CivilComments [44] (see Appendix C.1 for details). For both datasets, we utilized the original train and test splits. Table 4 (last row) presents the results on the Waterbirds test set for each group (landbirds on land, landbirds on water, waterbirds on land, and waterbirds on water, respectively). The linear head was fitted on a random, class-balanced subset of the training domain, with the process repeated 10 times to report mean accuracy and standard deviation on the test set. For CivilComments, Figure 5 displays both the average and worst-group accuracy, comparing our method against ERM and GroupDRO [73]. While achieving performance comparable to ERM in terms of average accuracy, our method demonstrates superior robustness to spurious correlations, evidenced by its higher worst-group accuracy. Notably, we outperform GroupDRO, which explicitly leverages subdomain statistics, whereas our approach operates without any prior knowledge of group composition. Detailed per-group results are provided in the Appendix (Table 11).
Camelyon17. For the Camelyon17 dataset, the model is trained strictly according to its original data splits. Table 3 presents our model's accuracy on both in-distribution (ID) and out-of-distribution (OOD) splits, benchmarked against various baselines [84,4]. Our method consistently achieves superior performance on the OOD test domains. We attribute this enhanced OOD generalization to minimality, which encourages the retention of features shared across the three training domains while diminishing the importance of domain-specific features (which often encapsulate spurious correlations with hospital environmental information). This effect can be further amplified at test time, as demonstrated in the ablation study in Appendix D.9, by trading off some in-distribution performance for improved OOD accuracy.
Finally, we showcase the ability of features learned with our method to adapt to a new domain with a small number of samples in a few-shot setting. We compare the results with ERM in Table 2, averaged by domains in each benchmark dataset. The comprehensive scores for individual domains, including mean accuracy and standard deviations over 100 draws for 1-shot, 5-shot, and 10-shot settings, are provided in Appendix D.5. Our approach consistently yields higher accuracy than ERM, underscoring the superior adaptation capabilities of our minimal and sufficiently sparse feature space.
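For readers unfamiliar with the subpopulation-shift metrics used above, the following sketch computes average and worst-group accuracy from predictions, labels, and group identifiers. The function names and the toy inputs are hypothetical and do not correspond to any reported result.

```python
from collections import defaultdict

def group_accuracies(preds, labels, groups):
    """Accuracy computed separately within each group."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {group: correct[group] / total[group] for group in total}

def average_and_worst_group(preds, labels, groups):
    """Return (average accuracy, worst-group accuracy, per-group accuracies)."""
    per_group = group_accuracies(preds, labels, groups)
    average = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
    return average, min(per_group.values()), per_group

# Toy inputs for illustration only (not data from any benchmark).
toy_preds = [1, 1, 0, 0, 1, 0]
toy_labels = [1, 1, 0, 1, 1, 0]
toy_groups = ["a", "a", "a", "b", "b", "b"]
avg_acc, worst_acc, per_group_acc = average_and_worst_group(
    toy_preds, toy_labels, toy_groups)
```

Average accuracy can hide failures on a small group, which is why the worst-group value is the quantity of interest when spurious correlations affect minority groups.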

Section: Additional results
Appendix D provides a comprehensive collection of supplementary results. This includes a detailed comparison with 14 baseline methods on various domain shift benchmarks (D.4), a qualitative and quantitative analysis of minimality and sparse sufficiency properties in real-world settings (D.2), and a favorable comparison against 6 additional meta-learning baselines, including [47] (D.8). Furthermore, an ablation study on the impact of feature clustering at test time is presented (D.9), along with a demonstration of our approach's ability to yield a task similarity measure (D.7).

Section: Conclusions
In this paper, we presented an approach for learning disentangled representations from a distribution of tasks by enforcing feature sparsity and sharing. We established the identifiability of this setting theoretically and validated it through experiments in controlled synthetic environments. Furthermore, our empirical results support the benefits of these representations for out-of-distribution generalization in real-world settings, specifically by isolating spurious and domain-specific factors that are detrimental under distribution shifts.
Limitations and future work: A primary limitation of our current work lies in the global assumption regarding the strength of the sparsity and feature sharing regularizers, α and β, applied uniformly across all tasks. In practical, real-world scenarios, the optimal representation properties might vary significantly for different tasks. As observed in our synthetic experiments (Figure 3), an excessively high β (e.g., > 0.25) can lead to features clustering too aggressively, hindering clear disentanglement and generalization. Future work could address this by exploiting task-specific knowledge (e.g., task distance measures) to adaptively tune α and β during training, or by conditioning training on a distribution of regularization parameters, as in [21], to enhance test-time generalization. Another area for improvement concerns the sampling procedure used to fit the linear head at test time. Random, class-balanced sampling from the training set may not always be optimal for achieving peak performance under diverse distribution shifts. Exploring alternative sampling strategies that incorporate available knowledge about the distribution shift (e.g., as in [43]) could lead to substantial performance gains at test time.

Section: Acknowledgments and Disclosure of Funding
Marco Fumero and Emanuele Rodolà gratefully acknowledge support from the ERC grant no. 802554 (SPECGEO), PRIN 2020 project no. 2020TA3K9N (LEGO.AI), and PNRR MUR project PE0000013-FAIR. Marco Fumero and Francesco Locatello were affiliated with Amazon during a portion of this project. We extend our gratitude to Julius von Kügelgen, Sébastien Lachapelle, and the anonymous reviewers for their invaluable feedback and insightful suggestions.


References:
[b0] Julius Adebayo; Justin Gilmer; Michael Muelly; Ian J Goodfellow; Moritz Hardt; Been Kim (2018-12-03). Sanity checks for saliency maps. 
[b1] Kartik Ahuja; Karthikeyan Shanmugam; Kush R Varshney; Amit Dhurandhar (2020-07). Invariant risk minimization games. PMLR
[b2] Isabela Albuquerque; João Monteiro; Mohammad Darvishi; Tiago H Falk; Ioannis Mitliagkas (2019). Generalizing to unseen domains via distribution matching. 
[b3] Martin Arjovsky; Léon Bottou; Ishaan Gulrajani; David Lopez-Paz (2019). Invariant risk minimization. 
[b4] Jinze Bai; Rui Men; Hao Yang; Xuancheng Ren; Kai Dang; Yichang Zhang; Xiaohuan Zhou; Peng Wang; Sinan Tan; An Yang (2022). Ofasys: A multi-modal multi-task learning system for building generalist models. 
[b5] Peter Bandi (). Camelyon17 dataset. 
[b6] Sara Beery; Grant Van Horn; Pietro Perona (2018). Recognition in terra incognita. 
[b7] Yoshua Bengio; Aaron Courville; Pascal Vincent (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence
[b8] Luca Bertinetto; João F Henriques; Philip H S Torr; Andrea Vedaldi (2019). Meta-learning with differentiable closed-form solvers. 
[b9] Gilles Blanchard; Aniket Anand Deshmukh; Ürun Dogan; Gyemin Lee; Clayton Scott (2021). Domain generalization by marginal transfer learning. The Journal of Machine Learning Research
[b10] Mathieu Blondel; Quentin Berthet; Marco Cuturi; Roy Frostig; Stephan Hoyer; Felipe Llinares-López; Fabian Pedregosa; Jean-Philippe Vert (2021). Efficient and modular implicit differentiation. 
[b11] Daniel Borkan; Lucas Dixon; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman (2019). Nuanced metrics for measuring unintended bias with real data for text classification. 
[b12] Johann Brehmer; Pim De Haan; Phillip Lippe; Taco Cohen (2022). Weakly supervised causal representation learning. 
[b13] Chris Burgess; Hyunjik Kim (2018). 3d shapes dataset. 
[b14] Rich Caruana (1997). Multitask learning. Machine learning
[b15] Xi Chen; Yan Duan; Rein Houthooft; John Schulman; Ilya Sutskever; Pieter Abbeel (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. 
[b16] Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li (2009-06-25). Imagenet: A large-scale hierarchical image database. IEEE Computer Society
[b17] Guneet Singh Dhillon; Pratik Chaudhari; Avinash Ravichandran; Stefano Soatto (2020). A baseline for few-shot image classification. 
[b18] Andrea Dittadi; Frederik Träuble; Francesco Locatello; Manuel Wuthrich; Vaibhav Agrawal; Ole Winther; Stefan Bauer; Bernhard Schölkopf (2021). On the transfer of disentangled representations in realistic settings. 
[b19] Lucas Dixon; John Li; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman (2018). Measuring and mitigating unintended bias in text classification. 
[b20] Alexey Dosovitskiy; Josip Djolonga (2020). You only train once: Loss-conditional training of deep networks. 
[b21] Cian Eastwood; Christopher K I Williams (2018-05-03). A framework for the quantitative evaluation of disentangled representations. 
[b22] M Everingham; L Van Gool; C K I Williams; J Winn; A Zisserman (2007). The PASCAL Visual Object Classes Challenge. 
[b23] Li Fei-Fei; Rob Fergus; Pietro Perona (2004). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. IEEE
[b24] Marco Fumero; Luca Cosmo; Simone Melzi; Emanuele Rodolà (2021-07). Learning disentangled representations via product manifold projection. PMLR
[b25] Yaroslav Ganin; Evgeniya Ustinova; Hana Ajakan; Pascal Germain; Hugo Larochelle; François Laviolette; Mario Marchand; Victor Lempitsky (2016). Domain-adversarial training of neural networks. The journal of machine learning research
[b26] Robert Geirhos; Jörn-Henrik Jacobsen; Claudio Michaelis; Richard Zemel; Wieland Brendel; Matthias Bethge; Felix A Wichmann (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence
[b27] Zhengyang Geng; Xin-Yu Zhang; Shaojie Bai; Yisen Wang; Zhouchen Lin (2021-12-06). On training implicit models. 
[b28] Ian J Goodfellow; Quoc V Le; Andrew M Saxe; Honglak Lee; Andrew Y Ng (2009-12-10). Measuring invariances in deep networks. Curran Associates, Inc.
[b30] Anirudh Goyal; Alex Lamb; Jordan Hoffmann; Shagun Sodhani; Sergey Levine; Yoshua Bengio; Bernhard Schölkopf (2020). Recurrent independent mechanisms. 
[b31] Andreas Griewank; Andrea Walther (2008). Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM
[b32] Ishaan Gulrajani; David Lopez-Paz (2021). In search of lost domain generalization. 
[b33] Irina Higgins; Loïc Matthey; Arka Pal; Christopher Burgess; Xavier Glorot; Matthew Botvinick; Shakir Mohamed; Alexander Lerchner (2017). beta-vae: Learning basic visual concepts with a constrained variational framework. 
[b34] Timothy Hospedales; Antreas Antoniou; Paul Micaelli; Amos Storkey (2020). Meta-learning in neural networks: A survey. 
[b35] Ziniu Hu; Zhe Zhao; Xinyang Yi; Tiansheng Yao; Lichan Hong; Yizhou Sun; Ed H Chi (2022). Improving multi-task generalization via regularizing spurious correlation. 
[b36] Zeyi Huang; Haohan Wang; Eric P Xing; Dong Huang (2020). Self-challenging improves cross-domain generalization. Springer
[b37] Aapo Hyvärinen; Hiroaki Sasaki; Richard E Turner (2019-04). Nonlinear ICA using auxiliary variables and generalized contrastive learning. PMLR
[b38] Ali Jalali; Sujay Sanghavi; Chao Ruan; Pradeep Ravikumar (2010). A dirty model for multi-task learning. Curran Associates, Inc.
[b40] Hicham Janati; Marco Cuturi; Alexandre Gramfort (2019-04). Wasserstein regularization for sparse multi-task regression. PMLR
[b41] Yibo Jiang; Victor Veitch (2022). Invariant and transportable representations for anti-causal domain shifts. 
[b42] Ilyes Khemakhem; Diederik P Kingma; Ricardo Pio Monti; Aapo Hyvärinen (2020-08). Variational autoencoders and nonlinear ICA: A unifying framework. PMLR
[b43] Diederik P Kingma; Jimmy Ba (2015). Adam: A method for stochastic optimization. 
[b44] Polina Kirichenko; Pavel Izmailov; Andrew Gordon Wilson (2022). Last layer re-training is sufficient for robustness to spurious correlations. 
[b45] Pang Wei Koh; Shiori Sagawa; Henrik Marklund; Sang Michael Xie; Marvin Zhang; Akshay Balsubramani; Weihua Hu; Michihiro Yasunaga; Richard Lanas Phillips; Irena Gao; Tony Lee; Etienne David; Ian Stavness; Wei Guo; Berton Earnshaw; Imran S Haque; Sara M Beery; Jure Leskovec; Anshul Kundaje; Emma Pierson; Sergey Levine; Chelsea Finn; Percy Liang (2021-07). WILDS: A benchmark of in-the-wild distribution shifts. PMLR
[b46] David Krueger; Ethan Caballero; Joern-Henrik Jacobsen; Amy Zhang; Jonathan Binas; Dinghuai Zhang; Remi Le Priol; Aaron Courville (2021). Out-of-distribution generalization via risk extrapolation (rex). PMLR
[b47] Tejas D Kulkarni; William F Whitney; Pushmeet Kohli; Joshua B Tenenbaum (2015). Deep convolutional inverse graphics network. 
[b48] Sébastien Lachapelle; Tristan Deleu; Divyat Mahajan; Ioannis Mitliagkas; Yoshua Bengio; Simon Lacoste-Julien; Quentin Bertrand (2022). Synergies between disentanglement and sparsity: a multi-task learning perspective. 
[b49] Sébastien Lachapelle; Pau Rodriguez; Yash Sharma; Katie E Everett; Rémi Le Priol; Alexandre Lacoste; Simon Lacoste-Julien (2022). Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica. PMLR
[b50] Yann LeCun; Fu Jie Huang; Léon Bottou (2004). Learning methods for generic object recognition with invariance to pose and lighting. IEEE
[b51] Kwonjoon Lee; Subhransu Maji; Avinash Ravichandran; Stefano Soatto (2019). Meta-learning with differentiable convex optimization. 
[b52] Da Li; Yongxin Yang; Yi-Zhe Song; Timothy Hospedales (2018). Learning to generalize: Meta-learning for domain generalization. 
[b53] Da Li; Yongxin Yang; Yi-Zhe Song; Timothy M Hospedales (2017). Deeper, broader and artier domain generalization. IEEE Computer Society
[b54] Haoliang Li; Sinno Jialin Pan; Shiqi Wang; Alex C Kot (2018). Domain generalization with adversarial feature learning. 
[b55] Ya Li; Xinmei Tian; Mingming Gong; Yajing Liu; Tongliang Liu; Kun Zhang; Dacheng Tao (2018). Deep domain generalization via conditional invariant adversarial networks. 
[b56] Phillip Lippe; Sara Magliacane; Sindy Löwe; Yuki M Asano; Taco Cohen; Stratis Gavves (2022-07-23). CITRIS: causal identifiability from temporal intervened sequences. PMLR
[b57] Francesco Locatello; Stefan Bauer; Mario Lucic; Gunnar Rätsch; Sylvain Gelly; Bernhard Schölkopf; Olivier Bachem (2019-06-15). Challenging common assumptions in the unsupervised learning of disentangled representations. PMLR
[b58] Francesco Locatello; Stefan Bauer; Mario Lucic; Gunnar Rätsch; Sylvain Gelly; Bernhard Schölkopf; Olivier Bachem (2020). A sober look at the unsupervised learning of disentangled representations and their evaluation. J. Mach. Learn. Res
[b59] Francesco Locatello; Ben Poole; Gunnar Rätsch; Bernhard Schölkopf; Olivier Bachem; Michael Tschannen (2020-07). Weakly-supervised disentanglement without compromises. PMLR
[b60] Aurelie C Lozano; Grzegorz Swirszcz (2012-07-01). Multi-level lasso for sparse multi-task regression. 
[b61] Chaochao Lu; Yuhuai Wu; José Miguel Hernández-Lobato; Bernhard Schölkopf (2022). Invariant causal representation learning for out-of-distribution generalization. 
[b62] Zvika Marx; Michael T Rosenstein; Leslie Pack Kaelbling; Thomas G Dietterich (2005). Transfer learning with an ensemble of background tasks. Inductive Transfer
[b63] Loic Matthey; Irina Higgins; Demis Hassabis; Alexander Lerchner (2017). dsprites: Disentanglement testing sprites dataset. 
[b64] John Miller; Rohan Taori; Aditi Raghunathan; Shiori Sagawa; Pang Wei Koh; Vaishaal Shankar; Percy Liang; Yair Carmon; Ludwig Schmidt (2021-07). Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. PMLR
[b65] Krikamol Muandet; David Balduzzi; Bernhard Schölkopf (2013-06-21). Domain generalization via invariant feature representation. 
[b66] Hyeonseob Nam; Hyunjae Lee; Jongchan Park; Wonjun Yoon; Donggeun Yoo (2021). Reducing domain gap by reducing style bias. 
[b67] Boris N Oreshkin; Pau Rodríguez López; Alexandre Lacoste (2018-12-03). TADAM: task dependent adaptive metric for improved few-shot learning. 
[b68] G Parascandolo; N Kilbertus; M Rojas-Carulla; B Schölkopf (2018). Learning independent causal mechanisms. 
[b69] Ji Ho Park; Jamin Shin; Pascale Fung (2018). Reducing gender bias in abusive language detection. Association for Computational Linguistics
[b70] Adam Paszke; Sam Gross; Francisco Massa; Adam Lerer; James Bradbury; Gregory Chanan; Trevor Killeen; Zeming Lin; Natalia Gimelshein; Luca Antiga; Alban Desmaison; Andreas Kopf; Edward Yang; Zachary Devito; Martin Raison; Alykhan Tejani; Sasank Chilamkurthy; Benoit Steiner; Lu Fang; Junjie Bai; Soumith Chintala (2019). PyTorch: An imperative style, high-performance deep learning library. Curran Associates, Inc.
[b72] Jielin Qiu; Yi Zhu; Xingjian Shi; Florian Wenzel; Zhiqiang Tang; Ding Zhao; Bo Li; Mu Li (2022). Are multimodal models robust to image and text perturbations?. 
[b73] Scott E Reed; Yi Zhang; Yuting Zhang; Honglak Lee (2015). Deep visual analogy-making. 
[b74] Bryan C Russell; Antonio Torralba; Kevin P Murphy; William T Freeman (2008). Labelme: a database and web-based tool for image annotation. International journal of computer vision
[b75] Shiori Sagawa; Pang Wei Koh; Tatsunori B Hashimoto; Percy Liang (2019). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. 
[b76] Shiori Sagawa; Aditi Raghunathan; Pang Wei Koh; Percy Liang (2020-07). An investigation of why overparameterization exacerbates spurious correlations. PMLR
[b77] Ruslan Salakhutdinov (2014). Deep learning. ACM
[b78] Victor Sanh; Lysandre Debut; Julien Chaumond; Thomas Wolf (2019). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. 
[b79] Jürgen Schmidhuber (1992). Learning factorial codes by predictability minimization. Neural computation
[b80] Bernhard Schölkopf; Francesco Locatello; Stefan Bauer; Nan Rosemary Ke; Nal Kalchbrenner; Anirudh Goyal; Yoshua Bengio (2021). Toward causal representation learning. 
[b81] Anna Seigal; Chandler Squires; Caroline Uhler (2022). Linear causal disentanglement via interventions. 
[b82] Amanpreet Singh; Ronghang Hu; Vedanuj Goswami; Guillaume Couairon; Wojciech Galuba; Marcus Rohrbach; Douwe Kiela (2021). FLAVA: A foundational language and vision alignment model. 
[b83] Jake Snell; Kevin Swersky; Richard S Zemel (2017). Prototypical networks for few-shot learning. 
[b84] Peter Sorrenson; Carsten Rother; Ullrich Köthe (2020). Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). 
[b85] Trevor Standley; Amir Roshan Zamir; Dawn Chen; Leonidas J Guibas; Jitendra Malik; Silvio Savarese (2020-07). Which tasks should be learned together in multi-task learning?. PMLR
[b86] Baochen Sun; Jiashi Feng; Kate Saenko (2017). Correlation alignment for unsupervised domain adaptation. Springer
[b87] Baochen Sun; Kate Saenko (2016). Deep coral: Correlation alignment for deep domain adaptation. Springer
[b88] Victor Veitch; Alexander D'Amour; Steve Yadlowsky; Jacob Eisenstein (2021). Counterfactual invariance to spurious correlations: Why and how to pass stress tests. 
[b89] Hemanth Venkateswara; Jose Eusebio; Shayok Chakraborty; Sethuraman Panchanathan (2017). Deep hashing network for unsupervised domain adaptation. IEEE Computer Society
[b90] Oriol Vinyals; Charles Blundell; Tim Lillicrap; Koray Kavukcuoglu; Daan Wierstra (2016). Matching networks for one shot learning. 
[b91] Catherine Wah; Steve Branson; Peter Welinder; Pietro Perona; Serge Belongie (2011). The caltech-ucsd birds-200-2011 dataset. 
[b92] Zihao Wang; Victor Veitch (2022). A unified causal view of domain invariant representation learning. 
[b93] Zirui Wang; Zihang Dai; Barnabás Póczos; Jaime G Carbonell (2019). Characterizing and avoiding negative transfer. Computer Vision Foundation / IEEE
[b94] Martin Wattenberg; Fernanda Viégas; Ian Johnson (2016). How to use t-sne effectively. Distill
[b95] Florian Wenzel; Andrea Dittadi; Peter V Gehler; Carl-Johann Simon-Gabriel; Max Horn; Dominik Zietlow; David Kernert; Chris Russell; Thomas Brox; Bernt Schiele; Bernhard Schölkopf; Francesco Locatello (2022). Assaying out-of-distribution generalization in transfer learning. 
[b96] Olivia Wiles; Sven Gowal; Florian Stimberg; Sylvestre-Alvise Rebuffi; Ira Ktena; Krishnamurthy Dvijotham; Ali Taylan Cemgil (2022). A fine-grained analysis on distribution shift. 
[b97] Matthew Willetts; Brooks Paige (2021). I don't need u: Identifiable non-linear ica without side information. 
[b98] Thomas Wolf; Lysandre Debut; Victor Sanh; Julien Chaumond; Clement Delangue; Anthony Moi; Pierric Cistac; Tim Rault; Rémi Louf; Morgan Funtowicz (2019). Huggingface's transformers: State-of-the-art natural language processing. 
[b99] Mitchell Wortsman; Gabriel Ilharco; Samir Yitzhak Gadre; Rebecca Roelofs; Raphael Gontijo Lopes; Ari S Morcos; Hongseok Namkoong; Ali Farhadi; Yair Carmon; Simon Kornblith; Ludwig Schmidt (2022-07-23). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. PMLR
[b100] Jianxiong Xiao; James Hays; Krista A Ehinger; Aude Oliva; Antonio Torralba (2010-06-18). SUN database: Large-scale scene recognition from abbey to zoo. IEEE Computer Society
[b101] Shen Yan; Huan Song; Nanxiang Li; Lincan Zou; Liu Ren (2020). Improve unsupervised domain adaptation with mixup training. 
[b102] Weiran Yao; Yuewen Sun; Alex Ho; Changyin Sun; Kun Zhang (2022). Learning temporally causal latent processes from general temporal data. 
[b103] Lu Yuan; Dongdong Chen; Yi-Ling Chen; Noel Codella; Xiyang Dai; Jianfeng Gao; Houdong Hu; Xuedong Huang; Boxin Li; Chunyuan Li; Ce Liu; Mengchen Liu; Zicheng Liu; Yumao Lu; Yu Shi; Lijuan Wang; Jianfeng Wang; Bin Xiao; Zhen Xiao; Jianwei Yang; Michael Zeng; Luowei Zhou; Pengchuan Zhang (2021). Florence: A new foundation model for computer vision. 
[b104] Marvin Zhang; Henrik Marklund; Nikita Dhawan; Abhishek Gupta; Sergey Levine; Chelsea Finn (2021). Adaptive risk minimization: Learning to adapt to domain shift. Advances in Neural Information Processing Systems
[b105] Yu Zhang; Qiang Yang (2018). An overview of multi-task learning. National Science Review
[b106] Bolei Zhou; Agata Lapedriza; Aditya Khosla; Aude Oliva; Antonio Torralba (2017). Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence
[b107] Kaiyang Zhou; Ziwei Liu; Yu Qiao; Tao Xiang; Chen Change Loy (2021). Domain generalization: A survey. 
[b108] Jinguo Zhu; Xizhou Zhu; Wenhai Wang; Xiaohua Wang; Hongsheng Li; Xiaogang Wang; Jifeng Dai (2022). Uni-perceiver-moe: Learning sparse generalist models with conditional moes. 

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: Model scheme: illustration of the inner loop stage (top) and the outer loop, following the steps of the algorithmic procedure described in Section B.1 in the Appendix. Reg(ϕ) is composed of the weighted sum of a sparsity penalty $\mathrm{Reg}_{L_1}$ and an entropy-based feature sharing penalty $\mathrm{Reg}_{\mathrm{sharing}}$: $\mathrm{Reg}(\phi) = \alpha\,\mathrm{Reg}_{L_1}(\phi) + \beta\,\mathrm{Reg}_{\mathrm{sharing}}(\phi)$ (1)
Data: 

Figure fig_1: 2
Type: figure
Caption: Figure 2: Assumed causal generative model: the gray variables are unobserved. Observations x are generated by some unknown mixing of a set of factors of variation z. Additionally, we observe a distribution of supervised tasks, each depending only on a subset of the factors of variation indexed by S.
Data: 

Figure fig_2: 4
Type: figure
Caption: Figure 4: Task compositional generalization: Mean accuracy over 100 random test tasks, reported for groups of tasks of growing support (second, third, fourth column), for a model trained without inductive biases (blue, attaining DCI = 29.4) and one enforcing them (orange, DCI = 59.4). The latter shows better compositional generalization, resulting from the properties enforced on the representation. Exact values are reported in Table 9 in the Appendix.
Data: 

Figure fig_3: 5
Type: figure
Caption: Figure 5: Quantitative results on CivilComments: we report the test accuracy averaged across all demographic groups (left group) and the worst-group accuracy (right group). Our method (green) performs similarly in terms of average accuracy and outperforms in terms of worst-group accuracy, without using any knowledge of the group composition in the training data. For exact values and error estimates, see Table 10 in the Appendix.
Data: 

Figure tab_0: 2
Type: table
Caption: Quantitative results for few-shot transfer learning, with our method consistently outperforming ERM across all sample sizes and data sets.
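Data: OOD accuracy (averaged by domains)

| N-shot | Algorithm | PACS | VLCS | OfficeHome | Waterbirds |
|---|---|---|---|---|---|
| 1-shot | ERM | 80.5 | 59.7 | 56.4 | 79.8 |
| 1-shot | Ours | 81.5 | 68.2 | 58.4 | 88.4 |
| 5-shot | ERM | 87.1 | 71.7 | 75.7 | 79.8 |
| 5-shot | Ours | 88.3 | 74.5 | 77.0 | 87.6 |
| 10-shot | ERM | 87.9 | 74.0 | 81.0 | 84.2 |
| 10-shot | Ours | 90.4 | 77.3 | 82.0 | 89.2 |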

Figure tab_1: 3
Type: table
Caption: Quantitative evaluation on Camelyon17: we report accuracy both on ID and OOD splits. Our approach achieves significantly higher validation and test OOD accuracy.
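Data:

| Algorithm | Validation (ID) | Validation (OOD) | Test (OOD) |
|---|---|---|---|
| ERM | 93.2 | 84 | 70.3 |
| CORAL | 95.4 | 86.2 | 59.5 |
| IRM | 91.6 | 86.2 | 64.2 |
| Ours | 93.2 ± 0.3 | 89.9 ± 0.6 | 74.1 ± 0.2 |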

Figure tab_2: 
Type: table
Caption: Table 10 in the Appendix.
Data: DomainBed. We evaluate the domain generalization performance on the PACS, VLCS and OfficeHome datasets from the DomainBed [32] test suite (see Appendix C.1 for more details). On these datasets, we train on N-1 domains and leave one out for testing. Regularization parameters α and β are tuned according to validation sets of PACS, and used accordingly on the other datasets. For these experiments we use a ResNet50 pretrained on Imagenet [17] as a backbone, as done in [32]. To fit the linear head we sample 10 times with different sample sizes from the training domains and we report the mean score and standard deviation. Results are reported in Table 4, showing how enforcing sparse sufficiency and minimality leads consistently to better OOD performance. Comparisons with 13 additional baselines are in Appendix D.4.

Figure tab_3: 4
Type: table
Caption: Results for domain generalization on DomainBed. Our approach achieves consistently higher average OOD generalization, outperforming ERM in all cases except one.
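Data: OOD accuracy (by domain)

| PACS | S | A | P | C | Average |
|---|---|---|---|---|---|
| ERM | 77.9 ± 0.4 | 88.1 ± 0.1 | 97.8 ± 0.0 | 79.1 ± 0.9 | 85.7 |
| Ours | 83.1 ± 0.1 | 86.7 ± 0.8 | 97.8 ± 0.1 | 83.5 ± 0.1 | 87.5 |

| VLCS | C | L | V | S | Average |
|---|---|---|---|---|---|
| ERM | 97.6 ± 1.0 | 63.3 ± 0.9 | 76.4 ± 1.5 | 72.2 ± 0.5 | 77.4 |
| Ours | 98.1 ± 0.2 | 63.4 ± 0.5 | 78.2 ± 0.7 | 73.9 ± 0.8 | 78.4 |

| OfficeHome | C | A | P | R | Average |
|---|---|---|---|---|---|
| ERM | 53.4 ± 0.6 | 62.7 ± 1.1 | 76.5 ± 0.4 | 77.3 ± 0.6 | 67.5 |
| Ours | 56.3 ± 0.1 | 66.7 ± 0.7 | 79.2 ± 0.5 | 81.3 ± 0.4 | 70.9 |

| Waterbirds | LL | LW | WL | WW | Average |
|---|---|---|---|---|---|
| ERM | 98.6 ± 0.3 | 52.05 ± 3 | 68.5 ± 3 | 93 ± 0.3 | 81.3 |
| Ours | 99.5 ± 0.1 | 73.0 ± 2.5 | 85.0 ± 2 | 95.5 ± 0.4 | 90.5 |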


Formulas:
Formula formula_0: $\mathrm{Reg}(\phi) = \alpha\,\mathrm{Reg}_{L_1}(\phi) + \beta\,\mathrm{Reg}_{\mathrm{sharing}}(\phi)$ (1)

Formula formula_2: $\mathrm{Reg}_{L_1}(\phi) = \frac{1}{TC} \sum_{t,c,m} |\phi_{t,m,c}|$ (2)

Formula formula_3: $\mathrm{Reg}_{\mathrm{sharing}}(\phi) = H(\varphi_m) = -\sum_{m} \varphi_m \log(\varphi_m)$ (3)
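To make Eqs. (1)-(3) concrete, here is a minimal Python sketch of the regularizer for a weight tensor stored as nested lists `phi[t][m][c]` (task, feature, class). It is a sketch under stated assumptions, not our implementation: the function names are hypothetical, and the per-feature quantity entering the entropy in Eq. (3) is assumed to be the absolute weight mass of feature m, summed over tasks and classes and normalized to sum to one over features.

```python
import math


def reg_l1(phi):
    """Eq. (2): (1 / (T * C)) * sum over t, c, m of |phi[t][m][c]|."""
    T = len(phi)
    C = len(phi[0][0])
    total = sum(abs(w) for task in phi for row in task for w in row)
    return total / (T * C)


def reg_sharing(phi):
    """Eq. (3): entropy of a per-feature distribution.

    ASSUMPTION: the distribution is the absolute weight mass of each
    feature m, summed over tasks and classes, normalized over m.
    """
    M = len(phi[0])
    mass = [0.0] * M
    for task in phi:
        for m, row in enumerate(task):
            mass[m] += sum(abs(w) for w in row)
    total = sum(mass)
    if total == 0.0:
        return 0.0
    probs = [x / total for x in mass]
    # Terms with zero probability contribute nothing (0 * log 0 := 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reg(phi, alpha, beta):
    """Eq. (1): weighted sum of the sparsity and sharing penalties."""
    return alpha * reg_l1(phi) + beta * reg_sharing(phi)


if __name__ == "__main__":
    # Two tasks, two features, one class: both tasks use only feature 0.
    phi_shared = [[[1.0], [0.0]], [[2.0], [0.0]]]
    print(reg_l1(phi_shared), reg_sharing(phi_shared))
```

Under this assumption, the entropy is zero when all tasks concentrate their weight on a single feature and reaches log M when the weight mass is spread evenly over M features, so minimizing it pushes tasks to reuse the same features.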

Formula formula_4: $\min_{\theta} \; \mathbb{E}_{t}\left[\mathcal{L}_{\mathrm{outer}}\big(f_{\phi^*}(g_{\theta}(x_t^{Q})),\, y_t^{Q}\big)\right]$, (4)

Formula formula_5: $\min_{\phi} \; \frac{1}{T} \sum_{t} \mathcal{L}_{\mathrm{inner}}(\hat{y}_t^{U}, y_t^{U}) + \mathrm{Reg}(\phi)$, (5)
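The inner objective in Eq. (5) can be evaluated as in the following Python sketch. It assumes, purely for illustration, that the inner loss is a cross-entropy averaged over the samples of each task and that the value of Reg(ϕ) has already been computed and is passed in as a number; the function names are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class (assumed stand-in for L_inner)."""
    return -math.log(max(probs[label], 1e-12))


def inner_objective(task_preds, task_labels, reg_value):
    """Eq. (5): (1 / T) * sum_t L_inner(y_hat_t, y_t) + Reg(phi).

    task_preds[t] is a list of predicted class-probability vectors for task t,
    task_labels[t] the matching list of integer labels.
    """
    T = len(task_preds)
    total = 0.0
    for preds, labels in zip(task_preds, task_labels):
        losses = [cross_entropy(p, y) for p, y in zip(preds, labels)]
        total += sum(losses) / len(losses)  # per-task mean loss
    return total / T + reg_value


if __name__ == "__main__":
    preds = [[[0.5, 0.5]], [[1.0, 0.0]]]  # two tasks, one sample each
    labels = [[0], [0]]
    print(inner_objective(preds, labels, reg_value=0.1))
```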


Formula formula_7: 1. sufficiency: $f_t^* = f_t^*|_{S_t}$ for $S_t \sim p(S)$; 2. minimality: $\nexists\, S' \neq S_t \subset S$ s.t. $f_t^*|_{S'} = f_t^*$

Formula formula_8: $p(S \cap S' = \{i\}) > 0$ or $p(\{i\} \in (S \cup S') \setminus (S' \cap S)) > 0$.
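The condition above can be checked on a toy example with the following Python sketch. It assumes a finite list of task supports, each drawn with positive probability, and that S and S' are two independent draws from that list; the helper name `factor_is_covered` is hypothetical. The second clause is read literally as "factor i lies in the symmetric difference of S and S'".

```python
from itertools import product


def factor_is_covered(i, supports):
    """Check the two-clause condition for factor i over all pairs (S, S').

    ASSUMPTION: every set in `supports` has positive probability and the
    pair (S, S') is formed from two independent draws.
    """
    for S, S2 in product(supports, repeat=2):
        if S & S2 == {i}:                 # intersection is exactly {i}
            return True
        if i in (S | S2) - (S & S2):      # i in the symmetric difference
            return True
    return False


if __name__ == "__main__":
    supports = [frozenset({0, 1}), frozenset({1, 2})]
    for factor in range(4):
        print(factor, factor_is_covered(factor, supports))
```

With these two supports, factor 1 satisfies the first clause (the intersection is exactly {1}), factors 0 and 2 satisfy the second, and factor 3 appears in no support and satisfies neither.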
