['1c1', '< Title: Leveraging sparse and shared feature activations for disentangled representation learning', '---', '> Title: Disentangling Representations via Sparse and Shared Feature Activations in Multi-Task Learning', '3c3', '< Abstract: Research on recovering the latent factors of variation of high dimensional data has so far focused on simple synthetic settings. Mostly building on unsupervised and weakly-supervised objectives, prior work missed out on the positive implications for representation learning on real world data. In this work, we propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation. Assuming that each supervised task only depends on an unknown subset of the factors of variation, we disentangle the feature space of a supervised multi-task model, with features activating sparsely across different tasks and information being shared as appropriate. Importantly, we never directly observe the factors of variations, but establish that access to multiple tasks is sufficient for identifiability under sufficiency and minimality assumptions. We validate our approach on six real world distribution shift benchmarks, and different data modalities (images, text), demonstrating how disentangled representations can be transferred to real settings.', '---', '> Abstract: Prior research on disentangled representation learning (DRL) for high-dimensional data has predominantly focused on synthetic environments, limiting its applicability to real-world scenarios. This reliance on unsupervised or weakly-supervised objectives often overlooks the practical benefits of DRL for complex, real-world data. We introduce a novel approach that leverages knowledge from a diversified set of supervised tasks to learn a common, robustly disentangled representation. 
Our method operates within a supervised multi-task learning framework, where each task is assumed to depend on an unknown, sparse subset of the underlying factors of variation. We achieve disentanglement by enforcing sparse feature activations across tasks and promoting maximal information sharing where appropriate. Crucially, our framework achieves identifiability without direct observation of the factors of variation, relying instead on the inherent structure provided by multiple supervised tasks under proposed sufficiency and minimality assumptions. We rigorously validate our approach across six real-world distribution shift benchmarks and diverse data modalities (images, text), showcasing the effective transferability of disentangled representations to practical scenarios.', '6,13c6,13', '< A fundamental question in deep learning is how to learn meaningful and reusable representation from high dimensional data observations [8,75,78,77]. A core area of research pursuing is centered on disentangled representation learning (DRL) [56,8,33] where the aim is to learn a representation which recovers the factors of variations (FOVs) underlying the data distribution. Disentangled representations are expected to contain all the information present in the data in a compact and interpretable structure [46,16] while being independent from a particular task [29]. It has been argued that separating information into interventionally independent factors [78] can enable robust downstream predictions, which was partially validated in synthetic settings [19,58]. Unfortunately, these benefits did not materialize in real world representations learning problems, largely limited by a lack of scalability of existing approaches.', '< In this work we focus on leveraging knowledge from different task objectives to learn better representations of high dimensional data, and explore the link with disentanglement and out-of-distribution (OOD) generalization on real data distributions. 
Representations learned from a large diversity of tasks are indeed expected to be richer and generalize better to new, possibly out-of-distribution, tasks. However, this is not always the case, as different tasks can compete with each other and lead to weaker models. This phenomenon, known as negative transfer [61,91] in the context of transfer learning or task competition [83] in multitask learning, happens when a limited capacity model is used to learn two different tasks that require expressing high feature variability and/or coverage. Aiming to use the same features for different objectives makes them noisy and often increases the sensitivity to spurious correlations [35,27,7], as features can be both predictive and detrimental for different tasks. Instead, we leverage a diverse set of tasks and assume that each task only depends on an unknown subset of the factors of variation. We show that disentangled representations naturally emerge without any annotation of the factors of variations under the following two representation constraints:', '< • Sparse sufficiency: Features should activate sparsely with respect to tasks. The representation is sparsely sufficient in the sense that any given task can be solved using few features.', '< • Minimality: Features are maximally shared across tasks whenever possible. The representation is minimal in the sense that features are encouraged to be reused, i.e., duplicated or split features are avoided.', '< These properties are intuitively desirable to obtain features that (i) are disentangled w.r.t. to the factors of variations underlying the task data distribution (which we also theoretically argue in Proposition 2.1), (ii) generalize better in settings where test data undergo distribution shifts with respect to the training distributions, and (iii) suffer less from problems related to negative transfer phenomena. 
To learn such representations in practice, we implement a meta learning approach, enforcing feature sufficiency and sharing with a sparsity regularizer and an entropy based feature sharing regularizer, respectively, incorporated in the base learner. Experimentally, we show that our model learns meaningful disentangled representations that enable strong generalization on real world data sets. Our contributions can be summarized as follows:', '< • We demonstrate that is possible to learn disentangled representations leveraging knowledge from a distribution of tasks. For this, we propose a meta learning approach to learn a feature space from a collection of tasks while incorporating our sparse sufficiency and minimality principles favoring task specific features to coexist with general features.', '< • Following previous literature, we test our approach on synthetic data, validating in an idealized controlled setting that our sufficiency and minimality principles lead to disentangled features w.r.t. the ground truth factors of variation, as expected from our identifiability result in Proposition 2.1.', '< • We extend our empirical evaluation to non-synthetic data where factors of variations are not known, and show that our approach generalizes well out-of-distribution on different domain generalization and distribution shift benchmarks.', '---', '> Learning meaningful and reusable representations from high-dimensional data is a central challenge in deep learning [8,75,78,77]. Disentangled Representation Learning (DRL) [56,8,33] addresses this by aiming to recover the underlying factors of variation (FOVs) of the data distribution. Ideally, disentangled representations capture all relevant information in a compact, interpretable structure [46,16], independent of any single task [29]. 
The separation of information into interventionally independent factors [78] is posited to enable robust downstream predictions, a benefit partially validated in synthetic settings [19,58]. However, these theoretical advantages have yet to fully materialize in real-world representation learning problems, primarily due to the scalability limitations of current DRL approaches.', '> Herein, we address these limitations by leveraging knowledge from diverse supervised task objectives to learn superior representations for high-dimensional data. We specifically investigate the intrinsic link between disentanglement and out-of-distribution (OOD) generalization in real-world data settings. While a broad spectrum of tasks is expected to yield richer, more generalizable representations, this is often hindered by task competition, leading to suboptimal models. This phenomenon, termed negative transfer [61,91] in transfer learning or task competition [83] in multi-task learning, arises when a model with limited capacity attempts to learn disparate tasks requiring high feature variability or coverage. Using shared features for conflicting objectives can introduce noise and amplify sensitivity to spurious correlations [35,27,7], as features may be simultaneously predictive and detrimental across tasks. To mitigate this, we propose a framework that utilizes a diverse set of tasks, each assumed to depend on an unknown, sparse subset of the factors of variation. We demonstrate that disentangled representations naturally emerge without explicit annotation of these factors, guided by two fundamental representation constraints:', '> • Sparse Sufficiency: Features are activated sparsely for each task, ensuring that any given task can be solved using a minimal subset of features. This promotes task-specific relevance and reduces irrelevant feature activations.', '> • Minimality: Features are maximally shared across tasks, preventing duplication and encouraging the reuse of features. 
This leads to a compact and efficient representation by avoiding redundant information.', '> These properties are inherently desirable, yielding features that (i) are disentangled with respect to the underlying factors of variation of the task data distribution (as theoretically supported by Proposition 2.1), (ii) exhibit enhanced generalization capabilities in the presence of distribution shifts, and (iii) effectively mitigate negative transfer phenomena. Practically, we implement these principles within a meta-learning framework, employing a sparsity regularizer for feature sufficiency and an entropy-based regularizer for feature sharing, both integrated into the base learner. Our experiments demonstrate that this model learns meaningful disentangled representations, leading to strong generalization performance on real-world datasets. Our key contributions are:', '> • We propose a novel meta-learning framework that successfully learns disentangled representations by leveraging knowledge from a distribution of diverse tasks. This framework uniquely integrates sparse sufficiency and minimality principles, enabling the coexistence of task-specific and generalizable features.', '> • We rigorously validate our theoretical identifiability result (Proposition 2.1) on synthetic datasets, demonstrating that our sufficiency and minimality principles effectively recover ground-truth factors of variation in idealized, controlled settings.', "> • We significantly extend our empirical evaluation to complex, non-synthetic datasets lacking explicit factor annotations, showcasing our approach's strong out-of-distribution generalization capabilities across various domain generalization and distribution shift benchmarks.", '16c16', '< Given a distribution of tasks t ∼ T and data (x t , y t ) ∼ P t for each task t, we aim to learn a disentangled representation g(x) = ẑ ∈ Ẑ ⊆ R M , which generalizes well to unseen tasks. 
We learn this representation g by imposing the sparse sufficiency and minimality inductive biases.', '---', '> Given a distribution of tasks t ∼ T , each with associated data (x t , y t ) ∼ P t , our objective is to learn a disentangled representation g(x) = ẑ ∈ Ẑ ⊆ R M . This representation is designed to generalize effectively to unseen tasks, and its learning is guided by the explicit imposition of sparse sufficiency and minimality inductive biases.', '19,22c19,22', '< Our architecture (see Figure 1) is composed of a backbone module g θ that is shared across all tasks and a separate linear classification head f ϕt , which is specific to each task t. The backbone is responsible to compute and learn a general feature representation for all classification tasks. The linear head solves a specific classification problem for the task-specific data (x t , y t ) ∼ P t in the feature space Ẑ while enforcing the feature sufficiency and minimality principles. Adopting the typical meta-learning setting [34], the backbone module g θ can be viewed as the meta learner while the task-specific classification heads f ϕt can be viewed as the base learners. In the meta-learning setting we assume to have access to samples for a new task given by a support set U , with elements (x U , y U ) ∈ U . These samples are used to fit the linear head f ϕ * leading to the optimal feature weights for the given task. For a query x Q ∈ Q, the prediction is obtained by computing the forward pass ŷ = f ϕ * (g θ (x Q )).', '< Enforcing feature minimality and sufficiency. To solve a task in the feature space Ẑ of the backbone module we impose the following regularizer Reg(ϕ) on the classification heads f ϕ with parameter ϕ ∈ R T ×M ×C , where T is the number of tasks, M the number of features, and C the number of classes. 
The regularizer is responsible for enforcing the feature minimality and sufficiency ', '< x U g θ f ϕ ẑU g θ ŷU L inner x Q g θ f ϕ * ϕ * ẑQ g θ ŷQ L outer', '< with scalar weights α and β. The penalty terms are defined by:', '---', '> Our architecture (see Figure 1) comprises a shared backbone module g θ and distinct linear classification heads f ϕt , each specific to a task t. The backbone learns a general feature representation common to all classification tasks. Each linear head then solves its respective classification problem within the feature space Ẑ, while simultaneously enforcing the feature sufficiency and minimality principles. Following the standard meta-learning paradigm [34], g θ acts as the meta-learner, and the task-specific f ϕt serve as base learners. For a new task, we assume access to a support set U , containing samples (x U , y U ) ∈ U . These samples are used to fit the optimal linear head f ϕ * for that task. Predictions for a query x Q ∈ Q are then obtained via the forward pass ŷ = f ϕ * (g θ (x Q )).', '> Enforcing feature minimality and sufficiency. To effectively solve tasks within the feature space Ẑ of the backbone module, we introduce a composite regularizer, Reg(ϕ), applied to the classification heads f ϕ . The parameters of these heads are denoted by ϕ ∈ ℝ^{T×M×C}, where T is the number of tasks, M is the number of features, and C is the number of classes. This regularizer is designed to enforce both feature minimality and sufficiency. With scalar weights α and β, the overall regularizer is given by:', '> Reg(ϕ) = α Reg_{L1}(ϕ) + β Reg_{sharing}(ϕ)    (1)', '> The individual penalty terms are defined as:', '27,28c27', '< Section: T C', '< t,c |ϕt,c,m| t,c,m |ϕt,c,m| are the normalized classifier parameters. Sufficiency is enforced by a sparsity regularizer given by the L 1 -norm, which constrains classification head to use only a sparse subset of the features. 
Minimality is enforced by the feature sharing term: minimizing the entropy of the distribution of feature importances (i.e. normalized |ϕ t |) averaged across a mini batch of T tasks, leads to a more peaked distribution of activations across tasks. This forces features to cluster across tasks and therefore be reused by different tasks, when useful.We remark that different choices for the regularizers coming from the linear multitask learning literature (e.g. [59,39,38]) to enforce sparse sufficiency and minimality are indeed possibile. We leave their exploration as a future direction.', '---', '> Here, φm = (1/T) Σ t,c |ϕ t,c,m| / Σ t,c,m |ϕ t,c,m| represents the normalized importance of feature m, averaged across tasks. Sparse sufficiency is enforced by the L1-norm regularizer, which compels the classification head to utilize only a sparse subset of features. Minimality is enforced by the feature sharing term: minimizing the entropy of the distribution of feature importances (i.e., normalized |ϕ t |) averaged across a mini-batch of T tasks yields a more peaked distribution of activations across tasks. This mechanism promotes feature clustering across tasks, encouraging their reuse when beneficial. We note that alternative regularizers from the linear multi-task learning literature (e.g., [59,39,38]) could also enforce sparse sufficiency and minimality; exploring these remains a promising avenue for future work.', '31c30', '< We train the model in meta-learning fashion by minimizing the test error over the expectation of the task distribution t ∼ T . This can be formalized as a bi-level optimization problem. The optimal backbone model g θ * is given by the outer optimization problem:', '---', '> We train our model using a meta-learning approach, minimizing the expected test error across the task distribution t ∼ T . This process is formulated as a bi-level optimization problem. 
The optimal backbone model g θ * is determined by the outer optimization problem:', '33c32', '< where f ϕ * are the optimal classifiers obtained from solving the inner optimization problem, and (x Q t , y Q t ) ∈ Q t are the test (or query) datum from the query set Q t for task t. Let U t be the support set with samples (x U t , y U t ) ∈ U for task t, where typically the support set is distinct from the query set, i.e., U ∩ Q = ∅. The optimal classifiers f ϕ * are given by the inner optimization problem:', '---', '> where f ϕ * denotes the optimal classifiers derived from the inner optimization problem. Here, (x Q t , y Q t ) ∈ Q t are the test (or query) data from the query set Q t for task t. The optimal classifiers f ϕ * are in turn obtained by solving the inner optimization problem on a support set U t with samples (x U t , y U t ) ∈ U t for task t. Typically the support set is disjoint from the query set (i.e., U t ∩ Q t = ∅). The inner problem is defined as:', '35,38c34,36', '< where ŷU t = f ϕ (g θ (x U t ). For both the inner loss L inner and outer loss L outer we use the cross entropy loss.', '< Task generation. Our method can be applied in a standard supervised classification setting where we construct the tasks on the fly as follows. We define a task t as a C-way classification problem. We first select a random subset of C classes from a training domain D train which contains K train classes. For each class we consider the corresponding data points and select a random support set U t with elements (x U t , y U ) ∈ U and a disjoint random query set', '< Q t with elements (x Q t , y Q ) ∈ Q t . Algorithm.', '< In practice we solve the bi-level optimization problem ( 4) and ( 5) as follows. In each iteration we sample a batch of T tasks with the associated support and query set as described above. 
First, we use the samples from the support set S t to fit the linear heads f ϕ by solving the inner optimization problem (5) using stochastic gradient descent for a fixed number of steps. Second, we use the samples from the query set Q t to update the backbone g θ by solving the outer optimization problem (4) using implicit differentiation [11,31]. Since the optimal solution of the linear heads ϕ * depend on the backbone g θ , a straightforward differentiation w.r.t. θ is not possible. We remedy this issue by using the approximation strategy of [28] to compute the implicit gradients. The algorithm is summarized in section B.1 of the Appendix.', '---', '> where ŷU t = f ϕ (g θ (x U t )). For both the inner loss L inner and the outer loss L outer we use the cross-entropy loss.', '> Task generation. Our method is applicable to standard supervised classification settings, where tasks are constructed dynamically. We define each task t as a C-way classification problem. Initially, a random subset of C classes is selected from a training domain D train , which encompasses K train classes. For each chosen class, we sample corresponding data points to form a random support set U t , with elements (x U t , y U t ) ∈ U t , and a disjoint random query set Q t , with elements (x Q t , y Q t ) ∈ Q t .', '> Algorithm. In practice, we solve the bi-level optimization problem defined by (4) and (5) iteratively. In each iteration, a batch of T tasks is sampled, along with their respective support and query sets. First, the linear heads f ϕ are fitted using samples from the support set U t by solving the inner optimization problem (5) via stochastic gradient descent for a fixed number of steps. Second, the backbone g θ is updated using samples from the query set Q t by solving the outer optimization problem (4). 
This outer optimization employs implicit differentiation [11,31], as the optimal solution of the linear heads ϕ * is dependent on the backbone g θ , precluding direct differentiation with respect to θ. We address this dependency by utilizing the approximation strategy from [28] to compute the implicit gradients. A comprehensive summary of the algorithm is provided in Section B.1 of the Appendix.', '41,43c39,43', '< We analyze the implications of the proposed minimality and sparse sufficiency principles and show in a controlled setting that they indeed lead to identifiability. As outlined in Figure 2, we assume that there exists a set of independent latent factors z ∼ d i=1 p(z i ) that generate the observations via an unknown mixing function x = g * (z). Additionally, we assume that the labels y t for a task t only depend on a subset of the factors indexed by S t ∼ P (S), where S is an index set on z ∈ Z, via some unknown mixing function y t = f * t (z) (potentially different for different tasks). We formalize the two principles that are imposed on f * by: 1. sufficiency:', '< f * t = f * t | St for S t ∼ p(S) 2. minimality: ̸ ∃S ′ ̸ = S t ⊂ S s.t. f * t | S ′ = f * t ,', '< where f | St denotes that the input to a function f is restricted to the index set given by S t (all remaining entries are set to zero). ( 1) states that f * t only uses a subset of features, and (2) states that there are not be duplicate features. Proposition 2.1. Assume that g * is a diffeomorphism (smooth with smooth inverse), f * satisfies the sufficiency and minimality properties stated above, and p(S) satisfies:', '---', '> We present a theoretical analysis of the proposed minimality and sparse sufficiency principles, demonstrating their role in achieving identifiability within a controlled setting. As depicted in Figure 2, we assume the existence of a set of independent latent factors z ∼ ∏_{i=1}^{d} p(z_i) that generate observations x through an unknown mixing function x = g * (z). 
Furthermore, we assume that the labels y t for a task t depend solely on a subset of these factors, indexed by S t ∼ p(S) (where S is an index set on z ∈ Z), via some unknown mixing function y t = f * t (z) (which may vary across tasks). We formalize the two principles imposed on f * as follows: 1. Sufficiency:', '> f * t = f * t | S t for S t ∼ p(S)', '> 2. Minimality: ∄ S ′ ≠ S t ⊂ S s.t. f * t | S ′ = f * t ,', '> where f | S t indicates that the input to function f is restricted to the index set S t (with all other entries set to zero). Principle (1) states that f * t uses only a subset of the features, while (2) states that there are no duplicate features.', '> Proposition 2.1. Assume that g * is a diffeomorphism (i.e., smooth with a smooth inverse), f * satisfies the aforementioned sufficiency and minimality properties, and p(S) satisfies:', '45,47c45,47', '< Observing unlimited data from p(X, Y ), it is possible to recover a representation ẑ that is an axis aligned, component wise transformation of z.', '< Remarks: Overall, we see this proposition as validation that in an idealized setting our inductive biases are sufficient to recover the factors of variation. Note that the proof is non-constructive and does not entail a specific method. In practice, we rely on the same constraints as inductive biases that lead to this theoretical identifiability and experimentally show that disentangled representations emerge in controlled synthetic settings. On real data, (1) we cannot directly measure disentanglement, (2) a notion of global ground-truth factors may even be ill-posed, and (3) the assumptions of Proposition 2.1 are likely violated. Still, sparse sufficiency and minimality yield some meaningful factorization of the representation for the considered tasks.', '< Relation to [47] and [58]: Our theoretical result can be reconnected with concurrent work [47] and can be seen as a corollary with a different proof technique and slightly relaxed assumptions. 
The main difference is that our feature minimality allows us to also cover the case where the number of factors of variations is unknown, which we found critical in real world data sets (the main focus of our paper). Instead, they only assume sparse sufficiency, which is enough for identifiability if the ground-truth number of factors is known, but is not enough to recover high disentaglement when this is not the case (see Figure 3) and does not translate well to real data, see Table 16 with the empirical comparison in Appendix D.8. Interestingly, their analysis also hints at the fact that our approach also benefits in terms of sample complexity on transfer learning downstream tasks. Our proof technique follows the general construction developed for multi-view data in [58], adapted to our different setting. Instead of observing multiple views with shared factors of variation, we observe a single task that only depend on a subset of the factors.', '---', '> Under these conditions, by observing unlimited data from p(X, Y ), it is possible to recover a representation ẑ that is an axis-aligned, component-wise transformation of z.', '> Remarks: This proposition serves as a crucial theoretical validation, demonstrating that in an idealized setting, our inductive biases are sufficient for recovering the factors of variation. It is important to note that the proof is non-constructive and does not prescribe a specific method. In practice, we leverage these same constraints as inductive biases, experimentally showing the emergence of disentangled representations in controlled synthetic environments. For real-world data, direct disentanglement measurement is challenging, a global notion of ground-truth factors might be ill-posed, and the assumptions of Proposition 2.1 are likely to be violated. 
Nevertheless, sparse sufficiency and minimality consistently yield meaningful factorizations of the representation for the tasks under consideration.', '> Relation to [47] and [58]: Our theoretical result aligns with concurrent work [47] and can be viewed as a corollary derived using a distinct proof technique and slightly relaxed assumptions. A key differentiator is our incorporation of feature minimality, which uniquely enables our framework to handle scenarios where the number of factors of variation is unknown—a critical aspect for real-world datasets, which form the primary focus of this paper. In contrast, [47] relies solely on sparse sufficiency, which, while sufficient for identifiability when the ground-truth number of factors is known, proves inadequate for achieving high disentanglement in its absence (see Figure 3) and translates poorly to real-world data (as shown in Table 16 and the empirical comparison in Appendix D.8). Intriguingly, their analysis also suggests that our approach offers benefits in terms of sample complexity for downstream transfer learning tasks. Our proof technique extends the general construction developed for multi-view data in [58], adapting it to our unique setting where we observe a single task dependent on a subset of factors, rather than multiple views with shared factors of variation.', '50,52c50,52', '< Learning from multiple tasks and domains. Our method addresses the problem of learning a general representation across multiple and possibly unseen tasks [15,103] and environments [105,32,44,97,63,94,64] that may be competing with each other during training [61,91,83]. Prior research tackled task competition by introducing task specific modules that do not interact during training [67,101,80]. While successfully learning specialized modules, these approaches can not leverage synergistic information between tasks, when present. 
On the other hand, our approach is closer to multi-task methods that aim at learning a generalist model, leveraging multi-task interactions [106,5]. Other approaches that leverage a meta-learning objective for multi-task learning have been formulated [18,81,50,9]. In particular, [50] proposes to learn a generalist model in a few-shot learning setting without explicitly favoring feature sharing, nor sparsity. Instead, we rephrase the multi-task objective function encoding both feature sharing and sparsity to avoid task competition.', '< Similar to prior work in domain generalization, we assume the existence of stable features for a given task [64,4,86,40,90] and amortize the learning over the multiple environments. Differently than prior work, we do not aim to learn an invariant representation a priori. Instead, we learn sufficient and minimal features for each task, which are selected at test time fitting the linear head on them. In light of [32], one can interpret our approach as learning the final classifier using empirical risk minimization but over features learned with information from the multiple domains.', '< Disentangled representations. Disentanglement representation learning [8,33] aims at recovering the factors of variations underlying a given data distribution. [56] proved that without any form of supervision (whether direct or indirect) on the Factors of Variation (FOV) is not possible to recover them. Much work has then focused on identifiable settings [58,25] from non-i.i.d. data, even allowing for latent causal relations between the factors. Different approaches can be largely grouped in two categories. First, data may be non-independently sampled, for example assuming sparse interventions or a sparse latent dynamics [30,55,13,100,2,79,48]. Second, data may be non-identically distributed, for example being clustered in annotated groups [37,41,82,95,60]. 
Our method follows the latter, but we do not make assumptions on the factor distribution across tasks (only their relevance in terms of sufficiency and minimality). This is also reflected in our method, as we train for supervised classification as opposed to contrastive or unsupervised learning as common in the disentanglement literature. The only exception is the work of [47] discussed in Section 2.3.', '---', '> Learning from multiple tasks and domains. Our method addresses the critical challenge of learning a generalizable representation across diverse and potentially unseen tasks [15,103] and environments [105,32,44,97,63,94,64], particularly when these tasks exhibit competition during training [61,91,83]. Previous research has attempted to resolve task competition by employing task-specific modules that operate independently during training [67,101,80]. While these methods effectively learn specialized modules, they often fail to leverage synergistic information that might exist between tasks. In contrast, our approach aligns more closely with multi-task methods that aim to learn a generalist model by explicitly exploiting multi-task interactions [106,5]. While other meta-learning objectives have been proposed for multi-task learning [18,81,50,9], notably [50] learns a generalist model in a few-shot setting without explicitly promoting feature sharing or sparsity. Our work distinguishes itself by rephrasing the multi-task objective function to intrinsically encode both feature sharing and sparsity, thereby directly mitigating task competition.', '> Similar to prior work in domain generalization, we posit the existence of stable features for a given task [64,4,86,40,90] and amortize learning across multiple environments. However, unlike conventional approaches, we do not aim to learn an invariant representation a priori. Instead, our method learns sufficient and minimal features for each task, which are adaptively selected at test time by fitting a linear head. 
Following the perspective of [32], our approach can be interpreted as learning the final classifier through empirical risk minimization, but operating on features enriched with information from multiple domains.', '> Disentangled representations. Disentangled Representation Learning (DRL) [8,33] is fundamentally concerned with recovering the latent factors of variation that govern a given data distribution. A foundational result by [56] established that without some form of supervision (direct or indirect) on these Factors of Variation (FOVs), their recovery is generally impossible. Consequently, much subsequent work has shifted towards identifiable settings [58,25], often leveraging non-i.i.d. data and even accommodating latent causal relations between factors. These approaches typically fall into two broad categories: (1) methods where data is non-independently sampled, often assuming sparse interventions or sparse latent dynamics [30,55,13,100,2,79,48]; and (2) methods where data is non-identically distributed, such as being clustered into annotated groups [37,41,82,95,60]. Our method aligns with the second category, but crucially, we make no explicit assumptions about the factor distribution across tasks; instead, we focus solely on their relevance in terms of sparse sufficiency and minimality. This design choice is further reflected in our supervised classification training objective, which contrasts with the more common contrastive or unsupervised learning paradigms in the disentanglement literature. The work of [47], discussed in Section 2.3, represents a notable exception.', '55,57c55,57', '< We start by highlighting here the experimental setup of this paper along with its motivation. Synthetic experiments. We first evaluate our method on benchmarks from the disentanglement literature [62,14,71,49] where we have access to ground-truth annotations and we can assess quantitatively how well we can learn disentangled representations. 
We further investigate how minimality and feature sharing are correlated with disentanglement measures (Section 4.1) and how well our representations, which are learned from a limited set of tasks, generalize their composition. The purpose of these experiments is to validate our theoretical statement, showing that if the assumptions of Proposition 2.1 hold, our methods quantitatively recover the factors of variation. Domain generalization. On real data sets, we can neither quantitatively measure disentanglement nor are we guaranteed identifiability (as assumptions may be violated). Ultimately, the goal of disentangled representations is to learn features that lend themselves to be easily and robustly transferred to downstream tasks. Therefore, we first evaluate the usefulness of our representations with respect to downstream tasks subject to distribution shifts, where isolating spurious features was found to improve generalization in synthetic settings [19,58] To assess how robust our representations are to distribution shifts, we evaluate our method on domain generalization and domain shift tasks on six different benchmarks (Section 4.2). In a domain generalization setting, we do not have access to samples coming from the testing domain, which is considered to be OOD w.r.t. to the training domains. However, in order to solve a new task, our method relies on a set labeled data at test time to fit the linear head on top of the feature space. Our strategy is to sample data points from the training distribution, balanced by class, assuming that the label set Y does not change in the testing domain, although its distribution may undergo subpopulation shifts.', '< Few-shot transfer learning. Lastly, we test the adaptability of the feature space to new domains with limited labeled samples. For transfer learning tasks, we fit a linear head using the available limited supervised data. 
The sparsity penalty α is set to the value used in training; the feature sharing parameter β is defaulted to zero unless specified.', '< Experimental setting. To have a fair comparison with other methods in the literature, we adopt the standard experimental setting of prior work [32,44]. Hyperparameters α and β are tuned performing model selection on validation set, unless specified otherwise. For comparison with baselines, we substitute our backbone with that of the baseline (e.g. for ERM models, we detach the classification head) and then fit a new linear head on the same data. The linear head module trained at test time on top of the features is the same both for our and compared methods. Despite its simplicity, we report the ERM baseline for comparison in our experiments in the main paper, since it has been shown to perform best in average on domain generalization benchmarks [32,44]. We further compare with other consolidated approaches in the literature such as IRM [4], CORAL [85] and GroupDRO [73] and include a large and comprehensive comparison with [99,10,51,53,26,54,65,102,36,45] in AppendixD. 4. Experimental details are fully described in Appendix C.', '---', '> This section details our experimental setup and its underlying motivations. Synthetic experiments. We begin by evaluating our method on established disentanglement benchmarks [62,14,71,49], which provide ground-truth annotations. This allows for a quantitative assessment of our ability to learn disentangled representations. We further investigate the correlation between minimality, feature sharing, and disentanglement measures (Section 4.1), as well as the compositional generalization capabilities of representations learned from a limited set of tasks. These experiments serve to validate our theoretical claims, demonstrating that under the assumptions of Proposition 2.1, our methods quantitatively recover the true factors of variation. Domain generalization. 
For real-world datasets, where ground-truth disentanglement cannot be quantitatively measured and identifiability is not guaranteed (due to potential assumption violations), we shift our focus to the practical utility of disentangled representations. The ultimate goal is to learn features that transfer easily and robustly to downstream tasks. Thus, we first evaluate the efficacy of our representations on downstream tasks subject to distribution shifts, a context where isolating spurious features has been shown to enhance generalization in synthetic settings [19,58]. To assess the robustness of our representations to such shifts, we evaluate our method on domain generalization and domain shift tasks across six distinct benchmarks (Section 4.2). In a domain generalization scenario, we operate without access to samples from the testing domain, which is considered out-of-distribution (OOD) relative to the training domains. However, to solve a new task, our method relies on a small set of labeled data at test time to fit a linear head atop the learned feature space. Our strategy involves sampling class-balanced data points from the training distribution, assuming the label set Y remains constant in the testing domain, even if its distribution experiences subpopulation shifts.', '> Few-shot transfer learning. Finally, we assess the adaptability of our learned feature space to novel domains with limited labeled samples. For these transfer learning tasks, a linear head is fitted using the available sparse supervised data. The sparsity penalty α is kept consistent with its training value, while the feature sharing parameter β defaults to zero unless explicitly stated.', '> Experimental setting. To ensure a fair comparison with existing literature, we adhere to the standard experimental setup outlined in prior work [32,44]. Hyperparameters α and β are optimized via model selection on a validation set, unless otherwise specified. 
For baseline comparisons, we replace our backbone with that of the baseline method (e.g., detaching the classification head for ERM models) and then fit a new linear head on the same data. Critically, the linear head module trained at test time on top of the features remains identical for both our method and all comparative baselines. Despite its simplicity, the ERM baseline is included in our main paper experiments due to its demonstrated strong average performance on domain generalization benchmarks [32,44]. We also provide comparisons with other established approaches such as IRM [4], CORAL [85], and GroupDRO [73], with a more extensive comparison against [99,10,51,53,26,54,65,102,36,45] detailed in Appendix D.4. Comprehensive experimental details are provided in Appendix C.', '60,64c60,64', '< We start by demonstrating that our approach is able to recover the factors of variation underlying a synthetic data distribution like [62]. For these experiments, we assume to have partial information on a subset of factors of variation Z, and we aim to learn a representation ẑ that aligns with them while ignoring any spurious factors that may be present. We sample random tasks from a distribution T (see Appendix C.3 for details) 5and focus on binary tasks, with Y = {0, 1}. For the DSprites dataset an example of valid task is "There is a big object on the left of the image". In this case, the partially observed factors (quantized to only two values) are the x position and size. In Table 1, we show how the feature sufficiency and minimality properties enable disentanglement in the learned representations. We train two identical models on a random distribution of sparse tasks defined on FOVs, showing that, for different datasets [62,14,49,71], the same model without regularizers achieves a similar in-distribution (ID) accuracy, but a much lower disentanglement. 
[47] Figure 3: Role of minimality: We plot the DCI metric of a set of models (red dots) trained on fixed tasks from DSprites: Training without regularizers leads to no disentanglement (green). Enforcing sparsity alone (yellow, akin to [47]) achieves good disentanglement (DCI = 71.9), but some features may be split or duplicated. Enforcing both minimality and sparse sufficiency (magenta) attains the best DCI (98.8). When β is too high (> 0.25) activated features collapses into few clusters with respect to tasks. For complete results and experiments on additional datasets see Table 8 and Figures 6,7 in Appendix.', '< We then randomly draw and fix 2 groups of tasks with supports S 1 , S 2 (18 in total), which all have support on two FOVs, |S 1 | = |S 2 | = 2. The groups share one factor of variation and differ in the other one, i.e. S 1 ∩ S 2 = {i} for some {i} ∈ Z. The data in these tasks are subject to spurious correlations, i.e. FOVs not in the task support may be spuriously correlated with the task label. We start from an overestimate of the dimension of z of 6, trying to recover z of size 3. We train our network to solve these tasks, enforcing sufficiency and minimality on the representation with different regularization degrees. In Figure 3, we show how the alignment of the learned features with the ground truth factors of variations depend on the choice of α, β, going from no disentanglement (DCI = 27.8). to good alignment as we enforce more sufficiency and minimality. The model that attains the best alignment (DCI = 98.8) uses both sparsity and feature sharing. Sufficiency alone (akin to the method of [47]) is able to select the right support for each task, but features are split or duplicated, attaining lower disentanglement (DCI = 71.9). The feature sharing penalty ensures clustering in the feature space w.r.t. 
tasks, ensuring to reach high disentanglement, although it may result in the failure cases, when β is too high (β > 0.25).', '< Table 1: Enforcing disentanglement: DCI [22] disentanglement scores and ID accuracy on test samples for a model trained without enforcing sufficiency and minimality (top row), and model with the regularizers activated (bottom row). While attaining similar performance on accuracy, the model with the activated regularizer always show higher disentanglement. See Table 7 for additional scores.  9 in Appendix.', '< Disentanglement and minimality are correlated. In the synthetic setting, we also show the role of the feature sharing penalty. Minimizing the entropy of feature activations across mini-batches of tasks results in clusters in the feature space. We investigate how the strength of this penalty correlates well with disentanglement metrics [22] training different models on Dsprites which differ by the value of β. For 15 models trained increasing β from 0 to 0.2 linearly, we observe a correlation coefficient with the DCI metric associated to representations compute by each model of 94.7, showing that the feature sharing property strongly encourages disentanglement. This confirms again that sufficiency alone (i.e. enforcing sparsity) is not enough to attain good disentanglement.', '< Task compositional generalization. Finally, we evaluate the generalization capabilities of the features learned by our method by testing our model on a set of unseen tasks obtained by combining tasks seen during training. To do this, we first train two models on the AbstractDSprites dataset using a random distribution of tasks, where we limit the support of each task to be within 2 (i.e. |S| = 2). The models differ in activating/deactivating the regularizers on the linear heads. 
Then, we test on 100 tasks drawn from a distribution with increasing support on the factors of variation (|S| = 3, |S| = 4, |S| = 5), which correspond to composition of tasks in the training distribution; see Figure 4, with the accompaning Table 9 in Appendix D.', '---', '> We begin by demonstrating our approach\'s capability to recover the underlying factors of variation in synthetic data distributions, consistent with [62]. In these experiments, we assume partial information about a subset of factors of variation Z, and our objective is to learn a representation ẑ that aligns with these factors while effectively disregarding any spurious ones. We sample random tasks from a distribution T (detailed in Appendix C.3), focusing on binary classification problems where Y = {0, 1}. For instance, on the DSprites dataset, a valid task could be "Is there a big object on the left of the image?". In this specific case, the partially observed factors (quantized to two values) are the x-position and size. Table 1 illustrates how our proposed feature sufficiency and minimality properties facilitate disentanglement in the learned representations. We trained two identical models on a random distribution of sparse tasks defined on FOVs. For various datasets [62,14,49,71], the model trained without regularizers achieved comparable in-distribution (ID) accuracy but significantly lower disentanglement. Figure 3, a visual representation, further highlights the critical role of minimality. It plots the DCI metric for models trained on fixed DSprites tasks: training without regularizers results in no disentanglement (green); enforcing sparsity alone (yellow, akin to [47]) yields good disentanglement (DCI = 71.9), though features may still be split or duplicated. Crucially, enforcing both minimality and sparse sufficiency (magenta) achieves the best DCI (98.8). However, when β is excessively high (> 0.25), activated features tend to collapse into a few clusters across tasks. 
Full results and experiments on additional datasets are provided in Table 8 and Figures 6,7 in the Appendix.', '> Subsequently, we randomly select and fix two groups of tasks with supports S 1 and S 2 (18 tasks in total). Each task in these groups is supported by two FOVs, i.e., |S 1 | = |S 2 | = 2. These groups share one factor of variation but differ in the other, such that S 1 ∩ S 2 = {i} for a single factor i. The data within these tasks are subject to spurious correlations: FOVs outside the task support may be spuriously correlated with the task label. We start from an overestimated dimension of 6 for z, aiming to recover a true z of size 3. Our network is trained to solve these tasks by enforcing sufficiency and minimality on the representation, varying the degrees of regularization through α and β. Figure 3 shows how the alignment of learned features with ground-truth factors of variation depends on the choice of α and β. Performance ranges from negligible disentanglement (DCI = 27.8) to strong alignment as sufficiency and minimality are increasingly enforced. The model achieving the highest alignment (DCI = 98.8) uses both sparsity and feature sharing. While sufficiency alone (similar to [47]) can correctly identify the task-relevant support, it leaves features split or duplicated, resulting in lower disentanglement (DCI = 71.9). The feature sharing penalty promotes clustering in the feature space with respect to tasks, which leads to high disentanglement. However, an excessively high β (β > 0.25) can lead to failure cases where features over-cluster.', '> Table 1: Enforcing disentanglement. Table 1 presents DCI [22] disentanglement scores and in-distribution (ID) accuracy on test samples. It compares a model trained without enforcing sufficiency and minimality (top row) against a model with the regularizers activated (bottom row). 
While both models achieve similar accuracy, the model with activated regularizers always shows higher disentanglement. Refer to Table 7 for additional scores.', '> Disentanglement and minimality are correlated. In the synthetic setting, we further illustrate the role of the feature sharing penalty. Minimizing the entropy of feature activations across mini-batches of tasks induces clustering within the feature space. We investigated how the strength of this penalty correlates with disentanglement metrics [22] by training 15 models on the DSprites dataset that differ only in β, which increases linearly from 0 to 0.2. We observed a correlation coefficient of 0.947 between β and the DCI metric of the representations computed by each model, indicating that the feature sharing property strongly promotes disentanglement. This finding further supports that sufficiency alone (i.e., solely enforcing sparsity) is not enough to attain good disentanglement.', "> Task compositional generalization. We conclude our synthetic experiments by evaluating the compositional generalization capabilities of the features learned by our method. This involves testing our model on a set of unseen tasks obtained by combining tasks seen during training. To achieve this, we first trained two models on the AbstractDSprites dataset using a random distribution of tasks, with each task's support limited to two factors (|S| = 2). The models differed only in whether the regularizers on their linear heads were activated or deactivated. Subsequently, we evaluated these models on 100 tasks sampled from a distribution with progressively increasing support on the factors of variation (|S| = 3, |S| = 4, |S| = 5). These tasks correspond to compositions of the tasks seen during training. 
The results are visualized in Figure 4, with accompanying detailed values presented in Table 9 in Appendix D.", '67,68c67,68', '< In this section we evaluate our method on benchmarks coming from the domain generalization field [32,93,70] and subpopulation distribution shifts [73,44], to show that a feature space learned with our inductive biases performs well out of real world data distribution. Subpopulation shifts. Subpopulation shifts occur when the distribution of minority groups changes across domains. Our claim is that a feature space that satisfies sparse sufficiency and minimality is more robust to spurious correlations which may affect minority groups, and should transfer better to new distributions. To validate this, we test on two benchmarks Waterbirds [73], and CivilComments [44] (see Appendix C.1).', '< For both, we use the train and test split of the original dataset. In Table 4, last row, we report the results on the test set of Waterbirds for the different groups in the dataset (landbirds on land, landbirds on water, waterbirds on land, and waterbirds on water, respectively). We fit the linear head on a random subset of the training domain, balanced by class, repeat 10 times and report accuracy and standard deviation on test. For CivilComments we report the average and worst accuracy in Figure 5, where we compare with ERM and groupDRO [73]. While performing almost on par w.r.t. ERM, our method is more robust to spurious correlation in the dataset, showing the higher worst group accuracy. Importantly, we outperform GroupDRO, which uses information on the subdomain statistics, while we do not assume any prior knowledge about them. Results per group are reported in the Appendix (Table 11).  Camelyon17. The model is trained according to the original splits in the dataset. In Table 3 we report the accuracy of our model on in-distribution and OOD splits, compared with different baselines [84,4]. 
Our method shows the best performance on the OOD test domains. The intuition of why this happens is that, due to minimality, we retain more features which are shared across the three training domains, giving less importance to the ones that are domain-specific (which contain the spurious correlations with the hospital environmental informations). This can be further enforced at test time, as we show in the ablation in Appendix D.9, trading off in distribution performance for OOD accuracy. We finally show the ability of features learned with our method to adapt to a new domain with a small number of samples in a few-shot setting. We compare the results with ERM in Table 2, averaged by domains in each benchmark dataset. The full scores for each domain are in Appendix D.5 for 1-shot, 5-shot, and 10-shot setting, reporting the mean accuracy and standard deviations over 100 draws. Our approach achieves consistently higher accuracy than ERM, showing the better adaptation capabilities of our minimal and sufficently sparse feature space.', '---', '> In this section, we evaluate our method on benchmarks from the domain generalization field [32,93,70] and those involving subpopulation distribution shifts [73,44]. Our objective is to demonstrate that a feature space learned with our proposed inductive biases exhibits strong performance on out-of-distribution real-world data. Subpopulation shifts. Subpopulation shifts manifest when the distribution of minority groups varies across different domains. We hypothesize that a feature space satisfying sparse sufficiency and minimality is inherently more robust to spurious correlations that disproportionately affect minority groups, thereby facilitating superior transferability to novel distributions. To validate this claim, we conducted experiments on two established benchmarks: Waterbirds [73] and CivilComments [44] (see Appendix C.1 for details).', "> For both datasets, we utilized the original train and test splits. 
Table 4 (last row) presents the results on the Waterbirds test set for the different groups in the dataset (landbirds on land, landbirds on water, waterbirds on land, and waterbirds on water, respectively). The linear head was fitted on a random, class-balanced subset of the training domain, with the process repeated 10 times to report mean accuracy and standard deviation on the test set. For CivilComments, Figure 5 displays both the average and worst-group accuracy, comparing our method against ERM and GroupDRO [73]. While achieving performance almost on par with ERM in terms of average accuracy, our method is more robust to spurious correlations, as shown by its higher worst-group accuracy. Importantly, we outperform GroupDRO, which explicitly leverages subdomain statistics, whereas our approach operates without any prior knowledge of them. Detailed per-group results are provided in the Appendix (Table 11). Camelyon17. For the Camelyon17 dataset, the model is trained according to the original data splits. Table 3 presents our model's accuracy on both in-distribution (ID) and out-of-distribution (OOD) splits, compared with different baselines [84,4]. Our method achieves the best performance on the OOD test domains. The intuition is that, due to minimality, the model retains more features shared across the three training domains and gives less importance to domain-specific features (which contain the spurious correlations with hospital environmental information). This effect can be further enforced at test time, as shown in the ablation study in Appendix D.9, by trading off some in-distribution performance for improved OOD accuracy. Finally, we show the ability of features learned with our method to adapt to a new domain with a small number of samples in a few-shot setting. 
We compare the results with ERM in Table 2, averaged by domains in each benchmark dataset. The full scores for individual domains, including mean accuracy and standard deviations over 100 draws for 1-shot, 5-shot, and 10-shot settings, are provided in Appendix D.5. Our approach consistently yields higher accuracy than ERM, showing the better adaptation capabilities of our minimal and sufficiently sparse feature space.", '71c71', '< In Appendix D we report a large collection of additional results, including comparison with 14 baseline methods on the domain shift benchmarks (D.4), a qualitative and quantitative analysis on the minimality and sparse sufficiency properties in the real setting (D.2), a favorable additional comparison on meta learning benchmarks, with 6 other baselines including [47](D.8), an ablation study on the effect of clustering features at test time (D.9), and a demonstration on the possibility to obtain a task similarity measure as a consequence of our approach (D.7).', '---', '> Appendix D provides a large collection of supplementary results. This includes a comparison with 14 baseline methods on the domain shift benchmarks (D.4), a qualitative and quantitative analysis of the minimality and sparse sufficiency properties in the real setting (D.2), and a favorable additional comparison on meta-learning benchmarks with 6 other baselines, including [47] (D.8). Furthermore, an ablation study on the effect of clustering features at test time is presented (D.9), along with a demonstration of our approach\'s ability to yield a task similarity measure (D.7).', '74,75c74,75', '< In this paper, we demonstrated how to learn disentangled representations from a distribution of tasks by enforcing feature sparsity and sharing. We have shown this setting is identifiable and have validated it experimentally in a synthetic and controlled setting. 
Additionally, we have empirically shown that these representations are beneficial for generalizing out-of-distribution in real-world settings, isolating spurious and domain specific factors that should not be used under distribution shift.', '< Limitations and future work: The main limitation of our work is the global assumption on the strength of the sparsity and feature sharing regularizers α and β across all tasks. In real settings these properties of the representations might need to change for different tasks. We have already observed this in the synthetic setting in Figure 3, where when β > 0.25 features cluster excessively and are unable to achieve clear disentanglement and do not generalize well. Future work may exploit some level of knowledge on the task distribution (e.g. some measure of distance on tasks) in order to tune α, β adaptively during training, or to train conditioning on a distribution of regularization parameters as in [21], enabling more generalization at test time. Another limitation is in the sampling procedure to fit the linear head at test time: sampling randomly from the training set (balanced by class) may not be enough to achieve the best performance under distributions shifts. Alternative sampling procedures, e.g. ones that incorporate knowledge on the distribution shift if available (as in [43]), may lead to better performance at test time.', '---', '> In this paper, we successfully demonstrated a novel approach for learning disentangled representations from a distribution of tasks, achieved by rigorously enforcing feature sparsity and sharing. We established the identifiability of this setting theoretically and validated it extensively through experiments in controlled synthetic environments. 
Furthermore, our empirical results robustly confirm the benefits of these representations for out-of-distribution generalization in real-world settings, specifically by effectively isolating spurious and domain-specific factors that are detrimental under distribution shifts.', '> Limitations and future work: A primary limitation of our current work lies in the global assumption regarding the strength of the sparsity and feature sharing regularizers, α and β, applied uniformly across all tasks. In practical, real-world scenarios, the optimal representation properties might vary significantly for different tasks. As observed in our synthetic experiments (Figure 3), an excessively high β (e.g., > 0.25) can lead to features clustering too aggressively, hindering clear disentanglement and generalization. Future work could address this by exploiting task-specific knowledge (e.g., task distance measures) to adaptively tune α and β during training, or by conditioning training on a distribution of regularization parameters, as in [21], to enhance test-time generalization. Another area for improvement concerns the sampling procedure used to fit the linear head at test time. Random, class-balanced sampling from the training set may not always be optimal for achieving peak performance under diverse distribution shifts. Exploring alternative sampling strategies that incorporate available knowledge about the distribution shift (e.g., as in [43]) could lead to substantial performance gains at test time.', '78c78', '< Marco Fumero and Emanuele Rodolà were supported by the ERC grant no.802554 (SPECGEO), PRIN 2020 project no.2020TA3K9N (LEGO.AI), and PNRR MUR project PE0000013-FAIR. Marco Fumero and Francesco Locatello were partially at Amazon while working at this project. 
We thank Julius von Kügelgen, Sebastian Lachapelle and the anonymous reviewers for their feedback and suggestions.', '---', '> Marco Fumero and Emanuele Rodolà gratefully acknowledge support from the ERC grant no.802554 (SPECGEO), PRIN 2020 project no.2020TA3K9N (LEGO.AI), and PNRR MUR project PE0000013-FAIR. Marco Fumero and Francesco Locatello were affiliated with Amazon during a portion of this project. We extend our gratitude to Julius von Kügelgen, Sebastian Lachapelle, and the anonymous reviewers for their invaluable feedback and insightful suggestions.', '235c235', '< Formula formula_0: x U g θ f ϕ ẑU g θ ŷU L inner x Q g θ f ϕ * ϕ * ẑQ g θ ŷQ L outer', '---', '> Formula formula_0: Reg(ϕ) = αReg L1 (ϕ) + βReg sharing (ϕ) (1)', '245d244', '< Formula formula_6: Q t with elements (x Q t , y Q ) ∈ Q t . Algorithm.', '250d248', '< ']
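
Note on formula_0 in the last hunk: the revised line reads Reg(ϕ) = αReg L1 (ϕ) + βReg sharing (ϕ), a weighted sum of a sparsity term and a feature sharing term on the linear heads. The sketch below illustrates that combination only. The exact form of the sharing term is not given in this diff, so the entropy-of-feature-usage form used here is an assumption based on the sentence "minimizing the entropy of feature activations across mini-batches of tasks"; the function names and the `heads` layout (one weight list per task) are hypothetical.

```python
import math


def l1_penalty(heads):
    """Mean L1 norm of the per-task linear head weights (sparsity term)."""
    return sum(sum(abs(w) for w in head) for head in heads) / len(heads)


def sharing_penalty(heads, eps=1e-12):
    """Entropy of normalized feature usage across tasks.

    ASSUMED form: usage of feature m is the summed |weight| over tasks;
    low entropy means tasks concentrate on few, shared features.
    """
    n_features = len(heads[0])
    usage = [sum(abs(head[m]) for head in heads) for m in range(n_features)]
    total = sum(usage) + eps
    probs = [u / total for u in usage]
    return -sum(p * math.log(p + eps) for p in probs)


def regularizer(heads, alpha, beta):
    """Reg = alpha * Reg_L1 + beta * Reg_sharing, as in formula_0."""
    return alpha * l1_penalty(heads) + beta * sharing_penalty(heads)


# Two toy task batches over three features (hypothetical numbers).
shared = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]                 # tasks reuse one feature
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one feature per task
print(sharing_penalty(shared), sharing_penalty(spread))
```

Under this assumed form, heads that reuse the same feature incur a sharing penalty near zero, while heads that each pick a different feature incur a penalty near log 3, which matches the direction of the β effect described in the hunks above.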
