['3c3', "< Abstract: Traditional long-tailed learning methods often perform poorly when dealing with inconsistencies between training and test data distributions, and they cannot flexibly adapt to different user preferences for trade-offs between head and tail classes. To address this issue, we propose a novel long-tailed learning paradigm that aims to tackle distribution shift in real-world scenarios and accommodate different user preferences for the trade-off between head and tail classes. We generate a set of diverse expert models via hypernetworks to cover all possible distribution scenarios, and optimize the model ensemble to adapt to any test distribution. Crucially, in any distribution scenario, we can flexibly output a dedicated model solution that matches the user's preference. Extensive experiments demonstrate that our method not only achieves higher performance ceilings but also effectively overcomes distribution shift while allowing controllable adjustments according to user preferences. We provide new insights and a paradigm for the long-tailed learning problem, greatly expanding its applicability in practical scenarios. The code can be found here: https://github.com/DataLab-atom/PRL. * Pengkun Wang and Yang Wang are corresponding authors. 38th Conference on Neural Information Processing Systems (NeurIPS 2024). common situations of distribution shift between training and testing in real-world scenarios. Some more recent works such as RIDE [32] and LADE [12]  propose using multiple expert models to obtain stronger distribution adaptability. Building on this, SADE [38] further adaptively combines the outputs of these experts during testing to adapt to the current test distribution. These approaches alleviate the problem of distribution mismatch between training and testing to some extent [23]. 
However, in addressing distribution shifts across different test scenarios, the goal of these multi-expert model-based methods is still to maximize the overall performance, i.e., pursuing the optimal overall performance metric across all classes, and obtaining a fixed trade-off for this purpose [13,40]. But in different application scenarios, users may have different preferences and needs for the relative trade-off between head and tail classes. Simply pursuing the overall optimal solution may not meet this flexibility requirement [17,35,43]. For example, in classifying lung CT images, when screening for difficult cases, we care more about whether all possible disease types (i.e., tail classes) can be covered to avoid missed diagnoses, compared to routine physical examinations. For some serious diseases such as lung cancer, we may also be willing to moderately increase the false positive rate in exchange for higher coverage of the tail classes, to ensure that no patients are missed. Another example is wildlife detection. Within nature reserves, we want the model to accurately detect common species (i.e., head classes) to understand their population sizes. But when looking for rare species (i.e., tail classes), we care more about covering all species, even at the cost of some false detections. As can be seen, in different application scenarios, there are significant differences in user preferences for the weighting of head and tail categories, which current long-tailed learning methods often fail to fully satisfy. Therefore, developing an interpretable and controllable method for handling long-tail distributions that adapts to specific user preferences for head and tail categories becomes a new research direction in the field of long-tailed learning. In light of this, we propose an interpretable and controllable long-tail learning method (PRL). 
This method aims not only to overcome potential distribution shifts from a single training distribution to any testing distribution but more importantly, to flexibly adjust the weights of head and tail categories according to actual user demands. To address these challenges, we introduce a new long-tailed learning paradigm based on a diverse set of experts and hypernetworks, which can adapt to a wide range of distribution scenarios and meet personalized user preferences. To tackle the aforementioned challenges, we propose a new long-tailed learning paradigm based on diverse experts and hypernetworks, as illustrated in Figure . For the first challenge, existing multi-expert model-based methods train fixed expert models for specific distributions, requiring strong distribution assumptions and struggling to handle more complex and variable distributions. Therefore, instead of maximizing the performance of each expert individually, we pursue modeling and optimizing the hypervolume over the entire Pareto front curve, learning a set of solutions that cover all possible distribution scenarios. This requires us to sample with the goal of covering the entire Pareto front during optimization. For the second challenge, unlike LADE and SADE which output a fixed trade-off solution under distribution shift, we can flexibly output a dedicated model solution that matches the user's preference in any test distribution scenario. In this way, our method can not only adapt to changes in the test distribution, but also allow controllable adjustment of the head-tail trade-off according to the user's actual needs. Our contributions can be summarized as follows: • New scenario and insight: we make the first attempt at a controllable trade-off based on user preferences in the context of long-tailed learning and test distribution shift scenarios, greatly expanding the applicability in real-world scenarios. 
• New learning paradigm: we propose a new interpretable and controllable long-tailed learning method that can acquire the ability to overcome test distribution shift from a single distribution dataset and satisfy user preferences in any shifted distribution scenario. • Compelling empirical results: extensive experiments demonstrate that our method achieves higher performance ceilings, effectively overcomes test distribution shift, and can be controlled by user preferences.", '---', '> Abstract: Traditional long-tailed learning methods often struggle with distribution shifts between training and test data, and lack flexible adaptation to user preferences for head and tail class trade-offs. To address this, we propose a novel long-tailed learning paradigm that leverages hypernetworks to generate a diverse set of expert models. This approach enables the model ensemble to robustly adapt to various test distributions while offering controllable adjustments according to user-defined preferences. Unlike prior methods that yield fixed trade-offs, our paradigm allows for dynamic, interpretable control over the balance between head and tail class performance. Extensive experiments demonstrate that our method not only achieves superior performance but also effectively overcomes distribution shifts and provides unprecedented controllable adjustments based on user preferences. This work offers new insights and a flexible paradigm for long-tailed learning, significantly expanding its practical applicability. The code can be found here: https://github.com/DataLab-atom/PRL. * Pengkun Wang and Yang Wang are corresponding authors. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).', '7c7', '< To address the long-tailed distribution problem, existing research has proposed a series of methods such as re-sampling [25,5,24,10] and modifying the loss function [17,6], with the common idea of focusing on improving the performance of the tail classes. 
However, these methods typically assume that the distributions of the training and test data remain invariant, and thus cannot well handle the  ', '---', '> To address the long-tailed distribution problem, existing research has proposed a series of methods such as re-sampling [25,5,24,10] and modifying the loss function [17,6], with the common idea of focusing on improving the performance of the tail classes. However, these methods typically assume that the distributions of the training and test data remain invariant, and thus cannot well handle the common situations of distribution shift between training and testing in real-world scenarios.', '8a9,21', '> Some more recent works, such as RIDE [32] and LADE [12], propose using multiple expert models to obtain stronger distribution adaptability. Building on this, SADE [38] further adaptively combines the outputs of these experts during testing to adapt to the current test distribution. While these approaches alleviate the problem of distribution mismatch between training and testing to some extent [23], they primarily aim to maximize overall performance, pursuing a fixed optimal performance metric across all classes [13,40]. This fixed trade-off often fails to meet the diverse needs and preferences of users in different application scenarios [17,35,43].', '> ', '> For example, in classifying lung CT images, when screening for difficult cases, a user might prioritize covering all possible disease types (i.e., tail classes) to avoid missed diagnoses, even if it moderately increases the false positive rate for common conditions. Conversely, in routine physical examinations, the focus might be on high accuracy for head classes. 
Another example is wildlife detection: within nature reserves, accurately detecting common species (head classes) is crucial for population monitoring, but when searching for rare species (tail classes), maximizing coverage of all species becomes paramount, potentially at the cost of some false detections. These scenarios highlight significant differences in user preferences for weighting head and tail categories, which current long-tailed learning methods often fail to fully satisfy.', '> ', '> Therefore, developing an interpretable and controllable method for handling long-tail distributions that adapts to specific user preferences for head and tail categories becomes a critical new research direction. In light of this, we propose an interpretable and controllable long-tail learning method (PRL). This method aims not only to overcome potential distribution shifts from a single training distribution to any testing distribution but, more importantly, to flexibly adjust the weights of head and tail categories according to actual user demands.', '> ', "> To address these challenges, we introduce a new long-tailed learning paradigm, PRL, based on diverse experts and hypernetworks, as illustrated in Figure 1. For the first challenge, existing multi-expert model-based methods train fixed expert models for specific distributions, requiring strong distribution assumptions and struggling to handle more complex and variable distributions. Instead of maximizing the performance of each expert individually, we pursue modeling and optimizing the hypervolume over the entire Pareto front curve, learning a set of solutions that cover all possible distribution scenarios. This approach requires us to sample with the goal of covering the entire Pareto front during optimization. 
For the second challenge, unlike LADE and SADE, which output a fixed trade-off solution under distribution shift, PRL can flexibly output a dedicated model solution that matches the user's preference in any test distribution scenario. In this way, our method not only adapts to changes in the test distribution but also allows controllable adjustment of the head-tail trade-off according to the user's actual needs.", '> ', '> Our contributions can be summarized as follows:', '> •   **Novel Problem Formulation and Insight:** To our knowledge, we make the first attempt at a controllable trade-off mechanism based on user preferences in the context of long-tailed learning with test distribution shifts, significantly expanding the applicability of long-tailed learning in real-world scenarios.', '> •   **New Learning Paradigm:** We introduce PRL, an interpretable and controllable long-tailed learning method that leverages hypernetworks to acquire the ability to overcome test distribution shifts from a single training dataset and satisfy diverse user preferences in any shifted distribution scenario.', '> •   **Compelling Empirical Results:** Extensive experiments demonstrate that our method achieves higher performance ceilings, effectively overcomes test distribution shifts, and provides fine-grained control over head-tail class trade-offs according to user preferences across various benchmark datasets.', '> ', '10,11c23,27', '< Long-tailed distributions are prevalent in real-world data, leading to imbalanced datasets that pose challenges for machine learning models [30,20]. To address this issue, researchers have proposed various methods, including re-sampling, loss function modification, and multi-expert models.', '< Re-sampling methods balance class distributions by oversampling tail classes [24,3] or undersampling head classes [7]. Loss function modification approaches assign higher weights to tail class losses [27,26] or use meta-learning to alleviate undersampling issues [14,32]. 
Multi-expert models train multiple experts on different class distributions and combine their outputs, adapting to various test distributions [37,38,31]. Most existing methods assume specific distributions during training or testing, limiting real-world applicability with distribution shifts, and cannot accommodate varying user needs for head and tail class trade-offs. We propose an approach to overcoming distribution assumptions and achieve interpretable, controllable trade-offs in long-tailed learning.', '---', '> Long-tailed distributions are prevalent in real-world data, leading to imbalanced datasets that pose significant challenges for machine learning models [30,20]. To address this issue, researchers have proposed various methods, broadly categorized into re-sampling, loss function modification, and multi-expert models.', '> Re-sampling methods aim to balance class distributions by oversampling tail classes [24,3] or undersampling head classes [7]. While effective in mitigating imbalance, oversampling can lead to overfitting, and undersampling may discard valuable information.', "> Loss function modification approaches assign higher weights to tail class losses [27,26] or use meta-learning to alleviate undersampling issues [14,32]. These methods directly influence the model's learning focus but often require careful hyperparameter tuning.", '> Multi-expert models train multiple specialized experts on different class distributions and combine their outputs, adapting to various test distributions [37,38,31]. These methods show promise in handling distribution shifts.', '> However, most existing methods share common limitations: they often assume specific distributions during training or testing, which limits their real-world applicability in the face of dynamic distribution shifts. Crucially, they cannot accommodate varying user needs for head and tail class trade-offs, providing only a fixed, "one-size-fits-all" solution. 
Our proposed approach directly tackles these limitations by overcoming rigid distribution assumptions and achieving interpretable, controllable trade-offs in long-tailed learning.', '14,32c30', "< In this section, we analyze the distribution shift problem from a theoretical perspective and provide the definition and properties of the environment's total variation distance, laying the theoretical foundation for the methods section. Traditional empirical risk minimization (ERM) methods on a single training distribution struggle to handle distribution discrepancy, which can affect generalization. This limitation can be characterized by the following theorem:", '< Theorem 1 (Limitation of ERM). Let f (x; θ) be a classifier learned via ERM on E m , then its risk on the test environment E test is:', '< R test (f ) = R m (f ) + K i=1 (π test i -π m i ) • E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)](1)', '< where R m (f ) and R test (f ) are the risks of f on E m and E test , respectively, and π test is the class prior of the test environment.', "< To measure the distribution discrepancy across environments, we introduce the environment's total variation distance (ETVD): Definition 2 (Environment Total Variation Distance). The total variation distance between environments E i and E j is defined as:", '< δ(E i , E j ) = 1 2 K k=1 |π i k -π j k |', '< , and the ETVD of M environments is defined as: ∆(E 1 , . . . , E M ) = max i,j∈{1,...,M }', '< δ(E i , E j )', '< Using ETVD, we can further bound the risk of the ERM-learned classifier on the test environment: Corollary 1. Under the assumptions of Theorem 1, let M = max i,x ℓ(f (x; θ), i), then', '< R test (f ) ≤ R m (f ) + 2M • (δ(E m , E test ) + ∆(E 1 , . . . 
, E M ))(2)', '< This corollary shows that the test risk of the ERM-learned classifier is affected not only by the distribution discrepancy between the training environment and the test environment but also by the distribution discrepancy among training environments (i.e., ETVD). To overcome the diversity shift, we propose minimizing the empirical risks across multiple training environments to capture the distributional characteristics of different environments, thereby learning a set of diversity experts.', '< Next, we provide a theoretical analysis of the domain adaptation algorithm based on diversity experts proposed in this paper. To characterize the generalization performance of this algorithm, we first introduce the following notations:', '< Let {f 1 , . . . , f N } be the N experts learned via ERM on the N training environments {E 1 , . . . , E N }, respectively, and f be the final classifier obtained by ensembling these N experts. Define the empirical risk of the ensemble classifier f on environment E m as:', '< Rm ( f ) = 1 N N i=1 R m (f i )(3)', '< We can obtain the following theorem regarding the generalization performance of the ensemble classifier: Theorem 2. Under the above notations and definitions, the risk of the ensemble classifier f on the test environment E test satisfies:', '< R test ( f ) ≤ 1 N N m=1 R m (f m ) + 2M • 1 N N m=1 δ(E m , E test ) + N -1 N ∆(E 1 , . . . , E N )(4)', '< where', '< M = max i,x ℓ( f (x), i).', '< Theorem 2 shows that the test risk of the ensemble classifier consists of three parts: the average empirical risk of all experts, the average total variation distance between the training environments and the test environment, and the weighted average of ETVD among the training environments. 
Compared to single-environment ERM, the diversity experts method learns a set of experts to capture the distributional characteristics of different environments, which can reduce the distribution discrepancy between the training environments and the test environment, thereby achieving better generalization performance.', '---', '> In this section, we provide a theoretical analysis of the distribution shift problem and introduce the concept of Environment Total Variation Distance (ETVD), which forms the theoretical foundation for our proposed methodology. Traditional Empirical Risk Minimization (ERM) methods, when trained on a single distribution, often struggle to generalize effectively under distribution shifts, a limitation formally characterized by the following theorem:', '33a32,56', '> Theorem 1 (Limitation of ERM). Let f (x; θ) be a classifier learned via ERM on environment E_m . Its risk on a test environment E_{test} is given by:', '> R_{test} (f ) = R_m (f ) + \\sum_{i=1}^K (\\pi^{test}_i - \\pi^{m}_i) • E_{x∼P_{test}(x|y=i)} [ℓ(f (x; θ), i)] (1)', '> where R_m (f ) and R_{test} (f ) are the risks of f on E_m and E_{test} , respectively, and \\pi^{test} and \\pi^{m} are the class priors of the test environment and of the training environment E_m .', '> ', '> To quantify the distribution discrepancy across different environments, we introduce the Environment Total Variation Distance (ETVD):', '> Definition 2 (Environment Total Variation Distance). The total variation distance between environments E_i and E_j is defined as:', '> δ(E_i , E_j ) = \\frac{1}{2} \\sum_{k=1}^K |\\pi^{i}_k - \\pi^{j}_k|', '> , and the ETVD of M environments is defined as:', '> ∆(E_1 , \\ldots , E_M ) = \\max_{i,j∈\\{1,\\ldots,M \\}} δ(E_i , E_j )', '> Using ETVD, we can further bound the risk of the ERM-learned classifier on the test environment:', '> ', '> Corollary 1. Under the assumptions of Theorem 1, let M = \\max_{i,x} ℓ(f (x; θ), i). 
Then,', '> R_{test} (f ) ≤ R_m (f ) + 2M • (δ(E_m , E_{test}) + ∆(E_1 , \\ldots , E_M )) (2)', '> This corollary demonstrates that the test risk of an ERM-learned classifier is influenced not only by the direct distribution discrepancy between its training environment and the test environment but also by the overall distribution discrepancy among all available training environments (i.e., ETVD). To mitigate this, our approach involves minimizing empirical risks across multiple training environments to capture diverse distributional characteristics, thereby learning a set of diverse experts.', '> ', '> Next, we provide a theoretical analysis of the generalization performance of our diversity-aware ensemble algorithm. We first introduce the following notations:', '> Let $\\{f_1 , \\ldots , f_N \\}$ be the N experts learned via ERM on the N training environments $\\{E_1 , \\ldots , E_N \\}$, respectively, and f be the final classifier obtained by ensembling these N experts. Define the empirical risk of the ensemble classifier f on environment E_m as:', '> R_m ( f ) = \\frac{1}{N} \\sum_{i=1}^N R_m (f_i) (3)', '> We can then state the following theorem regarding the generalization performance of the ensemble classifier:', '> ', '> Theorem 2. Under the above notations and definitions, the risk of the ensemble classifier f on the test environment E_{test} satisfies:', '> R_{test} ( f ) ≤ \\frac{1}{N} \\sum_{m=1}^N R_m (f_m) + 2M • \\frac{1}{N} \\sum_{m=1}^N δ(E_m , E_{test}) + \\frac{N - 1}{N} ∆(E_1 , \\ldots , E_N) (4)', '> where M = \\max_{i,x} ℓ( f (x), i).', '> Theorem 2 reveals that the test risk of the ensemble classifier is composed of three key components: the average empirical risk of all experts, the average total variation distance between the training environments and the test environment, and a weighted average of the ETVD among the training environments. 
Compared to single-environment ERM, the diverse-experts method learns a set of experts that collectively capture the distributional characteristics of various environments. This can reduce the distribution discrepancy between the training and test environments, which in turn improves generalization performance.', '> ', '49,57c72,80', '< Section: Diverse Experts', '< Let X and Y denote the input and output spaces, respectively. We introduce T = 3 classifiers {f i } T i=1 as diverse experts. These experts share a feature extractor ϕ θ : X → R d , but use different classifier heads {g wi } T i=1 :', '< f i (x) = g wi (ϕ θ (x)), i = 1, . . . , T(8)', '< To generate diverse experts, we introduce a hypernetwork h ψ that takes random noise z ∈ R k as input and outputs the classifier head parameters w i ∈ R d×C :', '< w i = h ψ (z i ), z i ∼ Dir(α), i = 1, . . . , T(9)', '< where Dir(α) is the Dirichlet distribution with parameter α ∈ R k + . The hypernetwork h ψ consists of three linear layers with ReLU activations.', '< During training, we sample {z i } T i=1 from Dir(α) and use h ψ to generate {w i } T i=1 . The loss function for a training batch is:', '< L = T i=1 L i (f i )(10)', '< where L i is the classification loss for the i-th expert f i , defined the same as in SADE: L 1 is the standard cross-entropy loss; L 2 is the balanced softmax loss, where the logits are adjusted by adding the log of the prior probabilities of each class; L 3 is the inverse softmax loss, where the logits are adjusted by adding the log of the prior probabilities and subtracting the scaled log of the inverse prior probabilities.', '---', '> Section: Diverse Experts via Hypernetworks', '> We introduce T = 3 classifiers \\{f_i\\}_{i=1}^T as diverse experts, designed to capture different aspects of the long-tailed distribution. 
These experts share a common feature extractor ϕ_θ : X → R^d , but each utilizes a distinct classifier head g_{w_i} :', '> f_i (x) = g_{w_i} (ϕ_θ (x)), for i = 1, \\ldots , T (8)', '> To generate these diverse experts, we employ a hypernetwork h_ψ . The hypernetwork takes a low-dimensional random noise vector z ∈ R^k as input and generates the parameters w_i ∈ R^{d×C} for each classifier head:', '> w_i = h_ψ (z_i ), where z_i ∼ Dir(α), for i = 1, \\ldots , T (9)', '> Here, Dir(α) denotes the Dirichlet distribution with parameter α ∈ R^k_+ , which helps in generating diverse and well-distributed input noise vectors. The hypernetwork h_ψ itself consists of three linear layers with ReLU activations.', '> During training, we sample \\{z_i\\}_{i=1}^T from Dir(α) and use h_ψ to generate the corresponding \\{w_i\\}_{i=1}^T . The overall loss function for a given training batch is defined as the sum of individual expert losses:', '> L = \\sum_{i=1}^T L_i (f_i ) (10)', '> Each L_i is the classification loss for the i-th expert f_i . Following SADE, we define these losses to encourage diversity: L_1 is the standard cross-entropy loss; L_2 is the balanced softmax loss, which adjusts logits by adding the log of class prior probabilities; and L_3 is the inverse softmax loss, which adjusts logits by adding the log of prior probabilities and subtracting scaled log inverse prior probabilities. This setup ensures that experts are trained with varying inductive biases, promoting specialization.', '59,68c82,90', '< Section: Stochastic Convex Ensemble', '< Let L i (Θ, D) denote the loss function of the i-th expert f i on dataset D, where Θ = {θ, ψ} represents all trainable parameters. 
The objective is to jointly optimize the losses of all T experts:', '< min Θ T i=1 L i (Θ, D)(11)', '< To promote diversity among experts, we introduce the Stochastic Convex Ensemble (SCE) strategy, which aims to minimize the worst-case loss of the convex combination of experts:', '< min Θ max p∈∆ T T i=1 p i L i (Θ, D)(12)', '< where p = (p 1 , • • • , p T ) ⊤ ∈ ∆ T is the weight vector, and', '< ∆ T := {p ∈ R T + | T i=1 p i = 1} is the T -dimensional simplex.', '< Inspired by the max-min inequality, we relax the SCE objective to:', '< min Θ T i=1 L i (Θ, D) + λ • log T i=1 exp 1 λ L i (Θ, D)(13)', '< where λ > 0 is a hyperparameter. As λ → 0, the relaxed objective approaches the original SCE objective. The term λ • log T i=1 exp 1 λ L i (Θ, D) promotes diversity among experts.', '---', '> Section: Stochastic Convex Ensemble for Diversity', '> Let L_i (Θ, D) denote the loss function of the i-th expert f_i on dataset D, where Θ = \\{θ, ψ\\} encompasses all trainable parameters (feature extractor and hypernetwork). Our objective is to jointly optimize the losses of all T experts:', '> \\min_Θ \\sum_{i=1}^T L_i (Θ, D) (11)', '> To further promote diversity and robustness among the experts, we introduce the Stochastic Convex Ensemble (SCE) strategy. SCE aims to minimize the worst-case loss of a convex combination of experts, effectively searching for a solution that performs well even under adversarial weighting of experts:', '> \\min_Θ \\max_{p∈∆_T} \\sum_{i=1}^T p_i L_i (Θ, D) (12)', '> Here, p = (p_1 , \\ldots , p_T )^⊤ ∈ ∆_T is a weight vector, and ∆_T := \\{p ∈ R^T_+ | \\sum_{i=1}^T p_i = 1\\} represents the T -dimensional simplex. This objective encourages the model to be robust against different expert weightings.', '> Inspired by the max-min inequality, we relax the SCE objective into a more tractable form:', '> \\min_Θ \\sum_{i=1}^T L_i (Θ, D) + λ • \\log \\sum_{i=1}^T \\exp(\\frac{1}{λ} L_i (Θ, D)) (13)', '> where λ > 0 is a temperature hyperparameter that controls how closely the log-sum-exp term approximates the worst-case expert loss. 
As λ → 0, the log-sum-exp term λ • \\log \\sum_{i=1}^T \\exp(\\frac{1}{λ} L_i (Θ, D)) approaches \\max_i L_i (Θ, D), the value of the inner maximization in the original SCE objective. This term promotes diversity among the experts by penalizing the worst-performing expert most heavily, thus encouraging a more balanced performance across the ensemble.', '71c93', '< During testing, we can control the trade-off between head and tail classes using a preference vector', '---', '> A key advantage of our proposed paradigm is the ability to control the trade-off between head and tail classes during inference using a user-defined preference vector. This preference vector, denoted as', '73,80c95,103', '< , where ∆ 3 is the 3-dimensional simplex. Given a trained preference vector r = (r 1 , r 2 , r 3 ) ⊤ ∈ ∆ 3 , we compute the test-time preference vector r ∈ ∆ 3 as:', '< r = r ⊙ α * r ⊤ α * (14', '< )', '< where ⊙ denotes the Hadamard product. The test-time preference vector r is then input to the hypernetwork h ψ to generate the classifier head parameters for each expert:', '< Ŵi = h ψ (r), i = 1, • • • , T(15)', "< where Ŵi ∈ R d×C is the weight matrix for the i-th expert's classifier head. For a test sample with feature vector x ∈ R d , the output of the i-th expert is:", '< ŷi = \uf8f1 \uf8f4 \uf8f2 \uf8f4 \uf8f3 x ⊤ ∥x∥ 2 • Ŵi ∥ Ŵi ∥ F , if normalized x ⊤ Ŵi + b⊤ i , otherwise(16)', "< where ∥ • ∥ F is the Frobenius norm and bi ∈ R C is the bias vector for the i-th expert. By adjusting α * , we can control the model's focus on head or tail classes, enabling flexible trade-offs to suit different application needs. An observation on our method. To better understand this part, we use Figure 2 to demonstrate the effectiveness of preference control in overcoming distribution shifts, as well as the flexibility of our method. 
For preferences, the coordinate system is a three-dimensional orthogonal coordinate; for accuracy, the coordinate system represents the performance on the farward50, uni., and backward50 splits of the CIFAR100-LT dataset. The dark plane represents the plane formed by different preference vectors, and the outer surface represents the corresponding performance on the three distributions for these preference vectors. The yellow dots are the results of running SADE, whose preferences are uncontrollable, so the results of each run are random dots, lying below our purple plane, indicating that their performance is lower than our method (i.e., being dominated in the Pareto optimal set). This figure illustrates that our method can cover unknown distributions without additional training, and unlike previous methods, it can trade off performance by adjusting the preference vector. We will analyze this in more depth in the experimental section. ", '---', '> , where ∆ 3 is the 3-dimensional simplex, allows users to explicitly specify their desired emphasis on different class groups.', '> Given a pre-trained preference vector r = (r 1 , r 2 , r 3 ) ⊤ ∈ ∆ 3 , we compute the test-time adjusted preference vector r ∈ ∆ 3 as:', '> r = r ⊙ α * / (r ⊤ α *) (14)', '> where ⊙ denotes the Hadamard (element-wise) product. This normalization ensures that the adjusted preference vector r remains within the simplex. The adjusted preference vector r is then fed into the hypernetwork h ψ to dynamically generate the classifier head parameters for each expert:', '> Ŵi = h ψ (r), for i = 1, • • • , T (15)', "> Here, Ŵi ∈ R d×C represents the weight matrix for the i-th expert's classifier head. 
For a given test sample with feature vector x ∈ R^d , the output of the i-th expert is computed as:", '> ŷ_i = \\begin{cases} \\frac{x^⊤}{∥x∥_2} • \\frac{Ŵ_i}{∥Ŵ_i∥_F} , & \\text{if normalized} \\\\ x^⊤ Ŵ_i + b_i^⊤ , & \\text{otherwise} \\end{cases} (16)', "> where ∥ • ∥_F is the Frobenius norm, and b_i ∈ R^C is the bias vector for the i-th expert. By simply adjusting α * , users can control the model's focus on head or tail classes, enabling flexible trade-offs to suit diverse application needs without requiring model retraining.", "> To illustrate the effectiveness of this preference control, Figure 2 shows how our method overcomes distribution shifts while remaining flexible. The preference axes form a three-dimensional orthogonal coordinate system, while the accuracy axes report performance on the forward50, uniform, and backward50 splits of the CIFAR100-LT dataset. The dark plane is spanned by the different preference vectors, and the outer surface shows the corresponding performance on the three distributions. Yellow dots are results from runs of SADE, whose preferences cannot be controlled, so each run lands at an arbitrary point. These points lie below our purple plane, indicating that SADE is Pareto-dominated by our method. The figure illustrates that our method can cover unknown distributions without additional training and, unlike previous methods, can trade off performance by adjusting the preference vector. A more in-depth analysis of this mechanism is provided in the experimental section.", '86,88c109,111', '< Datasets. We evaluate our method on four benchmark datasets: ImageNet-LT [20], CIFAR100-LT [4], Places-LT [20], and iNaturalist 2018 [29]. These datasets have varying imbalance ratios, ranging from 10 to 256. CIFAR100-LT has three versions with different imbalance ratios. Detailed statistics are in Appendix D.', '< Baselines. 
We compare PRL with various state-of-the-art long-tailed recognition methods, including two-stage methods (MiSLAS [41]), logit-adjusted training (Balanced Softmax [15], LADE [12]), ensemble learning (RIDE [32], SADE [38]), causal inference (Causal [28]), representation learning (LSC [33]), and balanced posterior averaging (BalPoE [1]). These methods address the long-tail problem from different perspectives. Further details are provided in the appendixA.', '< Evaluation protocols and implementation details. We evaluate the models on multiple test datasets with different class distributions using micro accuracy. We report the accuracy of many-shot, mediumshot, and few-shot classes. We use the same setup for all methods, including ResNeXt-50 for ImageNet-LT, ResNet-32 for CIFAR100-LT, ResNet-152 for Places-LT, and ResNet-50 for iNaturalist 2018 as backbones. We employ hypernets (MLPs) to output trainable parameters of experts and adopt the cosine classifier for prediction. Unless specified, we use α = 1.2 for the Dirichlet distribution, µ = 0.3 for stochastic annealing, SGD with momentum 0.9, train for 200 epochs, and set the initial learning rate to 0.1 with linear decay. During test-time training, we train aggregation weights for 5 epochs with a batch size of 128, using the same optimizer and learning rate as in training. Other details please refer to Appendix G.', '---', '> Datasets. We evaluate our method on four widely-used benchmark datasets: ImageNet-LT [20], CIFAR100-LT [4], Places-LT [20], and iNaturalist 2018 [29]. These datasets exhibit varying imbalance ratios, ranging from 10 to 256, providing a comprehensive testbed. CIFAR100-LT is evaluated with three different imbalance ratios (IR=10, 50, and 100). Detailed statistics for all datasets are provided in Appendix D.', '> Baselines. We conduct comprehensive comparisons against various state-of-the-art long-tailed recognition methods. 
These include two-stage methods (MiSLAS [41]), logit-adjusted training methods (Balanced Softmax [15], LADE [12]), ensemble learning approaches (RIDE [32], SADE [38]), causal inference models (Causal [28]), representation learning techniques (LSC [33]), and balanced posterior averaging methods (BalPoE [1]). Each of these baselines addresses the long-tail problem from distinct perspectives. Further details on these methods can be found in Appendix A.', '> Evaluation Protocols and Implementation Details. We evaluate all models using micro accuracy on multiple test datasets, each featuring different class distributions. Performance is reported for many-shot, medium-shot, and few-shot classes to provide a granular view of class-wise performance. For fair comparison, we maintain a consistent backbone architecture across all methods for each dataset: ResNeXt-50 for ImageNet-LT, ResNet-32 for CIFAR100-LT, ResNet-152 for Places-LT, and ResNet-50 for iNaturalist 2018. Our method employs hypernetworks (implemented as MLPs) to generate the trainable parameters of the expert classifier heads, and we utilize a cosine classifier for final predictions. Unless otherwise specified, we use α = 1.2 for the Dirichlet distribution, µ = 0.3 for stochastic annealing, and train for 200 epochs using SGD with a momentum of 0.9. The initial learning rate is set to 0.1 with a linear decay schedule. During the test-time adaptation phase, aggregation weights are trained for 5 epochs with a batch size of 128, using the same optimizer and learning rate as in the main training phase. Additional implementation details are provided in Appendix G.', '95,98c118', '< On different test distributions of both datasets, PRL consistently outperforms all compared methods. On CIFAR100-LT (IR100), PRL achieves the highest accuracy across all settings, surpassing LSC [33], BalPoE [1], and SADE [38].', '< Even under the most challenging backward LT distribution, PRL can maintain its outstanding performance. 
On ImageNet-LT, PRL obtains the best results across all test distributions, significantly outperforming LSC, BalPoE, and SADE.', "< The consistent improvements achieved by PRL highlight its higher performance ceiling, indicating the effectiveness of our method design in overcoming distribution shifts. We further conduct distribution-shift experiments on Places-LT and iNaturalist 2018, where PRL also achieves impressive results. Please refer to the AppendixF for detailed results.  User preference control. We evaluated the model's performance on many-shot, medium-shot, and few-shot classes on CIFAR100-LT under different preference settings (R=(1.0, 2.7), R=(0.5, 2.5), R=(1.9, 1.1)). Table 4 shows that by adjusting the preference value R, we can effectively control the trade-off between many-shot and few-shot classes. When R=(1.0, 2.7), the model performs best on many-shot classes; when R=(1.9, 1.1), the model performs better on few-shot classes at the cost of a slight drop in performance on many-shot classes. R=(0.5, 2.5) achieves the best performance on medium-shot classes, indicating that our method can balance performance across different classes with an appropriate setting. As shown in Figure 3, we analyze the performance trade-offs between head and tail classes across three different distributions, demonstrating how our preference control mechanism allows flexible adjustment of model behavior. The results clearly show how adjusting preferences affects accuracy across different class frequency groups. Figure 4 provides a more comprehensive visualization of the performance on the head classes under the forward50 distribution. The plane represents the performance of the head classes without inputting any preference, while the red dots indicate the preference positions in polar coordinates that can improve the performance of the head classes, and the green dots represent the preference positions that may degrade the performance. 
These experimental results demonstrate the effectiveness of our method in controlling the trade-off for long-tailed classes based on user preferences. By adjusting the preference without the need for retraining the model, we can flexibly adapt to different application scenarios and requirements, achieving a desired trade-off in long-tailed recognition tasks that aligns with practical needs.", "< Ablation study. We conduct ablation studies on CIFAR100-LT to evaluate the impact of removing the hypernetwork (w.o. hnet) and removing the Chebyshev polynomial (w.o. stch) on the model's performance under different unknown test class distributions (as shown in Figure 5). The complete model (ours) performs best across all distributions. Removing either the hypernetwork or the Chebyshev polynomial leads to performance degradation, highlighting their importance in dynamically adjusting the model behavior to adapt to distribution shifts and learning preference-aware representations. This ablation study verifies the effectiveness of different components in our method, which work together to better handle unknown test distributions and data imbalance issues in long-tailed recognition.", '---', '> On different test distributions, PRL consistently outperforms all compared methods. Specifically, on CIFAR100-LT (IR=100), PRL achieves the highest accuracy across all settings, surpassing strong baselines like LSC [33], BalPoE [1], and SADE [38]. Even under the most challenging backward LT distribution, PRL maintains its outstanding performance. Similarly, on ImageNet-LT, PRL obtains the best results across all test distributions, significantly outperforming LSC, BalPoE, and SADE. The consistent improvements achieved by PRL highlight its higher performance ceiling and demonstrate the effectiveness of our method design in robustly overcoming distribution shifts. 
Detailed distribution-shift experimental results for Places-LT and iNaturalist 2018 can be found in Appendix F.', '99a120,123', "> User preference control. We extensively evaluated PRL's performance on many-shot, medium-shot, and few-shot classes on CIFAR100-LT under various preference settings, represented by R vectors (e.g., R=(1.0, 2.7), R=(0.5, 2.5), R=(1.9, 1.1)). As shown in Table 4, by adjusting the preference value R, we can effectively control the trade-off between many-shot and few-shot classes. For instance, R=(1.0, 2.7) yields the best performance on many-shot classes, while R=(1.9, 1.1) prioritizes few-shot classes, albeit with a slight drop in many-shot performance. An intermediate setting like R=(0.5, 2.5) achieves optimal performance on medium-shot classes, demonstrating the method's ability to balance performance across different class frequency groups. Figure 3 visually analyzes these performance trade-offs across three different distributions, clearly illustrating how our preference control mechanism enables flexible adjustment of model behavior. Figure 4 provides a more comprehensive visualization of performance on head classes under the forward50 distribution, showing how specific preference positions (red dots for improvement, green for degradation) in polar coordinates influence head class accuracy. These results unequivocally demonstrate the effectiveness of our method in controlling long-tailed class trade-offs based on user preferences. This ability to adapt to different application scenarios and requirements by adjusting preferences, without requiring model retraining, is a significant advancement in achieving desired trade-offs in long-tailed recognition tasks.", '> ', "> Ablation study. We conducted ablation studies on CIFAR100-LT to dissect the impact of key components: removing the hypernetwork (w.o. hnet) and removing the Chebyshev polynomial (w.o. stch). 
Figure 5 illustrates the performance under different unknown test class distributions. The complete PRL model consistently outperforms its ablated versions across all distributions. Removing either the hypernetwork or the Chebyshev polynomial leads to noticeable performance degradation, underscoring their critical importance in dynamically adjusting model behavior to adapt to distribution shifts and learning preference-aware representations. This ablation study validates the synergistic effectiveness of our method's components in handling unknown test distributions and mitigating data imbalance issues in long-tailed recognition.", '> ', '104,109c128,133', '< • Oversampling [5,9]: Generates synthetic examples for minority classes.', '< -Alleviates imbalance by increasing tail class samples.', '< -Can lead to overfitting and high computational cost.', '< • Undersampling [7,2]: Removes examples from majority classes.', '< -Simple and efficient approach to balance classes.', '< -Discards potentially valuable head class information.', '---', '> •   **Oversampling** [5,9]: Generates synthetic examples for minority classes.', '>     -   *Advantages*: Alleviates imbalance by increasing tail class samples.', '>     -   *Disadvantages*: Can lead to overfitting on synthetic data and high computational cost.', '> •   **Undersampling** [7,2]: Removes examples from majority classes.', '>     -   *Advantages*: Simple and efficient approach to balance classes.', '>     -   *Disadvantages*: Discards potentially valuable head class information, leading to reduced overall performance.', '112,119c136,144', '< • Focal Loss [17] and variants [6]:', '< -Imposes larger penalties on well-classified examples, encouraging focus on hard samples.', '< -Requires careful tuning of focusing parameter.', '< • Class-Balanced Loss [6]:', '< -Re-weights loss based on effective number of samples per class.', '< -Assumes equal importance of classes, which may not hold in practice.', "< • LDAM [4]: 
-Explicitly models each example's contribution to the gradient direction.", '< -Requires additional hyperparameters and complex optimization.', '---', '> •   **Focal Loss** [17] and variants [6]:', '>     -   *Mechanism*: Down-weights the loss on well-classified examples, encouraging the model to focus on hard samples, particularly from tail classes.', '>     -   *Disadvantages*: Requires careful tuning of focusing parameters and may not generalize across all imbalance ratios.', '> •   **Class-Balanced Loss** [6]:', '>     -   *Mechanism*: Re-weights loss based on the effective number of samples per class, giving more importance to tail classes.', '>     -   *Disadvantages*: Assumes equal importance of classes, which may not hold in practice, and can sometimes overcompensate.', '> •   **LDAM (Label-Distribution-Aware Margin Loss)** [4]:', ">     -   *Mechanism*: Explicitly models each example's contribution to the gradient direction by introducing class-dependent margins.", '>     -   *Disadvantages*: Requires additional hyperparameters and can involve complex optimization, making it harder to implement and tune.', '122,130c147,155', '< • Decoupled Learning [16,43]:', '< -Separates representation and classifier learning for better feature extraction.', '< -Requires architectural changes, may not generalize well.', '< • Few-Shot Experts [32]:', '< -Employs additional experts to handle few-shot classes.', '< -Increased model complexity and training difficulty.', '< • Self-Supervised Pretraining [14]:', '< -Leverages self-supervision to improve feature representations.', '< -Requires additional pretraining, benefits may be task-specific.', '---', '> •   **Decoupled Learning** [16,43]:', '>     -   *Mechanism*: Separates representation learning from classifier training to mitigate bias towards head classes and achieve better feature extraction.', '>     -   *Disadvantages*: Requires architectural changes, and the two-stage process may not generalize well across all 
datasets or tasks.', '> •   **Few-Shot Experts** [32]:', '>     -   *Mechanism*: Employs additional specialized experts specifically designed to handle few-shot classes, often trained with different strategies.', '>     -   *Disadvantages*: Increases model complexity and training difficulty due to managing multiple expert networks.', '> •   **Self-Supervised Pretraining** [14]:', '>     -   *Mechanism*: Leverages self-supervision to learn robust and balanced feature representations before fine-tuning on the long-tailed dataset.', '>     -   *Disadvantages*: Requires additional pretraining resources, and the benefits may be highly task-specific.', '133,139c158,164', '< • Data-Based Transfer [20,14]:', '< -Knowledge distillation and feature transformation can transfer head knowledge to tails.', '< -Assumes head and tail distributions are related, may suffer negative transfer.', '< • Model-Based Transfer [36]:', '< -Utilizes models pretrained on heads to facilitate tail class learning.', '< -Again assumes related head and tail distributions.', '< Despite progress, existing LTL methods face limitations in addressing the inherent head-tail trade-off, handling distribution shifts, and accommodating varying user preferences. 
To overcome these issues, we formulate LTL as a multi-objective optimization problem and propose a hypernetwork-based diverse expert learning paradigm, achieving interpretable and controllable solutions tailored to user needs under test distribution shifts.', '---', '> •   **Data-Based Transfer** [20,14]:', '>     -   *Mechanism*: Techniques like knowledge distillation and feature transformation are used to transfer knowledge from head classes to tail classes.', '>     -   *Disadvantages*: Assumes head and tail distributions are sufficiently related, and may suffer from negative transfer if this assumption is violated.', '> •   **Model-Based Transfer** [36]:', '>     -   *Mechanism*: Utilizes models pretrained on more abundant head classes to facilitate the learning of tail classes.', '>     -   *Disadvantages*: Similar to data-based transfer, it assumes related head and tail distributions, which can lead to suboptimal performance if the domains diverge significantly.', '> Despite significant progress, existing LTL methods face inherent limitations in addressing the trade-off between head and tail classes, handling diverse distribution shifts, and accommodating varying user preferences. To overcome these critical issues, we formulate long-tailed learning as a multi-objective optimization problem and propose a novel hypernetwork-based diverse expert learning paradigm, achieving interpretable and controllable solutions tailored to user needs under arbitrary test distribution shifts.', '157,162c182,186', '< Hypernetworks [18] provide a promising approach for multi-objective optimization of neural networks. A hypernetwork h ϕ : Z → Θ is a neural network that takes a low-dimensional input z ∈ Z and outputs the parameters θ ∈ Θ of a target neural network f θ : X → Y. By sampling different z ∈ Z, the hypernetwork generates an ensemble {f θi } i where θ i = h ϕ (z i ). 
This ensemble can approximate the Pareto front of the multi-objective optimization problem:', '< min θ∈Θ {L 1 (f θ ), . . . , L m (f θ )}(19)', '< where L i : Θ → R are loss functions corresponding to the m objectives. The hypernetwork parameters ϕ can be optimized via scalarizations like the Chebyshev method:', '< min ϕ E w∼p(w) max 1≤i≤m w i L i (f h ϕ (z) ) -z * i (20', '< )', '< where p(w) is a distribution over weight vectors w. This enables learning a diverse set of target networks approximating the Pareto front in a flexible and controllable manner.', '---', '> Hypernetworks [18] offer a promising approach for multi-objective optimization of neural networks. A hypernetwork h ϕ : Z → Θ is a neural network that takes a low-dimensional input z ∈ Z and outputs the parameters θ ∈ Θ of a target neural network f θ : X → Y. By sampling different z ∈ Z, the hypernetwork generates an ensemble {f θi } i where θ i = h ϕ (z i ). This ensemble can effectively approximate the Pareto front of the multi-objective optimization problem, which is defined as:', '> min θ∈Θ {L 1 (f θ ), . . . , L m (f θ )} (19)', '> Here, L i : Θ → R are loss functions corresponding to the m distinct objectives. The hypernetwork parameters ϕ can be optimized using various scalarization methods, such as the Chebyshev method:', '> min ϕ E w∼p(w) [max 1≤i≤m w i (L i (f h ϕ (z) ) -z * i )] (20)', '> where p(w) is a distribution over weight vectors w, and z* is a reference point. 
This approach enables learning a diverse set of target networks that collectively approximate the Pareto front in a flexible and controllable manner, which is crucial for our controllable long-tailed learning paradigm.', '199c223', '< Next, we will explain the connection between the theoretical part of the paper and the proposed method, to aid better understanding  The above theoretical analysis demonstrates that, in the long-tailed learning domain, introducing multiple training environments and minimizing the empirical risks on these environments to learn a set of diverse experts can effectively address the problem of distribution shift between the training and test environments, leading to better generalization performance. These theoretical insights provide important guidance for further improving our algorithm.', '---', '> Next, we explain the connection between the theoretical analysis and our proposed method. The theoretical results demonstrate that, in long-tailed learning, introducing multiple training environments and minimizing empirical risks across these environments to learn a set of diverse experts can effectively address the problem of distribution shift between training and test environments, leading to better generalization performance. These theoretical insights provide important guidance for the design and further improvement of our algorithm.', '202,204c226,231', '< To evaluate the effectiveness of our proposed method, we conduct experiments on four long-tailed datasets: CIFAR100-LT, ImageNet-LT, iNaturalist 2018, and Places365-LT. These datasets cover a diverse range of domains and exhibit varying degrees of class imbalance, providing a comprehensive testbed for long-tailed learning algorithms. ImageNet-LT [20] is a long-tailed subset of the ImageNet dataset, containing over 115,000 images spanning 1,000 classes. 
The class cardinalities follow a Pareto distribution with α = 6, leading to a maximum imbalance ratio of 256.', '< iNaturalist 2018 [29] is a real-world dataset with a natural long-tailed distribution, comprising approximately 450,000 images across 8,142 species. The number of images per species varies drastically, with an imbalance ratio of up to 500, posing a significant challenge due to the extreme class imbalance and high intra-class variation.', '< Places365-LT is a long-tailed version of the Places365 dataset [42], which consists of over 1.8 million images spanning 365 scene categories. We induce a long-tailed distribution by randomly subsampling the images for each class, resulting in an imbalance ratio of approximately 50. This dataset is particularly challenging due to the large number of classes and the inherent visual ambiguity present in scene recognition tasks.', '---', '> To thoroughly evaluate the effectiveness and generalization capabilities of our proposed method, we conduct extensive experiments on four widely-recognized long-tailed datasets: CIFAR100-LT, ImageNet-LT, iNaturalist 2018, and Places365-LT. These datasets represent a diverse range of domains and exhibit varying degrees of class imbalance, thereby providing a comprehensive and challenging testbed for long-tailed learning algorithms.', '> •   **ImageNet-LT** [20]: This dataset is a long-tailed subset derived from the large-scale ImageNet dataset. It comprises over 115,000 images distributed across 1,000 classes. The class cardinalities follow a Pareto distribution with a parameter α = 6, resulting in a significant maximum imbalance ratio of 256.', '> •   **iNaturalist 2018** [29]: A real-world dataset characterized by a naturally occurring long-tailed distribution. It contains approximately 450,000 images spanning 8,142 distinct species. The number of images per species varies drastically, with an extreme imbalance ratio reaching up to 500. 
This dataset poses a substantial challenge due to its severe class imbalance and high intra-class variation.', '> •   **Places365-LT**: This is a long-tailed variant of the Places365 dataset [42], which originally consists of over 1.8 million images categorized into 365 scene classes. We induce a long-tailed distribution by randomly subsampling images for each class, achieving an imbalance ratio of approximately 50. This dataset is particularly challenging given the large number of classes and the inherent visual ambiguity in scene recognition tasks.', "> •   **CIFAR100-LT** [4]: This dataset is a long-tailed version of the standard CIFAR100 dataset. We evaluate our method on three distinct versions of CIFAR100-LT, corresponding to imbalance ratios (IR) of 10, 50, and 100. These controlled settings allow for a systematic analysis of our method's performance under different levels of data imbalance.", '> Detailed statistics for all datasets, including class distributions and sample counts, are provided in Table 5.', '207,215c234,237', '< Here are pseudo codes explaining the core aspects of the method:  for k = 1 to K do 10:', '< ϕ k = H θ (z k ) {Generate expert weights ϕ k from hypernetwork}', '< 11:', '< E k = E ϕ (f ) {Obtain expert predictions using ϕ k } 12:', '< L k = L(E k , y, extra_info) {Compute expert losses} 13:', '< end for 14:', '< L div = DiversityLoss(E) {Encourage expert diversity} 15:', '< L total = k L k + λL div 16: θ = θ -η∇ θ L total 17:', '< end for 18: end for=0 Total loss for updating hypernetwork weights', '---', '> Here are pseudo codes explaining the core aspects of our method:', '> Algorithm 1: Diverse Expert Learning with Hypernetworks', '> Input: Training data with long-tailed distribution D_train', '> Output: Ensemble of expert models E = {E_1, E_2, ..., E_K}', '216a239,257', '> 1: Initialize shared feature extractor F', '> 2: Initialize hypernetwork H_θ with weights θ', '> 3: Initialize expert loss function L (e.g., 
DiverseExpertLoss)', '> ', '> 4: for epoch = 1 to max_epochs do', '> 5:   for each batch B ⊆ D_train do', '> 6:     f = F(x) {Shared feature extraction}', '> 7:     for k = 1 to K do', '> 8:       z_k ~ Dir(α) {Sample input for hypernetwork}', '> 9:       ϕ_k = H_θ(z_k) {Generate expert weights ϕ_k from hypernetwork}', '> 10:      E_k = G_ϕ_k(f) {Obtain expert predictions using ϕ_k, where G is the classifier head}', '> 11:      L_k = L(E_k, y, extra_info) {Compute expert losses}', '> 12:    end for', '> 13:    L_div = DiversityLoss(E) {Encourage expert diversity, e.g., using SCE regularization}', '> 14:    L_total = Σ_k L_k + λ * L_div {Total loss for updating feature extractor and hypernetwork weights}', '> 15:    θ = θ - η∇_θ L_total {Update feature extractor and hypernetwork weights}', '> 16:  end for', '> 17: end for', '> ', '218,220c259,263', "< On the representative Places-LT dataset, our PRL method achieves the best Top-1 accuracy under various unknown test class distributions. Specifically, in the Forward-LT setting, as the proportion of unknown classes decreases from 50% to 2%, the Top-1 accuracy of PRL drops from 47.9% to 42.8%, but still significantly outperforms other baseline methods. Under the Uniform distribution, PRL reaches the highest accuracy of 41.9%. In the Backward-LT setting, PRL's accuracy gradually increases from 41.7% to 44.1%, again surpassing all counterpart methods. These results thoroughly validate the outstanding performance and robustness of our method in handling various unknown class distributions.", "< On the iNaturalist 2018 dataset, PRL also exhibits excellent performance. In the Forward-LT setting, when the proportion of unknown classes decreases from 3 to 2, PRL's Top-1 accuracy slightly increases from 73.7% to 73.8%, and reaches the best performance of 74.3% under the Uniform distribution. In the Backward-LT setting, although PRL's accuracy slightly decreases from 74.0% to 73.9%, it still outperforms all comparison methods. 
These results further confirm the broad effectiveness of our method across different datasets and scenarios.", '< Overall, by successfully tackling the challenges of long-tailed distributions and unknown class distributions, the PRL method demonstrates superior performance on two representative long-tailed datasets, thereby validating the superiority of our method.  9: Model size and computational cost with and without hypernetworks.', '---', "> On the representative Places-LT dataset, our PRL method achieves the best Top-1 accuracy under various unknown test class distributions. Specifically, in the Forward-LT setting, as the proportion of unknown classes decreases from 50% to 2%, the Top-1 accuracy of PRL drops from 47.9% to 42.8%, but still significantly outperforms other baseline methods. Under the Uniform distribution, PRL reaches the highest accuracy of 41.9%. In the Backward-LT setting, PRL's accuracy gradually increases from 41.7% to 44.1%, again surpassing all counterpart methods. These consistent results thoroughly validate the outstanding performance and robustness of our method in handling diverse unknown class distributions.", "> On the iNaturalist 2018 dataset, PRL also exhibits excellent performance. In the Forward-LT setting, when the proportion of unknown classes decreases from 3 to 2, PRL's Top-1 accuracy slightly increases from 73.7% to 73.8%, and reaches the best performance of 74.3% under the Uniform distribution. In the Backward-LT setting, although PRL's accuracy slightly decreases from 74.0% to 73.9%, it consistently outperforms all comparison methods. 
These results further confirm the broad effectiveness and generalization capability of our method across different datasets and scenarios.", '> Overall, by successfully tackling the challenges of long-tailed distributions and unknown class distributions, the PRL method consistently demonstrates superior performance on these representative long-tailed datasets, thereby validating the robustness and effectiveness of our approach.', '> ', '> Section: G Complexity Analysis', '227c270', '< While the proposed novel approach of using a hypernetwork to generate multiple diverse expert models shows great potential in enabling controllable adjustment of head and tail class weights for long-tailed datasets, as well as improving robustness to distribution shifts, the introduction of the hypernetwork also brings new challenges to model training and convergence. As an additional neural network module, the hypernetwork needs to generate the weight parameters for the classifier heads of each expert, thereby significantly increasing the total number of trainable parameters in the model, which may affect training stability. We analyze this issue in Section G, nevertheless, further research into more efficient training and stable controllability is still necessary.', '---', '> While our proposed novel approach, utilizing a hypernetwork to generate multiple diverse expert models, demonstrates significant potential in enabling controllable adjustment of head and tail class weights for long-tailed datasets and improving robustness to distribution shifts, its introduction also presents new challenges for model training and convergence. As an additional neural network module, the hypernetwork generates weight parameters for the classifier heads of each expert, which can significantly increase the total number of trainable parameters. This increase may affect training stability and convergence speed. 
We provide an analysis of its computational overhead in Section G, but further research into more efficient training strategies and ensuring stable controllability under various complex scenarios is still necessary.', '230c273', '< The proposed novel approach of generating multiple expert models via hypernetworks enables dynamic adjustment of head and tail class weights for long-tailed datasets, and improves model robustness to distribution shifts. This flexibility and robustness are of significant value in many practical applications. The present work provides a new viable solution to the important challenges of long-tailed distributions and distribution shifts, holding promise to enhance the generalization capabilities and practical applicability of existing models, thereby contributing to the technological advancement in relevant fields.', '---', '> Our proposed novel approach, which generates multiple expert models via hypernetworks, enables dynamic adjustment of head and tail class weights for long-tailed datasets and significantly improves model robustness to distribution shifts. This enhanced flexibility and robustness are of considerable value across a multitude of practical applications, particularly in fields where data imbalance and dynamic test conditions are prevalent (e.g., medical diagnosis, ecological monitoring, fraud detection). This work provides a new and viable solution to the critical challenges of long-tailed distributions and distribution shifts. 
By enhancing the generalization capabilities and practical applicability of existing models, our research holds promise to contribute positively to technological advancements in relevant fields, fostering more equitable and effective AI systems.', '486c529', '< Formula formula_0: R test (f ) = R m (f ) + K i=1 (π test i -π m i ) • E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)](1)', '---', '> Formula formula_0: R_{test} (f ) = R_m (f ) + \\sum_{i=1}^K (\\pi^{test}_i - \\pi^{m}_i) • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] (1)', '488c531', '< Formula formula_1: δ(E i , E j ) = 1 2 K k=1 |π i k -π j k |', '---', '> Formula formula_1: δ(E_i , E_j ) = \\frac{1}{2} \\sum_{k=1}^K |\\pi^{i}_k - \\pi^{j}_k|', '490c533,535', '< Formula formula_2: R test (f ) ≤ R m (f ) + 2M • (δ(E m , E test ) + ∆(E 1 , . . . , E M ))(2)', '---', '> Formula formula_2: R_{test} (f ) ≤ R_m (f ) + 2M • (δ(E_m , E_{test}) + ∆(E_1 , \\ldots , E_M )) (2)', '> ', '> Formula formula_3: R_m ( f ) = \\frac{1}{N} \\sum_{i=1}^N R_m (f_i) (3)', '492c537', '< Formula formula_3: Rm ( f ) = 1 N N i=1 R m (f i )(3)', '---', '> Formula formula_4: R_{test} ( f ) ≤ \\frac{1}{N} \\sum_{m=1}^N R_m (f_m) + 2M • \\frac{1}{N} \\sum_{m=1}^N δ(E_m , E_{test}) + \\frac{N - 1}{N} ∆(E_1 , \\ldots , E_N) (4)', '494c539', '< Formula formula_4: R test ( f ) ≤ 1 N N m=1 R m (f m ) + 2M • 1 N N m=1 δ(E m , E test ) + N -1 N ∆(E 1 , . . . 
, E N )(4)', '---', '> Formula formula_5: M = \\max_{i,x} ℓ( f (x), i).', '496,498c541,543', '< Formula formula_5: M = max i,x ℓ( f (x), i).', '< ', '< Formula formula_6: ∆ M := α ∈ R M + | M i=1 α i = 1(5)', '---', '> Formula formula_6: ∆_M := \\{\\alpha ∈ R^M_+ | \\sum_{i=1}^M \\alpha_i = 1\\} (5)', '> ', '> Formula formula_7: P_α (x, y) := \\sum_{k=1}^K \\alpha_k • P k (x | y) • P k (y) (6)', '500c545', '< Formula formula_7: P α (x, y) := K k=1 α k • P k (x | y) • P k (y)(6)', '---', '> Formula formula_8: \\min_F \\{R_{Pα_1} (F), \\ldots , R_{Pα_M} (F)\\} (7)', '502c547', '< Formula formula_8: min F R Pα 1 (F), . . . , R Pα M (F)(7)', '---', '> Formula formula_9: R_{Pα} (F) := E_{(x,y)∼Pα} [\\frac{1}{M} \\sum_{i=1}^M ℓ(f^{(i)} (x), y)]', '504c549', '< Formula formula_9: R Pα (F) := E (x,y)∼Pα 1 M M i=1 ℓ(f (i) (x), y)', '---', '> Formula formula_10: f_i (x) = g_{w_i} (ϕ_θ (x)), \\text{ for } i = 1, \\ldots, T (8)', '506c551', '< Formula formula_10: f i (x) = g wi (ϕ θ (x)), i = 1, . . . , T(8)', '---', '> Formula formula_11: w_i = h_ψ (z_i), \\text{ where } z_i \\sim \\text{Dir}(\\alpha), \\text{ for } i = 1, \\ldots, T (9)', '508c553', '< Formula formula_11: w i = h ψ (z i ), z i ∼ Dir(α), i = 1, . . . 
, T(9)', '---', '> Formula formula_12: L = \\sum_{i=1}^T L_i (f_i) (10)', '510c555', '< Formula formula_12: L = T i=1 L i (f i )(10)', '---', '> Formula formula_13: \\min_Θ \\sum_{i=1}^T L_i (Θ, D) (11)', '512c557', '< Formula formula_13: min Θ T i=1 L i (Θ, D)(11)', '---', '> Formula formula_14: \\min_Θ \\max_{p∈∆_T} \\sum_{i=1}^T p_i L_i (Θ, D) (12)', '514c559', '< Formula formula_14: min Θ max p∈∆ T T i=1 p i L i (Θ, D)(12)', '---', '> Formula formula_15: ∆_T := \\{p ∈ R^T_+ | \\sum_{i=1}^T p_i = 1\\} \\text{ is the T-dimensional simplex.}', '516c561', '< Formula formula_15: ∆ T := {p ∈ R T + | T i=1 p i = 1} is the T -dimensional simplex.', '---', '> Formula formula_16: \\min_Θ [\\sum_{i=1}^T L_i (Θ, D) + λ • \\log (\\sum_{i=1}^T \\exp(\\frac{1}{λ} L_i (Θ, D)))] (13)', '518c563', '< Formula formula_16: min Θ T i=1 L i (Θ, D) + λ • log T i=1 exp 1 λ L i (Θ, D)(13)', '---', '> Formula formula_17: \\alpha^* = (\\alpha^*_1 , \\alpha^*_2 , \\alpha^*_3 )^⊤ ∈ ∆_3', '520c565', '< Formula formula_17: α * = (α * 1 , α * 2 , α * 3 ) ⊤ ∈ ∆ 3', '---', "> Formula formula_18: r' = r ⊙ \\alpha^* / (r^⊤ \\alpha^*) (14)", '522c567', '< Formula formula_18: r = r ⊙ α * r ⊤ α * (14', '---', '> Formula formula_19: # Removed redundant line', '524c569', '< Formula formula_19: )', '---', "> Formula formula_20: \\hat{W}_i = h_ψ (r'), \\text{ for } i = 1, \\ldots, T (15)", '526,528c571', '< Formula formula_20: Ŵi = h ψ (r), i = 1, • • • , T(15)', '< ', '< Formula formula_21: ŷi = \uf8f1 \uf8f4 \uf8f2 \uf8f4 \uf8f3 x ⊤ ∥x∥ 2 • Ŵi ∥ Ŵi ∥ F , if normalized x ⊤ Ŵi + b⊤ i , otherwise(16)', '---', '> Formula formula_21: \\hat{y}_i = \\begin{cases} x^⊤ / \\|x\\|_2 • \\hat{W}_i / \\| \\hat{W}_i \\|_F , & \\text{if normalized feature} \\\\ x^⊤ \\hat{W}_i + b^⊤_i , & \\text{otherwise} \\end{cases} (16)', '530c573', '< Formula formula_22: min x∈X {f 1 (x), . . . 
, f m (x)}(17)', '---', '> Formula formula_22: \\min_{x∈X} \\{f_1 (x), \\ldots , f_m (x)\\} (17)', '532c575', '< Formula formula_23: min x∈X max 1≤i≤m w i (f i (x) -z * i )(18)', '---', '> Formula formula_23: \\min_{x∈X} \\max_{1≤i≤m} w_i (f_i (x) - z^*_i) (18)', '534c577', '< Formula formula_24: min θ∈Θ {L 1 (f θ ), . . . , L m (f θ )}(19)', '---', '> Formula formula_24: \\min_θ \\{L_1 (f_θ ), \\ldots , L_m (f_θ )\\} (19)', '536c579', '< Formula formula_25: min ϕ E w∼p(w) max 1≤i≤m w i L i (f h ϕ (z) ) -z * i (20', '---', '> Formula formula_25: \\min_ϕ E_{w∼p(w)} [\\max_{1≤i≤m} w_i (L_i (f_{h_ϕ (z)}) - z^*_i)] (20)', '538c581', '< Formula formula_26: )', '---', '> Formula formula_26: # Removed redundant line', '540c583', '< Formula formula_27: R test (f ) = E (x,y)∼Ptest(x,y) [ℓ(f (x; θ), y)](21)', '---', '> Formula formula_27: R_{test} (f ) = E_{(x,y)∼Ptest(x,y)} [ℓ(f (x; θ), y)] (21)', '542c585', '< Formula formula_28: R test (f ) = K i=1 π test i • E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)](22)', '---', '> Formula formula_28: R_{test} (f ) = \\sum_{i=1}^K \\pi_{test_i} • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] (22)', '544c587', '< Formula formula_29: R m (f ) = K i=1 π m i • E x∼Pm(x|y=i) [ℓ(f (x; θ), i)](23)', '---', '> Formula formula_29: R_m (f ) = \\sum_{i=1}^K \\pi_{m_i} • E_{x∼Pm(x|y=i)} [ℓ(f (x; θ), i)] (23)', '546c589', '< Formula formula_30: R test (f ) = K i=1 π test i • E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)](24)', '---', '> Formula formula_30: R_{test} (f ) = \\sum_{i=1}^K \\pi_{test_i} • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] (24)', '548c591', '< Formula formula_31: = K i=1 π m i • E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)] + K i=1 (π test i -π m i ) • E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)] = R m (f ) + K i=1 (π test i -π m i ) • E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)]', '---', '> Formula formula_31: = \\sum_{i=1}^K \\pi_{m_i} • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] + \\sum_{i=1}^K (\\pi_{test_i} -\\pi_{m_i}) • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] = R_m (f ) + \\sum_{i=1}^K (\\pi_{test_i} 
-\\pi_{m_i}) • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)]', '550c593', '< Formula formula_32: R test (f ) = R m (f ) + K i=1 (π test i -π m i ) • E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)](25)', '---', '> Formula formula_32: R_{test} (f ) = R_m (f ) + \\sum_{i=1}^K (\\pi_{test_i} -\\pi_{m_i}) • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] (25)', '552c595', '< Formula formula_33: E x∼Ptest(x|y=i) [ℓ(f (x; θ), i)] ≤ M (26) Therefore, R test (f ) ≤ R m (f ) + K i=1 (π test i -π m i ) • M = R m (f ) + M • K i=1 |π test i -π m i | = R m (f ) + 2M • δ(E m , E test )', '---', '> Formula formula_33: E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] ≤ M (26) Therefore, R_{test} (f ) ≤ R_m (f ) + \\sum_{i=1}^K |\\pi_{test_i} -\\pi_{m_i}| • M = R_m (f ) + M • \\sum_{i=1}^K |\\pi_{test_i} -\\pi_{m_i}| = R_m (f ) + 2M • δ(E_m , E_{test})', '554c597', '< Formula formula_34: 1 2 K i=1 |π m i -π test i |. Next, we use the triangle inequality to bound δ(E m , E test ): δ(E m , E test ) ≤ δ(E m , E j ) + δ(E j , E test ) ≤ max i,j∈{1,...,M } δ(E i , E j ) + δ(E j , E test ) ≤ ∆(E 1 , . . . , E M ) + δ(E j , E test )', '---', '> Formula formula_34: \\frac{1}{2} \\sum_{i=1}^K |\\pi_{m_i} - \\pi_{test_i}|. Next, we use the triangle inequality to bound δ(E_m , E_{test}): δ(E_m , E_{test}) ≤ δ(E_m , E_j ) + δ(E_j , E_{test}) ≤ \\max_{i,j∈\\{1,...,M \\}} δ(E_i , E_j ) + δ(E_j , E_{test}) ≤ ∆(E_1 , \\ldots , E_M ) + δ(E_j , E_{test})', '556c599', '< Formula formula_35: δ(E m , E test ) ≤ ∆(E 1 , . . . , E M ) + 1 M M j=1 δ(E j , E test )(27)', '---', '> Formula formula_35: δ(E_m , E_{test}) ≤ ∆(E_1 , \\ldots , E_M ) + (1/M) * \\sum_{j=1}^M δ(E_j , E_{test}) (27)', '558c601', '< Formula formula_36: R test (f ) ≤ R m (f ) + 2M • \uf8eb \uf8ed ∆(E 1 , . . . 
, E M ) + 1 M M j=1 δ(E j , E test ) \uf8f6 \uf8f8(28)', '---', '> Formula formula_36: R_{test} (f ) ≤ R_m (f ) + 2M • (∆(E_1 , \\ldots , E_M ) + (1/M) * \\sum_{j=1}^M δ(E_j , E_{test})) (28)', '560c603', '< Formula formula_37: R test ( f ) = E (x,y)∼Ptest [ℓ( f (x), y)] = 1 N N i=1 E (x,y)∼Ptest [ℓ(f i (x), y)] = 1 N N i=1 R test (f i )', '---', '> Formula formula_37: R_{test} ( f ) = E_{(x,y)∼Ptest} [ℓ( f (x), y)] = (1/N) * \\sum_{i=1}^N E_{(x,y)∼Ptest} [ℓ(f_i (x), y)] = (1/N) * \\sum_{i=1}^N R_{test} (f_i)', '562c605', '< Formula formula_38: R test ( f ) ≤ 1 N N i=1 R m(i) (f i ) + 2M • (δ(E m(i) , E test ) + ∆(E 1 , . . . , E N )) = 1 N N i=1 R m(i) (f i ) + 2M N N i=1 δ(E m(i) , E test ) + 2M ∆(E 1 , . . . , E N )', '---', '> Formula formula_38: R_{test} ( f ) ≤ (1/N) * \\sum_{i=1}^N [R_{m(i)} (f_i) + 2M • (δ(E_{m(i)} , E_{test}) + ∆(E_1 , \\ldots , E_N))] = (1/N) * \\sum_{i=1}^N R_{m(i)} (f_i) + (2M/N) * \\sum_{i=1}^N δ(E_{m(i)} , E_{test}) + 2M * ∆(E_1 , \\ldots , E_N)', '564c607', '< Formula formula_39: 1 N N i=1 δ(E m(i) , E test ) = 1 N N m=1 i:m(i)=m δ(E m , E test ) ≤ 1 N N m=1 N m • δ(E m , E test ) ≤ 1 N N m=1 N • δ(E m , E test ) = N m=1 δ(E m , E test )', '---', '> Formula formula_39: (1/N) * \\sum_{i=1}^N δ(E_{m(i)} , E_{test}) = (1/N) * \\sum_{m=1}^N (\\sum_{i:m(i)=m} 1) * δ(E_m , E_{test}) ≤ (1/N) * \\sum_{m=1}^N N_m • δ(E_m , E_{test})', '566c609', '< Formula formula_40: R test ( f ) ≤ 1 N N m=1 R m (f m ) + 2M • 1 N N m=1 δ(E m , E test ) + ∆(E 1 , . . . , E N ) = 1 N N m=1 R m (f m ) + 2M • 1 N N m=1 δ(E m , E test ) + N -1 N ∆(E 1 , . . . 
, E N )', '---', '> Formula formula_40: R_{test} ( f ) ≤ (1/N) * \\sum_{m=1}^N R_m (f_m) + 2M • (1/N) * \\sum_{m=1}^N δ(E_m , E_{test}) + ((N-1)/N) * ∆(E_1 , \\ldots , E_N)', '568c611', '< Formula formula_41: ϕ k = H θ (z k ) {Generate expert weights ϕ k from hypernetwork}', '---', '> Formula formula_41: ϕ_k = H_θ (z_k) \\text{ \\{Generate expert weights ϕ_k from hypernetwork\\}}', '570c613', '< Formula formula_42: E k = E ϕ (f ) {Obtain expert predictions using ϕ k } 12:', '---', '> Formula formula_42: E_k = G_{ϕ_k} (f) \\text{ \\{Obtain expert predictions using ϕ_k, where G is the classifier head\\}}', '572c615', '< Formula formula_43: L total = k L k + λL div 16: θ = θ -η∇ θ L total 17:', '---', '> Formula formula_43: L_{total} = \\sum_k L_k + λ * L_{div}', '575d617', '< ']
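The label-distribution distance δ of formula_1 and the bound of formula_2 can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes each expert's label prior is given as a plain list of class probabilities, that ∆(E_1, ..., E_M) is the largest pairwise δ among the experts (as the triangle-inequality step in formula_34 suggests), and that the M in "2M" is the loss upper bound of formula_5. The function names (`label_distance`, `expert_diameter`, `risk_upper_bound`) and the example priors are hypothetical.

```python
from itertools import combinations


def label_distance(pi_a, pi_b):
    """delta(E_a, E_b) = 1/2 * sum_k |pi_a[k] - pi_b[k]|, as in formula_1."""
    if len(pi_a) != len(pi_b):
        raise ValueError("label priors must cover the same classes")
    return 0.5 * sum(abs(a - b) for a, b in zip(pi_a, pi_b))


def expert_diameter(priors):
    """Assumed Delta(E_1, ..., E_M): the largest pairwise delta among experts."""
    return max(
        (label_distance(p, q) for p, q in combinations(priors, 2)),
        default=0.0,
    )


def risk_upper_bound(risk_m, loss_max, pi_m, pi_test, priors):
    """Right-hand side of formula_2: R_m + 2M * (delta(E_m, E_test) + Delta).

    loss_max plays the role of M = max loss (formula_5), not the expert count.
    """
    shift = label_distance(pi_m, pi_test)
    return risk_m + 2.0 * loss_max * (shift + expert_diameter(priors))


# Hypothetical example: three experts over K = 3 classes (head, medium, tail).
priors = [
    [0.7, 0.2, 0.1],        # head-heavy expert
    [1 / 3, 1 / 3, 1 / 3],  # uniform expert
    [0.1, 0.2, 0.7],        # tail-heavy expert
]
pi_test = [0.2, 0.3, 0.5]   # assumed test label distribution

for m, pi_m in enumerate(priors, start=1):
    bound = risk_upper_bound(0.3, 1.0, pi_m, pi_test, priors)
    print(f"expert {m}: delta to test = {label_distance(pi_m, pi_test):.3f}, "
          f"bound = {bound:.3f}")
```

With a loss bounded by 1, the right-hand side can exceed 1 and is then vacuous as an absolute guarantee. The sketch is mainly useful for comparing experts: the one whose prior is closest to the test distribution gets the tightest bound.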
