Title: Breaking Long-Tailed Learning Bottlenecks: A Controllable Paradigm with Hypernetwork-Generated Diverse Experts

Abstract: Traditional long-tailed learning methods often struggle with distribution shifts between training and test data, and lack flexible adaptation to user preferences for head and tail class trade-offs. To address this, we propose a novel long-tailed learning paradigm that leverages hypernetworks to generate a diverse set of expert models. This approach enables the model ensemble to robustly adapt to various test distributions while offering controllable adjustments according to user-defined preferences. Unlike prior methods that yield fixed trade-offs, our paradigm allows for dynamic, interpretable control over the balance between head and tail class performance. Extensive experiments demonstrate that our method not only achieves superior performance but also effectively overcomes distribution shifts and provides unprecedented controllable adjustments based on user preferences. This work offers new insights and a flexible paradigm for long-tailed learning, significantly expanding its practical applicability. The code can be found here: https://github.com/DataLab-atom/PRL. * Pengkun Wang and Yang Wang are corresponding authors. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Section: Introduction
In many real-world tasks such as object detection and image classification, we face the challenge of long-tailed distributions. Head-class samples account for the vast majority of the data, while tail-class samples are extremely scarce [21,16,6,17]. This extreme imbalance biases the model towards the head classes during training, resulting in poor performance on the tail classes [34,8,30,20,39].
To address the long-tailed distribution problem, existing research has proposed a series of methods such as re-sampling [25,5,24,10] and modifying the loss function [17,6], with the common idea of focusing on improving the performance of the tail classes. However, these methods typically assume that the distributions of the training and test data remain invariant, and thus cannot well handle the common situations of distribution shift between training and testing in real-world scenarios.

Some more recent works, such as RIDE [32] and LADE [12], propose using multiple expert models to obtain stronger distribution adaptability. Building on this, SADE [38] further adaptively combines the outputs of these experts during testing to adapt to the current test distribution. While these approaches alleviate the problem of distribution mismatch between training and testing to some extent [23], they primarily aim to maximize overall performance, pursuing a fixed optimal performance metric across all classes [13,40]. This fixed trade-off often fails to meet the diverse needs and preferences of users in different application scenarios [17,35,43].

For example, in classifying lung CT images, when screening for difficult cases, a user might prioritize covering all possible disease types (i.e., tail classes) to avoid missed diagnoses, even if it moderately increases the false positive rate for common conditions. Conversely, in routine physical examinations, the focus might be on high accuracy for head classes. Another example is wildlife detection: within nature reserves, accurately detecting common species (head classes) is crucial for population monitoring, but when searching for rare species (tail classes), maximizing coverage of all species becomes paramount, potentially at the cost of some false detections. These scenarios highlight significant differences in user preferences for weighting head and tail categories, which current long-tailed learning methods often fail to fully satisfy.

Therefore, developing an interpretable and controllable method for handling long-tail distributions that adapts to specific user preferences for head and tail categories becomes a critical new research direction. In light of this, we propose an interpretable and controllable long-tail learning method (PRL). This method aims not only to overcome potential distribution shifts from a single training distribution to any testing distribution but, more importantly, to flexibly adjust the weights of head and tail categories according to actual user demands.

To address these challenges, we introduce a new long-tailed learning paradigm, PRL, based on diverse experts and hypernetworks, as illustrated in Figure 1. For the first challenge, existing multi-expert model-based methods train fixed expert models for specific distributions, requiring strong distribution assumptions and struggling to handle more complex and variable distributions. Instead of maximizing the performance of each expert individually, we pursue modeling and optimizing the hypervolume over the entire Pareto front curve, learning a set of solutions that cover all possible distribution scenarios. This approach requires us to sample with the goal of covering the entire Pareto front during optimization. For the second challenge, unlike LADE and SADE which output a fixed trade-off solution under distribution shift, PRL can flexibly output a dedicated model solution that matches the user's preference in any test distribution scenario. In this way, our method not only adapts to changes in the test distribution but also allows controllable adjustment of the head-tail trade-off according to the user's actual needs.

Our contributions can be summarized as follows:
•   **Novel Problem Formulation and Insight:** We are the first to propose a controllable trade-off mechanism based on user preferences in the context of long-tailed learning with test distribution shifts, significantly expanding the applicability of LTL in real-world scenarios.
•   **New Learning Paradigm:** We introduce PRL, an interpretable and controllable long-tailed learning method that leverages hypernetworks to acquire the ability to overcome test distribution shifts from a single training dataset and satisfy diverse user preferences in any shifted distribution scenario.
•   **Compelling Empirical Results:** Extensive experiments demonstrate that our method achieves higher performance ceilings, effectively overcomes test distribution shifts, and provides fine-grained control over head-tail class trade-offs according to user preferences across various benchmark datasets.

Section: Related Work
Long-tailed distributions are prevalent in real-world data, leading to imbalanced datasets that pose significant challenges for machine learning models [30,20]. To address this issue, researchers have proposed various methods, broadly categorized into re-sampling, loss function modification, and multi-expert models.
Re-sampling methods aim to balance class distributions by oversampling tail classes [24,3] or undersampling head classes [7]. While effective in mitigating imbalance, oversampling can lead to overfitting, and undersampling may discard valuable information.
Loss function modification approaches assign higher weights to tail class losses [27,26] or use meta-learning to alleviate undersampling issues [14,32]. These methods directly influence the model's learning focus but often require careful hyperparameter tuning.
Multi-expert models train multiple specialized experts on different class distributions and combine their outputs, adapting to various test distributions [37,38,31]. These methods show promise in handling distribution shifts.
However, most existing methods share common limitations: they often assume specific distributions during training or testing, which limits their real-world applicability in the face of dynamic distribution shifts. Crucially, they cannot accommodate varying user needs for head and tail class trade-offs, providing only a fixed, "one-size-fits-all" solution. Our proposed approach directly tackles these limitations by overcoming rigid distribution assumptions and achieving interpretable, controllable trade-offs in long-tailed learning.

Section: Theory
In this section, we provide a theoretical analysis of the distribution shift problem and introduce the concept of Environment Total Variation Distance (ETVD), which forms the theoretical foundation for our proposed methodology. Traditional Empirical Risk Minimization (ERM) methods, when trained on a single distribution, often struggle to generalize effectively under distribution shifts, a limitation formally characterized by the following theorem:

Theorem 1 (Limitation of ERM). Let $f(x; \theta)$ be a classifier learned via ERM on environment $E_m$. Its risk on a test environment $E_{test}$ is given by:
$$R_{test}(f) = R_m(f) + \sum_{i=1}^{K} (\pi_{test_i} - \pi_{m_i}) \cdot \mathbb{E}_{x \sim P_{test}(x \mid y=i)} [\ell(f(x; \theta), i)] \tag{1}$$
where $R_m(f)$ and $R_{test}(f)$ are the risks of $f$ on $E_m$ and $E_{test}$, respectively, and $\pi_m$ and $\pi_{test}$ are the class priors of the training and test environments, with $\pi_{m_i}$ and $\pi_{test_i}$ denoting the prior probability of class $i$.

To quantify the distribution discrepancy across different environments, we introduce the Environment Total Variation Distance (ETVD):
Definition 2 (Environment Total Variation Distance). The total variation distance between environments $E_i$ and $E_j$ is defined as:
$$\delta(E_i, E_j) = \frac{1}{2} \sum_{k=1}^{K} |\pi_{i_k} - \pi_{j_k}|,$$
and the ETVD of $M$ environments is defined as:
$$\Delta(E_1, \ldots, E_M) = \max_{i,j \in \{1, \ldots, M\}} \delta(E_i, E_j).$$
Using ETVD, we can further bound the risk of the ERM-learned classifier on the test environment:

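Corollary 1. Under the assumptions of Theorem 1, let $M = \max_{i,x} \ell(f(x; \theta), i)$. Then,
$$R_{test}(f) \le R_m(f) + 2M \cdot \left( \delta(E_m, E_{test}) + \Delta(E_1, \ldots, E_M) \right) \tag{2}$$
Note that the multiplier $M$ denotes the maximum loss, whereas the subscript in $E_M$ indexes the last training environment.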
This corollary demonstrates that the test risk of an ERM-learned classifier is influenced not only by the direct distribution discrepancy between its training environment and the test environment but also by the overall distribution discrepancy among all available training environments (i.e., ETVD). To mitigate this, our approach involves minimizing empirical risks across multiple training environments to capture diverse distributional characteristics, thereby learning a set of diverse experts.
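
To make Definition 2 and Corollary 1 concrete, the following minimal sketch computes the pairwise total variation distance, the ETVD, and the right-hand side of Eq. (2) from class priors. The function names and the three-class priors are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def tv_distance(prior_a, prior_b):
    """Total variation distance between two class priors (Definition 2)."""
    assert len(prior_a) == len(prior_b)
    return 0.5 * sum(abs(a - b) for a, b in zip(prior_a, prior_b))

def etvd(priors):
    """ETVD: the largest pairwise total variation distance among environments."""
    return max((tv_distance(p, q) for p, q in combinations(priors, 2)), default=0.0)

def corollary1_bound(train_risk, loss_max, prior_train, prior_test, priors):
    """Right-hand side of Eq. (2); loss_max plays the role of the multiplier M."""
    return train_risk + 2.0 * loss_max * (
        tv_distance(prior_train, prior_test) + etvd(priors)
    )

# Hypothetical three-class priors: head-heavy, uniform, and tail-heavy.
forward = [0.7, 0.2, 0.1]
uniform = [1 / 3, 1 / 3, 1 / 3]
backward = [0.1, 0.2, 0.7]
environments = [forward, uniform, backward]

print(tv_distance(forward, backward))   # distance between the two extreme priors
print(etvd(environments))               # worst-case pairwise distance
print(corollary1_bound(0.2, 1.0, forward, uniform, environments))
```

In this toy setting the ETVD is attained by the forward and backward priors, so the bound is dominated by how far apart the most dissimilar training environments are.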

Next, we provide a theoretical analysis of the generalization performance of our diversity-aware ensemble algorithm. We first introduce the following notations:
Let $\{f_1 , \ldots , f_N \}$ be the N experts learned via ERM on the N training environments $\{E_1 , \ldots , E_N \}$, respectively, and f be the final classifier obtained by ensembling these N experts. Define the empirical risk of the ensemble classifier f on environment E_m as:
$$R_m(f) = \frac{1}{N} \sum_{i=1}^{N} R_m(f_i) \tag{3}$$
We can then state the following theorem regarding the generalization performance of the ensemble classifier:

Theorem 2. Under the above notations and definitions, the risk of the ensemble classifier $f$ on the test environment $E_{test}$ satisfies:
$$R_{test}(f) \le \frac{1}{N} \sum_{m=1}^{N} R_m(f_m) + 2M \cdot \frac{1}{N} \sum_{m=1}^{N} \delta(E_m, E_{test}) + \frac{N-1}{N} \Delta(E_1, \ldots, E_N) \tag{4}$$
where $M = \max_{i,x} \ell(f(x), i)$.
Theorem 2 reveals that the test risk of the ensemble classifier is composed of three key components: the average empirical risk of all experts, the average total variation distance between the training environments and the test environment, and the ETVD among the training environments scaled by $(N-1)/N$. Compared to single-environment ERM, the diverse-experts method learns a set of experts that collectively capture the distributional characteristics of various environments. This strategy effectively reduces the distribution discrepancy between the training and test environments, leading to superior generalization performance.

Section: Methodology


Section: Problem Formulation
Consider a $K$-class classification problem with a training set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where each class $k$ has $N_k$ samples. Let $P_{train}$ denote the empirical distribution over $D$. The goal is to learn a classifier $f: \mathcal{X} \to \mathbb{R}^K$ that generalizes well across various test distributions $P_{test}$. Traditional empirical risk minimization (ERM) methods optimize the loss under $P_{train}$, but may fail to adapt to changes in $P_{test}$, especially in long-tailed scenarios.
To improve the robustness of $f$, we optimize the losses under multiple importance-weighted distributions. Define an $M$-dimensional simplex:
$$\Delta^M := \left\{ \alpha \in \mathbb{R}^M_+ \;\middle|\; \sum_{i=1}^{M} \alpha_i = 1 \right\} \tag{5}$$
Each $\alpha \in \Delta^M$ corresponds to an importance-weighted distribution $P_\alpha$:
$$P_\alpha(x, y) := \sum_{k=1}^{K} \alpha_k \cdot P_k(x \mid y) \cdot P_k(y) \tag{6}$$
where $P_k(x \mid y)$ and $P_k(y) = \frac{N_k}{N}$ are the conditional distribution and prior for class $k$, respectively. The objective is to learn a set of classifiers $\mathcal{F} := \{f^{(i)}\}_{i=1}^{M}$ that achieve low risk simultaneously across all $P_\alpha$, forming the Pareto optimal solution:
$$\min_{\mathcal{F}} \left( R_{P_{\alpha_1}}(\mathcal{F}), \ldots, R_{P_{\alpha_M}}(\mathcal{F}) \right) \tag{7}$$
where
$$R_{P_\alpha}(\mathcal{F}) := \mathbb{E}_{(x,y) \sim P_\alpha} \left[ \frac{1}{M} \sum_{i=1}^{M} \ell(f^{(i)}(x), y) \right].$$
Pursuing an approximate Pareto solution across all distributions leads to models with stronger generalization capabilities.
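
As a rough illustration of the importance-weighted distributions in Eq. (6), the sketch below re-weights the empirical class priors $N_k / N$ by a weight vector and evaluates the resulting risk from per-class mean losses. The renormalization step, the function names, and the toy numbers are assumptions made for this example and are not taken from the method itself.

```python
def weighted_class_prior(alpha, class_counts):
    """Class prior induced by weighting each class mass N_k / N with alpha_k.

    Assumption: the weighted masses are renormalized to sum to one.
    """
    total = sum(class_counts)
    mass = [a * n / total for a, n in zip(alpha, class_counts)]
    norm = sum(mass)
    return [m / norm for m in mass]

def weighted_risk(alpha, class_counts, per_class_mean_loss):
    """Risk under the weighted distribution, given the mean loss of each class."""
    prior = weighted_class_prior(alpha, class_counts)
    return sum(p * loss for p, loss in zip(prior, per_class_mean_loss))

# Hypothetical two-class problem: a head class (100 samples) and a tail class (10).
counts = [100, 10]
mean_loss = [0.2, 1.0]  # the tail class is harder for the current model

print(weighted_risk([0.5, 0.5], counts, mean_loss))  # matches the training prior
print(weighted_risk([0.1, 0.9], counts, mean_loss))  # emphasizes the tail class
```

Shifting weight towards the tail class raises the measured risk here, which is why a single classifier tuned for one weighting is generally not optimal for another.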

Section: Diverse Experts via Hypernetworks
We introduce $T = 3$ classifiers $\{f_i\}_{i=1}^{T}$ as diverse experts, designed to capture different aspects of the long-tailed distribution. These experts share a common feature extractor $\phi_\theta: \mathcal{X} \to \mathbb{R}^d$, but each utilizes a distinct classifier head $g_{w_i}$:
$$f_i(x) = g_{w_i}(\phi_\theta(x)), \quad \text{for } i = 1, \ldots, T \tag{8}$$
To generate these diverse experts, we employ a hypernetwork $h_\psi$. The hypernetwork takes a low-dimensional random noise vector $z \in \mathbb{R}^k$ as input and generates the parameters $w_i \in \mathbb{R}^{d \times C}$ for each classifier head:
$$w_i = h_\psi(z_i), \quad \text{where } z_i \sim \mathrm{Dir}(\alpha), \quad \text{for } i = 1, \ldots, T \tag{9}$$
Here, $\mathrm{Dir}(\alpha)$ denotes the Dirichlet distribution with parameter $\alpha \in \mathbb{R}^k_+$, which helps in generating diverse and well-distributed input noise vectors. The hypernetwork $h_\psi$ itself consists of three linear layers with ReLU activations.
During training, we sample $\{z_i\}_{i=1}^{T}$ from $\mathrm{Dir}(\alpha)$ and use $h_\psi$ to generate the corresponding $\{w_i\}_{i=1}^{T}$. The overall loss function for a given training batch is defined as the sum of individual expert losses:
$$\mathcal{L} = \sum_{i=1}^{T} \mathcal{L}_i(f_i) \tag{10}$$
Each $\mathcal{L}_i$ is the classification loss for the $i$-th expert $f_i$. Following SADE, we define these losses to encourage diversity: $\mathcal{L}_1$ is the standard cross-entropy loss; $\mathcal{L}_2$ is the balanced softmax loss, which adjusts logits by adding the log of class prior probabilities; and $\mathcal{L}_3$ is the inverse softmax loss, which adjusts logits by adding the log of prior probabilities and subtracting scaled log inverse prior probabilities. This setup ensures that experts are trained with varying inductive biases, promoting specialization.
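
The pipeline described in this subsection can be sketched in NumPy as follows. This is a forward-pass illustration only, not the released implementation: the layer sizes, the random stand-in features, the scale `tau`, and the choice of obtaining the inverse prior by reversing a frequency-sorted prior are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C, k, T, hidden = 8, 5, 3, 3, 16  # hypothetical sizes; T = 3 experts

# Hypernetwork h_psi: three linear layers with ReLU activations in between.
psi = [
    (rng.normal(0.0, 0.1, (k, hidden)), np.zeros(hidden)),
    (rng.normal(0.0, 0.1, (hidden, hidden)), np.zeros(hidden)),
    (rng.normal(0.0, 0.1, (hidden, d * C)), np.zeros(d * C)),
]

def hypernet(z):
    """Map a noise vector z in R^k to classifier-head weights in R^{d x C}."""
    h = z
    for layer, (W, b) in enumerate(psi):
        h = h @ W + b
        if layer < len(psi) - 1:
            h = np.maximum(h, 0.0)  # ReLU on the hidden layers only
    return h.reshape(d, C)

def cross_entropy(logits, y):
    """Mean negative log-likelihood using a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()

def expert_losses(features, y, heads, prior, tau=2.0):
    """Return [L1, L2, L3]: cross-entropy, balanced softmax, inverse softmax."""
    log_prior = np.log(prior)
    # Assumption: classes are sorted by frequency, so reversing gives the inverse prior.
    log_inv_prior = np.log(prior[::-1])
    logits = [features @ W for W in heads]
    return [
        cross_entropy(logits[0], y),
        cross_entropy(logits[1] + log_prior, y),
        cross_entropy(logits[2] + log_prior - tau * log_inv_prior, y),
    ]

# Sample z_i ~ Dir(alpha) and generate one head per expert, as in Eq. (9).
zs = rng.dirichlet(np.full(k, 1.2), size=T)
heads = [hypernet(z) for z in zs]

# Toy long-tailed batch; the features stand in for the output of phi_theta.
counts = np.array([50.0, 20.0, 10.0, 5.0, 2.0])
prior = counts / counts.sum()
features = rng.normal(size=(32, d))
y = rng.integers(0, C, size=32)

losses = expert_losses(features, y, heads, prior)
total_loss = sum(losses)  # Eq. (10)
print(losses, total_loss)
```

In an actual training loop the hypernetwork and feature extractor would be updated by backpropagating `total_loss`, which would require an automatic differentiation framework rather than plain NumPy.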

Section: Stochastic Convex Ensemble for Diversity
Let $\mathcal{L}_i(\Theta, D)$ denote the loss function of the $i$-th expert $f_i$ on dataset $D$, where $\Theta = \{\theta, \psi\}$ encompasses all trainable parameters (feature extractor and hypernetwork). Our objective is to jointly optimize the losses of all $T$ experts:
$$\min_{\Theta} \sum_{i=1}^{T} \mathcal{L}_i(\Theta, D) \tag{11}$$
To further promote diversity and robustness among the experts, we introduce the Stochastic Convex Ensemble (SCE) strategy. SCE aims to minimize the worst-case loss of a convex combination of experts, effectively searching for a solution that performs well even under adversarial weighting of experts:
$$\min_{\Theta} \max_{p \in \Delta^T} \sum_{i=1}^{T} p_i \mathcal{L}_i(\Theta, D) \tag{12}$$
Here, $p = (p_1, \cdots, p_T)^\top \in \Delta^T$ is a weight vector, and $\Delta^T := \{p \in \mathbb{R}^T_+ \mid \sum_{i=1}^{T} p_i = 1\}$ represents the $T$-dimensional simplex. This objective encourages the model to be robust against different expert weightings.
Inspired by the max-min inequality, we relax the SCE objective into a more tractable form:
$$\min_{\Theta} \sum_{i=1}^{T} \mathcal{L}_i(\Theta, D) + \lambda \cdot \log \sum_{i=1}^{T} \exp\left( \frac{1}{\lambda} \mathcal{L}_i(\Theta, D) \right) \tag{13}$$
where $\lambda > 0$ is a hyperparameter controlling the strength of the diversity regularization. As $\lambda \to 0$, the regularization term $\lambda \cdot \log \sum_{i=1}^{T} \exp\left( \frac{1}{\lambda} \mathcal{L}_i(\Theta, D) \right)$ converges to $\max_i \mathcal{L}_i(\Theta, D)$, which is the value of the inner maximization in the SCE objective (12), since a linear function over the simplex attains its maximum at a vertex. Because this term is a smooth upper bound on the largest expert loss, it penalizes solutions in which one expert lags far behind the others, thus encouraging a more balanced performance across the ensemble.
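
The behaviour of the regularization term in Eq. (13) can be checked numerically. The sketch below evaluates the log-sum-exp term for a few values of $\lambda$ and shows that it stays between the largest expert loss and the largest expert loss plus $\lambda \log T$. The function names and loss values are hypothetical.

```python
import math

def smooth_max(losses, lam):
    """lam * log(sum_i exp(L_i / lam)), computed stably by factoring out the max."""
    m = max(losses)
    return m + lam * math.log(sum(math.exp((l - m) / lam) for l in losses))

def relaxed_objective(losses, lam):
    """Value of the relaxed objective in Eq. (13) for fixed expert losses."""
    return sum(losses) + smooth_max(losses, lam)

losses = [0.9, 1.4, 2.0]  # hypothetical losses of the T = 3 experts
for lam in (1.0, 0.1, 0.01):
    print(lam, smooth_max(losses, lam), relaxed_objective(losses, lam))
```

Smaller values of $\lambda$ make the term track the worst expert more tightly, while larger values spread the penalty more evenly over all experts.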

Section: Preference-Controlled Trade-off
A key advantage of our proposed paradigm is the ability to control the trade-off between head and tail classes during inference using a user-defined preference vector. This preference vector, denoted as
$$\alpha^* = (\alpha^*_1, \alpha^*_2, \alpha^*_3)^\top \in \Delta^3,$$
where $\Delta^3$ is the 3-dimensional simplex, allows users to explicitly specify their desired emphasis on different class groups.
Given a pre-trained preference vector $r = (r_1, r_2, r_3)^\top \in \Delta^3$, we compute the test-time adjusted preference vector $\hat{r} \in \Delta^3$ as:
$$\hat{r} = \frac{r \odot \alpha^*}{r^\top \alpha^*} \tag{14}$$
where $\odot$ denotes the Hadamard (element-wise) product. This normalization ensures that the adjusted preference vector $\hat{r}$ remains within the simplex. The adjusted preference vector $\hat{r}$ is then fed into the hypernetwork $h_\psi$ to dynamically generate the classifier head parameters for each expert:
$$\hat{W}_i = h_\psi(\hat{r}), \quad \text{for } i = 1, \cdots, T \tag{15}$$
Here, $\hat{W}_i \in \mathbb{R}^{d \times C}$ represents the weight matrix for the $i$-th expert's classifier head. For a given test sample with feature vector $x \in \mathbb{R}^d$, the output of the $i$-th expert is computed as:
$$\hat{y}_i = \begin{cases} \dfrac{x^\top}{\|x\|_2} \cdot \dfrac{\hat{W}_i}{\|\hat{W}_i\|_F}, & \text{if normalized feature} \\ x^\top \hat{W}_i + b_i^\top, & \text{otherwise} \end{cases} \tag{16}$$
where $\|\cdot\|_F$ is the Frobenius norm, and $b_i \in \mathbb{R}^C$ is the bias vector for the $i$-th expert. By simply adjusting $\alpha^*$, users can control the model's focus on head or tail classes, enabling flexible trade-offs to suit diverse application needs without requiring model retraining.
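
A minimal NumPy sketch of the test-time steps in Eqs. (14) and (16) is given below. The function names and numeric values are hypothetical, and the weight matrix is a random stand-in for the output of the hypernetwork.

```python
import numpy as np

def adjust_preference(r, alpha_star):
    """Eq. (14): element-wise product of r and alpha*, renormalized onto the simplex."""
    r = np.asarray(r, dtype=float)
    alpha_star = np.asarray(alpha_star, dtype=float)
    return (r * alpha_star) / (r @ alpha_star)

def expert_output(x, W, b=None, normalized=True):
    """Eq. (16): normalized (cosine-style) output, or a plain linear output with bias."""
    if normalized:
        # np.linalg.norm is the 2-norm for vectors and the Frobenius norm for matrices.
        return (x / np.linalg.norm(x)) @ (W / np.linalg.norm(W))
    return x @ W + b

r = [0.2, 0.3, 0.5]           # hypothetical pre-trained preference vector
alpha_star = [0.6, 0.3, 0.1]  # hypothetical user preference
r_hat = adjust_preference(r, alpha_star)
print(r_hat, r_hat.sum())

rng = np.random.default_rng(0)
d, C = 8, 5
x = rng.normal(size=d)
W_hat = rng.normal(size=(d, C))  # stand-in for the hypernetwork output in Eq. (15)
print(expert_output(x, W_hat))
print(expert_output(x, W_hat, b=np.zeros(C), normalized=False))
```

A uniform user preference leaves the pre-trained vector unchanged, so the adjustment only has an effect when the user expresses a non-uniform emphasis.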
To illustrate the effectiveness of this preference control, Figure 2 demonstrates how our method can overcome distribution shifts and provide flexibility. The coordinate system for preferences is three-dimensional orthogonal, while for accuracy, it represents performance on the forward50, uniform, and backward50 splits of the CIFAR100-LT dataset. The dark plane illustrates the space of different preference vectors, and the outer surface shows the corresponding performance across these three distributions. Yellow dots represent results from SADE, which lacks controllable preferences, thus yielding random performance points that lie below our purple plane, indicating that SADE's performance is dominated by our method in the Pareto optimal set. This figure visually confirms that our method can adapt to unknown distributions without additional training and, unlike previous methods, allows for performance trade-offs by adjusting the preference vector. A more in-depth analysis of this mechanism is provided in the experimental section.

Section: Experiments
In this section, we first evaluate the superiority of PRL in terms of both standard and test-agnostic long-tailed recognition to demonstrate that our method has a higher performance ceiling under the traditional setup. Then, we analyze the effectiveness of our method in changing the trade-off for long-tailed classes through input preferences. Furthermore, we conduct necessary ablation studies.

Section: Experimental Setups
Datasets. We evaluate our method on four widely-used benchmark datasets: ImageNet-LT [20], CIFAR100-LT [4], Places-LT [20], and iNaturalist 2018 [29]. These datasets exhibit varying imbalance ratios, ranging from 10 to 256, providing a comprehensive testbed. CIFAR100-LT is evaluated with three different imbalance ratios (IR=10, 50, and 100). Detailed statistics for all datasets are provided in Appendix D.
Baselines. We conduct comprehensive comparisons against various state-of-the-art long-tailed recognition methods. These include two-stage methods (MiSLAS [41]), logit-adjusted training methods (Balanced Softmax [15], LADE [12]), ensemble learning approaches (RIDE [32], SADE [38]), causal inference models (Causal [28]), representation learning techniques (LSC [33]), and balanced posterior averaging methods (BalPoE [1]). Each of these baselines addresses the long-tail problem from distinct perspectives. Further details on these methods can be found in Appendix A.
Evaluation Protocols and Implementation Details. We evaluate all models using micro accuracy on multiple test datasets, each featuring different class distributions. Performance is reported for many-shot, medium-shot, and few-shot classes to provide a granular view of class-wise performance. For fair comparison, we maintain a consistent backbone architecture across all methods for each dataset: ResNeXt-50 for ImageNet-LT, ResNet-32 for CIFAR100-LT, ResNet-152 for Places-LT, and ResNet-50 for iNaturalist 2018. Our method employs hypernetworks (implemented as MLPs) to generate the trainable parameters of the expert classifier heads, and we utilize a cosine classifier for final predictions. Unless otherwise specified, we use α = 1.2 for the Dirichlet distribution, µ = 0.3 for stochastic annealing, and train for 200 epochs using SGD with a momentum of 0.9. The initial learning rate is set to 0.1 with a linear decay schedule. During the test-time adaptation phase, aggregation weights are trained for 5 epochs with a batch size of 128, using the same optimizer and learning rate as in the main training phase. Additional implementation details are provided in Appendix G.

Section: Comparative Evaluation on Standard and Test-Agnostic Long-Tailed Recognition
We conduct extensive experiments on four widely-used long-tailed datasets, including CIFAR100-LT, Places-LT, iNaturalist 2018, and ImageNet-LT, to evaluate the performance of our proposed PRL method in comparison with state-of-the-art approaches.
Results on standard long-tailed recognition. Table 1 demonstrates the effectiveness of our proposed method, PRL, on four benchmark datasets under the standard long-tailed recognition setting, where the test class distribution is uniform. PRL consistently achieves the highest top-1 accuracy across all datasets, outperforming the previous state-of-the-art methods, LSC [33] and BalPoE [1]. On CIFAR100-LT, PRL improves the accuracy by 0.6% to 0.8% compared to LSC and BalPoE, showcasing its robustness to different imbalance ratios (IR=10, 50, and 100). Similarly, on Places-LT, iNaturalist 2018, and ImageNet-LT, PRL obtains the best performance, surpassing the existing methods by a clear margin. The superior performance of PRL in the standard long-tailed recognition setting validates the efficacy of our approach in mitigating the bias towards head classes and improving the recognition accuracy of tail classes.
Results on distribution-shift long-tailed recognition. We evaluate the performance of PRL and other methods in the distribution-shift long-tailed recognition setting, where the test class distribution is unknown and different from the training distribution. Tables 2 and 3 show the top-1 accuracy results on various test class distributions (including forward LT, uniform, and backward LT) for CIFAR100-LT (IR=100) and ImageNet-LT, respectively.
Figure 4: A more comprehensive example of how preference influences performance.
On different test distributions, PRL consistently outperforms all compared methods. Specifically, on CIFAR100-LT (IR=100), PRL achieves the highest accuracy across all settings, surpassing strong baselines like LSC [33], BalPoE [1], and SADE [38]. Even under the most challenging backward LT distribution, PRL maintains its outstanding performance. Similarly, on ImageNet-LT, PRL obtains the best results across all test distributions, significantly outperforming LSC, BalPoE, and SADE. The consistent improvements achieved by PRL highlight its higher performance ceiling and demonstrate the effectiveness of our method design in robustly overcoming distribution shifts. Detailed distribution-shift experimental results for Places-LT and iNaturalist 2018 can be found in Appendix F.

User preference control. We extensively evaluated PRL's performance on many-shot, medium-shot, and few-shot classes on CIFAR100-LT under various preference settings, represented by R vectors (e.g., R=(1.0, 2.7), R=(0.5, 2.5), R=(1.9, 1.1)). As shown in Table 4, by adjusting the preference value R, we can effectively control the trade-off between many-shot and few-shot classes. For instance, R=(1.0, 2.7) yields the best performance on many-shot classes, while R=(1.9, 1.1) prioritizes few-shot classes, albeit with a slight drop in many-shot performance. An intermediate setting like R=(0.5, 2.5) achieves optimal performance on medium-shot classes, demonstrating the method's ability to balance performance across different class frequency groups. Figure 3 visually analyzes these performance trade-offs across three different distributions, clearly illustrating how our preference control mechanism enables flexible adjustment of model behavior. Figure 4 provides a more comprehensive visualization of performance on head classes under the forward50 distribution, showing how specific preference positions (red dots for improvement, green for degradation) in polar coordinates influence head class accuracy. These results unequivocally demonstrate the effectiveness of our method in controlling long-tailed class trade-offs based on user preferences. This ability to adapt to different application scenarios and requirements by adjusting preferences, without requiring model retraining, is a significant advancement in achieving desired trade-offs in long-tailed recognition tasks.

Ablation study. We conducted ablation studies on CIFAR100-LT to dissect the impact of key components: removing the hypernetwork (w.o. hnet) and removing the Chebyshev scalarization (w.o. stch). Figure 5 illustrates the performance under different unknown test class distributions. The complete PRL model consistently outperforms its ablated versions across all distributions. Removing either the hypernetwork or the Chebyshev scalarization leads to noticeable performance degradation, underscoring their importance in dynamically adjusting model behavior to adapt to distribution shifts and in learning preference-aware representations. This ablation study supports the complementary roles of our method's components in handling unknown test distributions and mitigating data imbalance issues in long-tailed recognition.

Section: Conclusion
This study introduces a novel long-tailed learning paradigm to address distribution shifts between training and testing datasets. Our hypernetwork-based approach generates adaptable classifiers, achieving Pareto optimality for real-time adaptation. During inference, the model adjusts based on user-defined trade-offs between head and tail classes, enhancing flexibility. Empirical results show improved accuracy and adaptability to class imbalances and distribution shifts. Our work establishes an interpretable, generalizable, and controllable framework for long-tailed learning, meeting diverse user needs.

Section: B.1.1 Resampling Methods
•   **Oversampling** [5,9]: Generates synthetic examples for minority classes.
    -   *Advantages*: Alleviates imbalance by increasing tail class samples.
    -   *Disadvantages*: Can lead to overfitting on synthetic data and high computational cost.
•   **Undersampling** [7,2]: Removes examples from majority classes.
    -   *Advantages*: Simple and efficient approach to balance classes.
    -   *Disadvantages*: Discards potentially valuable head class information, leading to reduced overall performance.

Section: B.1.2 Loss Adjustment Methods
•   **Focal Loss** [17] and variants [6]:
    -   *Mechanism*: Down-weights the loss of well-classified examples, encouraging the model to focus on hard samples, particularly from tail classes.
    -   *Disadvantages*: Requires careful tuning of focusing parameters and may not generalize across all imbalance ratios.
•   **Class-Balanced Loss** [6]:
    -   *Mechanism*: Re-weights loss based on the effective number of samples per class, giving more importance to tail classes.
    -   *Disadvantages*: Assumes equal importance of classes, which may not hold in practice, and can sometimes overcompensate.
•   **LDAM (Label-Distribution-Aware Margin Loss)** [4]:
    -   *Mechanism*: Introduces class-dependent margins that are larger for rarer classes, so that tail classes are regularized more strongly.
    -   *Disadvantages*: Requires additional hyperparameters and can involve complex optimization, making it harder to implement and tune.
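
For reference, the three loss adjustments listed above can be sketched in their commonly used textbook forms. The code below is an illustrative assumption about those standard formulations (focal modulation $(1 - p_t)^\gamma$, effective-number weights $(1 - \beta) / (1 - \beta^{n_k})$, and margins proportional to $n_k^{-1/4}$); the function names and default hyperparameters are hypothetical and do not come from this paper.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample, given the predicted probability of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def class_balanced_weights(class_counts, beta=0.999):
    """Effective-number weights, rescaled so that they sum to the number of classes."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def ldam_margins(class_counts, max_margin=0.5):
    """Class-dependent margins proportional to n_k^(-1/4), capped at max_margin."""
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [m * scale for m in raw]

counts = [500, 50, 5]  # hypothetical head, medium, and tail class sizes
print(focal_loss(0.9), focal_loss(0.3))  # the easy sample contributes far less
print(class_balanced_weights(counts))    # tail classes receive larger weights
print(ldam_margins(counts))              # tail classes receive larger margins
```

Setting `gamma=0` in `focal_loss` recovers the ordinary cross-entropy term, which makes the down-weighting effect of the focusing parameter easy to compare.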

Section: B.1.3 Module Improvement Methods
•   **Decoupled Learning** [16,43]:
    -   *Mechanism*: Separates representation learning from classifier training to mitigate bias towards head classes and achieve better feature extraction.
    -   *Disadvantages*: Requires architectural changes, and the two-stage process may not generalize well across all datasets or tasks.
•   **Few-Shot Experts** [32]:
    -   *Mechanism*: Employs additional specialized experts specifically designed to handle few-shot classes, often trained with different strategies.
    -   *Disadvantages*: Increases model complexity and training difficulty due to managing multiple expert networks.
•   **Self-Supervised Pretraining** [14]:
    -   *Mechanism*: Leverages self-supervision to learn robust and balanced feature representations before fine-tuning on the long-tailed dataset.
    -   *Disadvantages*: Requires additional pretraining resources, and the benefits may be highly task-specific.

Section: B.1.4 Transfer Learning Methods
•   **Data-Based Transfer** [20,14]:
    -   *Mechanism*: Techniques like knowledge distillation and feature transformation are used to transfer knowledge from head classes to tail classes.
    -   *Disadvantages*: Assumes head and tail distributions are sufficiently related, and may suffer from negative transfer if this assumption is violated.
•   **Model-Based Transfer** [36]:
    -   *Mechanism*: Utilizes models pretrained on more abundant head classes to facilitate the learning of tail classes.
    -   *Disadvantages*: Similar to data-based transfer, it assumes related head and tail distributions, which can lead to suboptimal performance if the domains diverge significantly.
Despite significant progress, existing LTL methods face inherent limitations in addressing the trade-off between head and tail classes, handling diverse distribution shifts, and accommodating varying user preferences. To overcome these critical issues, we formulate long-tailed learning as a multi-objective optimization problem and propose a novel hypernetwork-based diverse expert learning paradigm, achieving interpretable and controllable solutions tailored to user needs under arbitrary test distribution shifts.

Section: B.2 Related Work for Multi-Objective Optimization and Hypernetworks


Section: B.2.1 Multi-Objective Optimization
Let $\mathcal{X} \subseteq \mathbb{R}^n$ be the decision space and consider $m$ objective functions $f_i: \mathcal{X} \to \mathbb{R}$, $i = 1, \ldots, m$, to be minimized simultaneously. The multi-objective optimization problem (MOP) can be stated as:
$$\min_{x \in \mathcal{X}} \{f_1(x), \ldots, f_m(x)\} \tag{17}$$
In general, there does not exist a single solution $x^* \in \mathcal{X}$ that minimizes all objectives simultaneously due to the conflicting nature of the objectives. Instead, the solution concept is that of Pareto optimality [22].
Definition 3 (Pareto Optimality). A solution $x^* \in \mathcal{X}$ is Pareto optimal if there does not exist another $x \in \mathcal{X}$ such that $f_i(x) \le f_i(x^*)$ for all $i = 1, \ldots, m$ and $f_j(x) < f_j(x^*)$ for at least one $j$.
The set of all Pareto optimal solutions is called the Pareto set, and its image in the objective space is the Pareto front. The goal in MOPs is to approximate the Pareto front as well as possible.
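
The definition of Pareto optimality translates directly into a dominance check. The following sketch filters a finite set of objective vectors down to its non-dominated subset; the function names and the candidate points are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (head-class error, tail-class error) pairs of five candidate models.
candidates = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
print(pareto_front(candidates))  # [(1, 5), (2, 2), (5, 1)]
```

The points (3, 3) and (4, 4) are removed because (2, 2) is at least as good in both objectives and strictly better in at least one.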

Section: B.2.2 Chebyshev Scalarization
A common approach to approximate the Pareto front is through scalarization methods that transform the MOP into a scalar optimization problem [22,19]. The weighted Chebyshev scalarization is defined as:
$$\min_{x \in \mathcal{X}} \max_{1 \le i \le m} w_i (f_i(x) - z^*_i) \tag{18}$$
where $w = (w_1, \ldots, w_m)^\top \in \mathbb{R}^m_+$ is a weight vector with $\sum_{i=1}^{m} w_i = 1$, and $z^* = (z^*_1, \ldots, z^*_m)^\top$ is a utopian reference point [11]. By varying $w$, different Pareto optimal solutions can be obtained.
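
As a small worked example of Eq. (18), the sketch below applies the weighted Chebyshev scalarization to a finite candidate set and shows how different weight vectors select different Pareto optimal points. The candidate values, the reference point at the origin, and the function names are assumptions made for illustration.

```python
def chebyshev_value(objectives, weights, reference):
    """Weighted Chebyshev scalarization: max_i w_i * (f_i - z_i*)."""
    return max(w * (f - z) for f, w, z in zip(objectives, weights, reference))

def chebyshev_select(candidates, weights, reference):
    """Pick the candidate with the smallest scalarized value."""
    return min(candidates, key=lambda c: chebyshev_value(c, weights, reference))

# Hypothetical Pareto optimal (head-class error, tail-class error) pairs.
front = [(1, 5), (2, 2), (5, 1)]
reference = (0, 0)  # assumed utopian point

print(chebyshev_select(front, (0.5, 0.5), reference))  # (2, 2)
print(chebyshev_select(front, (0.9, 0.1), reference))  # (1, 5)
print(chebyshev_select(front, (0.1, 0.9), reference))  # (5, 1)
```

A larger weight on an objective makes its deviation from the reference point more costly, so the selected solution moves towards lower values of that objective.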

Section: B.2.3 Hypernetworks for Multi-Objective Optimization
Hypernetworks [18] offer a promising approach for multi-objective optimization of neural networks. A hypernetwork $h_\phi : \mathcal{Z} \to \Theta$ is a neural network that takes a low-dimensional input $z \in \mathcal{Z}$ and outputs the parameters $\theta \in \Theta$ of a target neural network $f_\theta : \mathcal{X} \to \mathcal{Y}$. By sampling different $z \in \mathcal{Z}$, the hypernetwork generates an ensemble $\{f_{\theta_i}\}_i$ where $\theta_i = h_\phi(z_i)$. This ensemble can effectively approximate the Pareto front of the multi-objective optimization problem, which is defined as:
$$\min_{\theta \in \Theta} \{L_1(f_\theta), \dots, L_m(f_\theta)\} \tag{19}$$
Here, $L_i : \Theta \to \mathbb{R}$ are loss functions corresponding to the $m$ distinct objectives. The hypernetwork parameters $\phi$ can be optimized using various scalarization methods, such as the Chebyshev method:
$$\min_{\phi} \mathbb{E}_{w \sim p(w)} \left[ \max_{1 \le i \le m} w_i \left( L_i(f_{h_\phi(z)}) - z_i^* \right) \right] \tag{20}$$
where $p(w)$ is a distribution over weight vectors $w$, and $z^*$ is a reference point. This approach enables learning a diverse set of target networks that collectively approximate the Pareto front in a flexible and controllable manner, which is crucial for our controllable long-tailed learning paradigm.
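The mapping from sampled inputs to an ensemble of target parameters can be sketched with a toy linear hypernetwork. This is only an illustration under stated assumptions: the linear form $h_\phi(z) = Az + b$, the Dirichlet sampling of $z$, the linear target scorer, and all sizes are hypothetical and not taken from the paper's implementation.

```python
import random
from typing import List


def sample_dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """Draw z ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]


class LinearHypernetwork:
    """Toy h_phi(z) = A z + b mapping z in R^m to target parameters theta in R^d."""

    def __init__(self, m: int, d: int, rng: random.Random) -> None:
        self.A = [[rng.gauss(0.0, 0.1) for _ in range(m)] for _ in range(d)]
        self.b = [0.0] * d

    def __call__(self, z: List[float]) -> List[float]:
        return [sum(a * zi for a, zi in zip(row, z)) + bi
                for row, bi in zip(self.A, self.b)]


def target_network(theta: List[float], x: List[float]) -> float:
    """Toy f_theta(x): a linear scorer whose weights come from the hypernetwork."""
    return sum(t * xi for t, xi in zip(theta, x))


rng = random.Random(0)
hyper = LinearHypernetwork(m=3, d=4, rng=rng)

# Each sampled z_i yields one member theta_i = h_phi(z_i) of the ensemble.
zs = [sample_dirichlet([1.0, 1.0, 1.0], rng) for _ in range(5)]
ensemble = [hyper(z) for z in zs]
scores = [target_network(theta, [1.0, 0.5, -0.5, 2.0]) for theta in ensemble]
```

Only the hypernetwork parameters (`A` and `b` here) would be trained; every ensemble member is a function of them, which is what allows a single model to cover many trade-offs.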

Section: C Proof of Propositions

Section: C.1 Proof of Theorem 1
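Proof. The risk of classifier $f$ on the test environment $E_{\text{test}}$ is defined as:
$$R_{\text{test}}(f) = \mathbb{E}_{(x,y) \sim P_{\text{test}}(x,y)} [\ell(f(x;\theta), y)] \tag{21}$$
Using the law of total expectation, we can decompose the risk as:
$$R_{\text{test}}(f) = \sum_{i=1}^{K} \pi_i^{\text{test}} \cdot \mathbb{E}_{x \sim P_{\text{test}}(x|y=i)} [\ell(f(x;\theta), i)] \tag{22}$$
Similarly, the risk of classifier $f$ on the training environment $E_m$ can be expressed as:
$$R_m(f) = \sum_{i=1}^{K} \pi_i^{m} \cdot \mathbb{E}_{x \sim P_m(x|y=i)} [\ell(f(x;\theta), i)] \tag{23}$$
Since the classifier $f$ is learned via ERM on $E_m$, we have $P_m(x|y=i) = P_{\text{test}}(x|y=i)$ for all $i \in \{1, \dots, K\}$. Therefore,
$$\begin{aligned}
R_{\text{test}}(f) &= \sum_{i=1}^{K} \pi_i^{\text{test}} \cdot \mathbb{E}_{x \sim P_{\text{test}}(x|y=i)} [\ell(f(x;\theta), i)] \\
&= \sum_{i=1}^{K} \pi_i^{m} \cdot \mathbb{E}_{x \sim P_{\text{test}}(x|y=i)} [\ell(f(x;\theta), i)] + \sum_{i=1}^{K} (\pi_i^{\text{test}} - \pi_i^{m}) \cdot \mathbb{E}_{x \sim P_{\text{test}}(x|y=i)} [\ell(f(x;\theta), i)] \\
&= R_m(f) + \sum_{i=1}^{K} (\pi_i^{\text{test}} - \pi_i^{m}) \cdot \mathbb{E}_{x \sim P_{\text{test}}(x|y=i)} [\ell(f(x;\theta), i)]
\end{aligned} \tag{24}$$
This completes the proof.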
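The decomposition can be checked numerically on a toy three-class problem. The class priors and per-class expected losses below are hypothetical; the per-class losses are shared by both environments, which mirrors the assumption that the class-conditional distributions coincide.

```python
from typing import Sequence


def risk(priors: Sequence[float], class_losses: Sequence[float]) -> float:
    """Risk as a prior-weighted sum of per-class expected losses."""
    return sum(p * l for p, l in zip(priors, class_losses))


pi_train = [0.7, 0.2, 0.1]        # long-tailed training prior (hypothetical)
pi_test = [1 / 3, 1 / 3, 1 / 3]   # uniform test prior (hypothetical)
class_losses = [0.1, 0.4, 0.9]    # E_{x|y=i}[loss], identical in both environments

r_train = risk(pi_train, class_losses)
r_test = risk(pi_test, class_losses)

# Shift term from Theorem 1: sum_i (pi_test_i - pi_train_i) * E_{x|y=i}[loss].
shift_term = sum((pt - pm) * l for pt, pm, l in zip(pi_test, pi_train, class_losses))
```

In this example the shift term is positive because the test prior moves mass toward the classes with higher loss, so the test risk exceeds the training risk.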

Section: C.2 Proof of Corollary 1
Proof. From Theorem 1, we have:
$$R_{\text{test}}(f) = R_m(f) + \sum_{i=1}^{K} (\pi_i^{\text{test}} - \pi_i^{m}) \cdot \mathbb{E}_{x \sim P_{\text{test}}(x|y=i)} [\ell(f(x;\theta), i)] \tag{25}$$
Using the definition of $M$, we can bound the expectation term:
$$\mathbb{E}_{x \sim P_{\text{test}}(x|y=i)} [\ell(f(x;\theta), i)] \le M \tag{26}$$
Therefore, bounding each signed difference by its absolute value,
$$R_{\text{test}}(f) \le R_m(f) + M \cdot \sum_{i=1}^{K} |\pi_i^{\text{test}} - \pi_i^{m}| = R_m(f) + 2M \cdot \delta(E_m, E_{\text{test}})$$
where the last equality follows from the definition of the total variation distance $\delta(E_m, E_{\text{test}}) = \frac{1}{2} \sum_{i=1}^{K} |\pi_i^{m} - \pi_i^{\text{test}}|$. Next, we use the triangle inequality to bound $\delta(E_m, E_{\text{test}})$:
$$\delta(E_m, E_{\text{test}}) \le \delta(E_m, E_j) + \delta(E_j, E_{\text{test}}) \le \max_{i,j \in \{1, \dots, M\}} \delta(E_i, E_j) + \delta(E_j, E_{\text{test}}) \le \Delta(E_1, \dots, E_M) + \delta(E_j, E_{\text{test}})$$
for any $j \in \{1, \dots, M\}$. By taking the average over all $j$, we obtain:
$$\delta(E_m, E_{\text{test}}) \le \Delta(E_1, \dots, E_M) + \frac{1}{M} \sum_{j=1}^{M} \delta(E_j, E_{\text{test}}) \tag{27}$$
Combining this with the previous bound on $R_{\text{test}}(f)$, we have:
$$R_{\text{test}}(f) \le R_m(f) + 2M \cdot \left( \Delta(E_1, \dots, E_M) + \frac{1}{M} \sum_{j=1}^{M} \delta(E_j, E_{\text{test}}) \right) \tag{28}$$
which completes the proof.
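The first step of this argument, the bound in terms of the total variation distance between label priors, can be sanity-checked on a toy example. The priors and per-class losses are hypothetical, and `loss_bound` plays the role of $M$; the sketch does not cover the later averaging over environments.

```python
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def risk(priors: Sequence[float], class_losses: Sequence[float]) -> float:
    """Risk as a prior-weighted sum of per-class expected losses."""
    return sum(p * l for p, l in zip(priors, class_losses))


pi_train = [0.7, 0.2, 0.1]       # forward long-tailed prior (hypothetical)
pi_test = [0.1, 0.2, 0.7]        # backward long-tailed prior (hypothetical)
class_losses = [0.1, 0.4, 0.9]   # per-class expected losses (hypothetical)
loss_bound = max(class_losses)   # M: upper bound on every per-class expected loss

r_train = risk(pi_train, class_losses)
r_test = risk(pi_test, class_losses)
delta = total_variation(pi_train, pi_test)

# Upper bound from the proof: R_test <= R_m + 2 * M * delta.
upper = r_train + 2 * loss_bound * delta
```

The gap between `r_test` and `upper` shows that the bound is not tight in general; it only guarantees that the test risk cannot grow faster than the label shift allows.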

Section: C.3 Proof of Theorem 2
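Proof. By definition, the risk of the ensemble classifier $\bar{f}$ on the test environment $E_{\text{test}}$ is:
$$R_{\text{test}}(\bar{f}) = \mathbb{E}_{(x,y) \sim P_{\text{test}}} [\ell(\bar{f}(x), y)] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{(x,y) \sim P_{\text{test}}} [\ell(f_i(x), y)] = \frac{1}{N} \sum_{i=1}^{N} R_{\text{test}}(f_i)$$
Applying Corollary 1, we have:
$$\begin{aligned}
R_{\text{test}}(\bar{f}) &\le \frac{1}{N} \sum_{i=1}^{N} \left[ R_{m(i)}(f_i) + 2M \cdot \left( \delta(E_{m(i)}, E_{\text{test}}) + \Delta(E_1, \dots, E_N) \right) \right] \\
&= \frac{1}{N} \sum_{i=1}^{N} R_{m(i)}(f_i) + \frac{2M}{N} \sum_{i=1}^{N} \delta(E_{m(i)}, E_{\text{test}}) + 2M \Delta(E_1, \dots, E_N)
\end{aligned}$$
where $m(i)$ denotes the index of the training environment used to learn expert $f_i$. Now, we bound the second term:
$$\frac{1}{N} \sum_{i=1}^{N} \delta(E_{m(i)}, E_{\text{test}}) = \frac{1}{N} \sum_{m=1}^{N} \sum_{i: m(i)=m} \delta(E_m, E_{\text{test}}) \le \frac{1}{N} \sum_{m=1}^{N} N_m \cdot \delta(E_m, E_{\text{test}}) \le \frac{1}{N} \sum_{m=1}^{N} N \cdot \delta(E_m, E_{\text{test}}) = \sum_{m=1}^{N} \delta(E_m, E_{\text{test}})$$
where $N_m$ is the number of experts learned from environment $E_m$, and we used the fact that $\sum_{m=1}^{N} N_m = N$. Finally, we have:
$$\begin{aligned}
R_{\text{test}}(\bar{f}) &\le \frac{1}{N} \sum_{m=1}^{N} R_m(f_m) + 2M \cdot \left( \frac{1}{N} \sum_{m=1}^{N} \delta(E_m, E_{\text{test}}) + \Delta(E_1, \dots, E_N) \right) \\
&= \frac{1}{N} \sum_{m=1}^{N} R_m(f_m) + 2M \cdot \left( \frac{1}{N} \sum_{m=1}^{N} \delta(E_m, E_{\text{test}}) + \frac{N-1}{N} \Delta(E_1, \dots, E_N) \right)
\end{aligned}$$
where the last equality follows from the definition of ETVD.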
Next, we explain the connection between the theoretical analysis and our proposed method. The theoretical results demonstrate that, in long-tailed learning, introducing multiple training environments and minimizing empirical risks across these environments to learn a set of diverse experts can effectively address the problem of distribution shift between training and test environments, leading to better generalization performance. These theoretical insights provide important guidance for the design and further improvement of our algorithm.

Section: D Datasets
To thoroughly evaluate the effectiveness and generalization capabilities of our proposed method, we conduct extensive experiments on four widely-recognized long-tailed datasets: CIFAR100-LT, ImageNet-LT, iNaturalist 2018, and Places365-LT. These datasets represent a diverse range of domains and exhibit varying degrees of class imbalance, thereby providing a comprehensive and challenging testbed for long-tailed learning algorithms.
• ImageNet-LT [20]: This dataset is a long-tailed subset derived from the large-scale ImageNet dataset. It comprises over 115,000 images distributed across 1,000 classes. The class cardinalities follow a Pareto distribution with a parameter α = 6, resulting in a significant maximum imbalance ratio of 256.
• iNaturalist 2018 [29]: A real-world dataset characterized by a naturally occurring long-tailed distribution. It contains approximately 450,000 images spanning 8,142 distinct species. The number of images per species varies drastically, with an extreme imbalance ratio reaching up to 500. This dataset poses a substantial challenge due to its severe class imbalance and high intra-class variation.
• Places365-LT: This is a long-tailed variant of the Places365 dataset [42], which originally consists of over 1.8 million images categorized into 365 scene classes. We induce a long-tailed distribution by randomly subsampling images for each class, achieving an imbalance ratio of approximately 50. This dataset is particularly challenging given the large number of classes and the inherent visual ambiguity in scene recognition tasks.
• CIFAR100-LT [4]: This dataset is a long-tailed version of the standard CIFAR100 dataset. We evaluate our method on three distinct versions of CIFAR100-LT, corresponding to imbalance ratios (IR) of 10, 50, and 100. These controlled settings allow for a systematic analysis of our method's performance under different levels of data imbalance.
Detailed statistics for all datasets, including class distributions and sample counts, are provided in Table 5.
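To illustrate what an imbalance ratio means in terms of per-class sample counts, the sketch below builds class-count profiles for a 100-class dataset with 500 training images per class before subsampling (as in CIFAR100). The exponential decay profile is an assumption borrowed from common practice for constructing CIFAR-LT variants; the exact construction used for each dataset here may differ.

```python
from typing import List


def long_tailed_counts(num_classes: int, max_per_class: int, imbalance_ratio: float) -> List[int]:
    """Per-class sample counts decaying exponentially from head to tail.

    Class 0 keeps max_per_class samples; the last class keeps roughly
    max_per_class / imbalance_ratio samples.
    """
    return [
        int(max_per_class * (1.0 / imbalance_ratio) ** (i / (num_classes - 1)))
        for i in range(num_classes)
    ]


# One profile per imbalance ratio evaluated on CIFAR100-LT.
profiles = {ir: long_tailed_counts(100, 500, ir) for ir in (10, 50, 100)}

for ir, counts in profiles.items():
    print(f"IR={ir}: head={counts[0]}, tail={counts[-1]}, total={sum(counts)}")
```

The imbalance ratio is simply the head-class count divided by the tail-class count, so a larger ratio leaves fewer samples for the rarest classes.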

Section: E Pseudo Code
Here are pseudo codes explaining the core aspects of our method:
Algorithm 1: Diverse Expert Learning with Hypernetworks
Input: Training data with long-tailed distribution D_train
Output: Ensemble of expert models E = {E_1, E_2, ..., E_K}

Initialize shared feature extractor F_ψ with weights ψ
Initialize hypernetwork H_θ with weights θ
Initialize expert loss function L (e.g., DiverseExpertLoss)

for epoch = 1 to max_epochs do
  for each batch (x, y) ⊆ D_train do
    f = F_ψ(x) {Shared feature extraction}
    for k = 1 to K do
      z_k ~ Dir(α) {Sample input for hypernetwork}
      ϕ_k = H_θ(z_k) {Generate expert weights ϕ_k from hypernetwork}
      E_k = G_ϕ_k(f) {Obtain expert predictions using ϕ_k, where G is the classifier head}
      L_k = L(E_k, y, extra_info) {Compute expert loss}
    end for
    L_div = DiversityLoss(E_1, ..., E_K) {Encourage expert diversity, e.g., using SCE regularization}
    L_total = Σ_k L_k + λ * L_div {Total loss for updating feature extractor and hypernetwork weights}
    (ψ, θ) = (ψ, θ) - η∇_(ψ,θ) L_total {Update feature extractor and hypernetwork weights}
  end for
end for
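The inner loop of Algorithm 1, from the shared features through the total loss, can be sketched in plain Python for a single sample. This is a forward pass only, under several assumptions: the hypernetwork is a linear map, cross-entropy stands in for the expert loss, and a negative mean pairwise squared distance stands in for the diversity term (the paper's DiverseExpertLoss and SCE regularization are not reproduced here). The gradient update would be delegated to an automatic differentiation framework.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def sample_dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """z ~ Dir(alpha) via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    t = sum(g)
    return [v / t for v in g]


def hypernetwork(theta, z, feat_dim, num_classes):
    """H_theta(z): linear map from z to a (num_classes x feat_dim) head (assumed form)."""
    flat = [sum(w * zi for w, zi in zip(row, z)) for row in theta]
    return [flat[c * feat_dim:(c + 1) * feat_dim] for c in range(num_classes)]


def expert_head(phi, f):
    """G_phi(f): class probabilities from the shared features f."""
    return softmax([sum(w * fi for w, fi in zip(row, f)) for row in phi])


def diversity_loss(preds):
    """Stand-in for L_div: negative mean pairwise squared distance (needs K >= 2)."""
    pairs = [(a, b) for i, a in enumerate(preds) for b in preds[i + 1:]]
    dists = [sum((p - q) ** 2 for p, q in zip(a, b)) for a, b in pairs]
    return -sum(dists) / len(dists)


def total_loss(f, y, theta, num_experts, alpha, lam, rng, feat_dim, num_classes):
    preds, losses = [], []
    for _ in range(num_experts):
        z = sample_dirichlet(alpha, rng)                     # z_k ~ Dir(alpha)
        phi = hypernetwork(theta, z, feat_dim, num_classes)  # phi_k = H_theta(z_k)
        p = expert_head(phi, f)                              # E_k = G_phi_k(f)
        preds.append(p)
        losses.append(-math.log(p[y]))                       # cross-entropy stand-in for L_k
    return sum(losses) + lam * diversity_loss(preds)         # L_total


rng = random.Random(0)
feat_dim, num_classes, z_dim = 4, 3, 3
theta = [[rng.gauss(0.0, 0.1) for _ in range(z_dim)]
         for _ in range(feat_dim * num_classes)]
f = [0.5, -1.0, 0.25, 2.0]  # pretend output of the shared extractor for one sample
loss = total_loss(f, y=1, theta=theta, num_experts=3, alpha=[1.0] * z_dim,
                  lam=0.1, rng=rng, feat_dim=feat_dim, num_classes=num_classes)
```

Because all experts share the same hypernetwork weights, the only source of difference between them in this sketch is the sampled input `z`, which is why a diversity term is needed to keep the generated heads from collapsing to one solution.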

Section: F Results on ImageNet-LT and iNaturalist 2018 Datasets
On the representative Places-LT dataset, our PRL method achieves the best Top-1 accuracy under various unknown test class distributions. Specifically, in the Forward-LT setting, as the imbalance ratio of the test distribution decreases from 50 to 2, the Top-1 accuracy of PRL drops from 47.9% to 42.8% but remains above that of all baseline methods. Under the Uniform distribution, PRL reaches the highest accuracy of 41.9%. In the Backward-LT setting, PRL's accuracy gradually increases from 41.7% to 44.1%, again surpassing all competing methods. These consistent results support the robustness of our method under diverse unknown test class distributions.
On the iNaturalist 2018 dataset, PRL also performs well. In the Forward-LT setting, when the imbalance ratio of the test distribution decreases from 3 to 2, PRL's Top-1 accuracy increases slightly from 73.7% to 73.8%, and it reaches its best performance of 74.3% under the Uniform distribution. In the Backward-LT setting, although PRL's accuracy decreases slightly from 74.0% to 73.9%, it consistently outperforms all comparison methods. These results further confirm the effectiveness and generalization capability of our method across different datasets and scenarios.
Overall, by successfully tackling the challenges of long-tailed distributions and unknown class distributions, the PRL method consistently demonstrates superior performance on these representative long-tailed datasets, thereby validating the robustness and effectiveness of our approach.

Section: G Complexity Analysis
The hypernetwork outputs the D trainable parameters of each expert classifier head. The number of parameters in the hypernetwork is therefore E × D × K, where E is the number of input channels to the output layer of the hypernetwork and K is the number of experts. Treating E and K as constants, the time complexity of the hypernetwork is O(D).
If the total number of parameters in the model without the hypernetwork is O(N ), and the computational complexity of the model's main operations is O(M × N ), where M > 1 is a complexity factor, then when N ≫ D, the overall time complexity becomes
O(D) + O(M × N ) = O(M × N ).
Therefore, while the hypernetwork increases the total number of parameters, its impact on the overall computational complexity is relatively small, especially when the number of parameters D generated by the hypernetwork is much smaller than the total number of parameters N in the main model. As can be seen from the table, the introduction of hypernetworks results in almost no increase in GFLOPs (floating-point operations) across different models.
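A back-of-the-envelope calculation makes the N ≫ D argument concrete. All sizes below are hypothetical: a 64-dimensional feature vector, 100 classes, 8 input channels to the hypernetwork's output layer, 3 experts, and a 25-million-parameter main model.

```python
def hypernetwork_params(in_channels: int, head_params: int, num_experts: int) -> int:
    """E x D x K parameters, following the count given in the text."""
    return in_channels * head_params * num_experts


def relative_overhead(main_params: int, in_channels: int, head_params: int, num_experts: int) -> float:
    """Hypernetwork parameters as a fraction of the main model's parameters."""
    return hypernetwork_params(in_channels, head_params, num_experts) / main_params


E = 8                 # input channels to the hypernetwork output layer (hypothetical)
D = 64 * 100 + 100    # linear head: 64-d features -> 100 classes, plus biases (hypothetical)
K = 3                 # number of experts (hypothetical)
N = 25_000_000        # parameters of the main model (hypothetical)

extra = hypernetwork_params(E, D, K)
ratio = relative_overhead(N, E, D, K)
```

With these numbers the hypernetwork adds 156,000 parameters, well under one percent of the main model, which is consistent with the claim that its cost is dominated by the O(M × N) term.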

Section: H Limitations
While our proposed novel approach, utilizing a hypernetwork to generate multiple diverse expert models, demonstrates significant potential in enabling controllable adjustment of head and tail class weights for long-tailed datasets and improving robustness to distribution shifts, its introduction also presents new challenges for model training and convergence. As an additional neural network module, the hypernetwork generates weight parameters for the classifier heads of each expert, which can significantly increase the total number of trainable parameters. This increase may affect training stability and convergence speed. We provide an analysis of its computational overhead in Section G, but further research into more efficient training strategies and ensuring stable controllability under various complex scenarios is still necessary.

Section: I Broader Impacts
Our proposed novel approach, which generates multiple expert models via hypernetworks, enables dynamic adjustment of head and tail class weights for long-tailed datasets and significantly improves model robustness to distribution shifts. This enhanced flexibility and robustness are of considerable value across a multitude of practical applications, particularly in fields where data imbalance and dynamic test conditions are prevalent (e.g., medical diagnosis, ecological monitoring, fraud detection). This work provides a new and viable solution to the critical challenges of long-tailed distributions and distribution shifts. By enhancing the generalization capabilities and practical applicability of existing models, our research holds promise to contribute positively to technological advancements in relevant fields, fostering more equitable and effective AI systems.

Section: NeurIPS Paper Checklist


Section: Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes] Justification: We describe the contributions of this paper in detail.


Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes] Justification: We discuss the limitations of this paper in the appendix.

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Answer: [Yes] Justification: We provide the complete code as well as the details of the publicly available datasets used.

Section: Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide the relevant details and analysis of the experimental results.

Section: Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes] Justification: We reported the average results of multiple tests in the experiment.

Section: Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We have provided the code for easy reproduction.

Section: Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes] Justification: We are fully qualified.

Section: Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes] Justification: We discuss the implications in the appendix.

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA] Justification: Our paper poses no such risks.

Section: Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes] Justification: We give references to the datasets and code used.

Section: Acknowledgements
This work was supported by the Natural Science Foundation of China Youth Project (No. 62402472), the Natural Science Foundation of Jiangsu Province of China Youth Project (No. BK20240461), the Research Grants Council of the Hong Kong Special Administrative Region, China (GRF Project No. CityU 11215723), National Natural Science Foundation of China (No.62072427, No.12227901), the Project of Stable Support for Youth Team in Basic Research Field, CAS (No.YSBR-005), and Academic Leaders Cultivation Program, USTC.

Section: Appendix Breaking Long-Tailed Learning Bottlenecks: A Controllable Paradigm with Hypernetwork-Generated Diverse Experts

Section: A Baselines Details
In this section, we provide a comprehensive overview of some state-of-the-art methods for long-tailed recognition, which will serve as baselines for comparison with our proposed PRL approach.
• Two-stage methods decouple representation learning and classifier training to mitigate the bias towards head classes. MiSLAS [41] introduces a mixup-based strategy in the second stage to enhance the learning of tail classes. By separating the learning process, these methods can alleviate the negative impact of imbalanced data on feature extraction.
• Logit-adjusted training methods focus on modifying the logits during training to address class imbalance. Balanced Softmax [15] introduces a class-balanced term to the softmax function, which adaptively adjusts the logits based on the sample frequencies. LADE [12] disentangles the learning of feature representations and classifier by adding a learnable logit adjustment term. These methods effectively prevent the model from being biased towards head classes.
• Ensemble learning methods leverage multiple classifiers or experts to capture the diversity of the data. RIDE [32] trains multiple experts with different resampling strategies and dynamically combines their outputs based on the sample distributions. SADE [38] further improves upon RIDE by introducing a self-adaptive distillation mechanism to transfer knowledge among experts. By exploiting the diversity of experts, these methods can better handle imbalanced data.
• Causal inference methods aim to address the long-tail problem by designing causal classifiers. Causal [28] proposes a causal inference framework that identifies the causal effect of each class on the predictions, thus reducing the bias introduced by the imbalanced data distribution.
• Representation learning methods tackle long-tailed recognition by learning more balanced and discriminative features. LSC [33] introduces a contrastive learning framework that balances the instance-level and group-level distributions simultaneously, leading to more effective representations for tail classes.
• Balanced posterior averaging methods focus on combining the predictions of multiple experts based on their posterior probabilities. BalPoE [1] proposes a balanced posterior averaging strategy that assigns higher weights to experts with better performance on tail classes, thus achieving a better trade-off between head and tail classes.
While these methods have made significant progress in addressing the long-tail problem, they often rely on specific assumptions about the data distributions during training or testing, limiting their applicability in real-world scenarios. Moreover, most of these methods do not provide a mechanism for users to control the trade-off between head and tail classes based on their specific needs. In contrast, our proposed PRL approach overcomes these limitations by learning a diverse set of experts that can adapt to various test distributions without any prior assumptions, while also enabling interpretable and controllable trade-offs through Pareto optimization.

Section: B Supplementary materials for related work
Long-tailed distributions, where a few classes (heads) have abundant samples while many classes (tails) have few samples, are widespread in real-world data [30,20]. This imbalance poses significant challenges for machine learning models, which tend to perform poorly on tail classes. To address this issue, various long-tailed learning (LTL) methods have been proposed.
Justification: We provide a complete demonstration process and data support.

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes] Justification: We provide implementation details and code.

Section: Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?


References:
[b0] Emanuel Sanchez Aimar; Arvi Jonnarth; Michael Felsberg; Marco Kuhlmann (2023). Balanced product of calibrated experts for long-tailed recognition. 
[b1] Mateusz Buda; Atsuto Maki; Maciej A Mazurowski (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural networks
[b2] Jonathon Byrd; Zachary Lipton (2019). What is the effect of importance weighting in deep learning?. PMLR
[b3] Kaidi Cao; Colin Wei; Adrien Gaidon; Nikos Arechiga; Tengyu Ma (2019). Learning imbalanced datasets with label-distribution-aware margin loss. 
[b4] Nitesh V Chawla; Kevin W Bowyer; Lawrence O Hall; W Philip Kegelmeyer (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research
[b5] Yin Cui; Menglin Jia; Tsung-Yi Lin; Yang Song; Serge Belongie (2019). Class-balanced loss based on effective number of samples. 
[b6] Chris Drummond; Robert C Holte (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. 
[b7] Kai Gan; Tong Wei (2024). Erasing the bias: Fine-tuning foundation models for semi-supervised learning. 
[b8] Hui Han; Wen-Yuan Wang; Bing-Huan Mao (2005). Borderline-smote: a new over-sampling method in imbalanced data sets learning. Springer
[b9] Haibo He; Edwardo A Garcia (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data engineering
[b10] Claus Hillermeier (2001). Nonlinear multiobjective optimization: a generalized homotopy approach. Springer Science & Business Media
[b11] Youngkyu Hong; Seungju Han; Kwanghee Choi; Seokjun Seo; Beomsu Kim; Buru Chang (2021). Disentangling label distribution for long-tailed visual recognition. 
[b12] Chen Huang; Yining Li; Chen Change Loy; Xiaoou Tang (2016). Learning deep representation for imbalanced classification. 
[b13] Muhammad Abdullah Jamal; Matthew Brown; Ming-Hsuan Yang; Liqiang Wang; Boqing Gong (2020). Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. 
[b14] Jiawei Ren; Cunjun Yu; Xiao Ma; Haiyu Zhao; Shuai Yi (2020). Balanced meta-softmax for long-tailed visual recognition. 
[b15] Bingyi Kang; Saining Xie; Marcus Rohrbach; Zhicheng Yan; Albert Gordo; Jiashi Feng; Yannis Kalantidis (2019). Decoupling representation and classifier for long-tailed recognition. 
[b16] Tsung-Yi Lin; Priya Goyal; Ross Girshick; Kaiming He; Piotr Dollár (2017). Focal loss for dense object detection. 
[b17] Xi Lin; Zhiyuan Yang; Qingfu Zhang; Sam Kwong (2020). Controllable pareto multi-task learning. 
[b18] Xi Lin; Xiaoyuan Zhang; Zhiyuan Yang; Fei Liu; Zhenkun Wang; Qingfu Zhang (2024). Smooth tchebycheff scalarization for multi-objective optimization. 
[b19] Ziwei Liu; Zhongqi Miao; Xiaohang Zhan; Jiayun Wang; Boqing Gong; Stella X Yu (2019). Large-scale long-tailed recognition in an open world. 
[b20] Aditya Krishna Menon; Sadeep Jayasumana; Ankit Singh Rawat; Himanshu Jain; Andreas Veit; Sanjiv Kumar (2020). Long-tail learning via logit adjustment. 
[b21] Kaisa Miettinen (1999). Nonlinear multiobjective optimization. Springer Science & Business Media
[b22] Mengye Ren; Eleni Triantafillou; Sachin Ravi; Jake Snell; Kevin Swersky; Joshua B Tenenbaum; Hugo Larochelle; Richard S Zemel (2018). Meta-learning for semi-supervised few-shot classification. 
[b23] Li Shen; Zhouchen Lin; Qingming Huang (2016). Relay backpropagation for effective learning of deep convolutional neural networks. Springer
[b24] Jiang-Xin Shi; Tong Wei; Yuke Xiang; Yu-Feng Li (2023). How re-sampling helps for long-tail learning?. Advances in Neural Information Processing Systems
[b25] Jun Shu; Qi Xie; Lixuan Yi; Qian Zhao; Sanping Zhou; Zongben Xu; Deyu Meng (2019). Meta-weight-net: Learning an explicit mapping for sample weighting. Advances in neural information processing systems
[b26] Jingru Tan; Changbao Wang; Buyu Li; Quanquan Li; Wanli Ouyang; Changqing Yin; Junjie Yan (2020). Equalization loss for long-tailed object recognition. 
[b27] Kaihua Tang; Jianqiang Huang; Hanwang Zhang (2020). Long-tailed classification by keeping the good and removing the bad momentum causal effect. Advances in neural information processing systems
[b28] Grant Van Horn; Oisin Mac Aodha (2018). The inaturalist species classification and detection dataset. 
[b29] Grant Van Horn; Pietro Perona (2017). The devil is in the tails: Fine-grained classification in the wild. 
[b30] Kun Wang; Hao Wu; Guibin Zhang; Junfeng Fang; Yuxuan Liang; Yuankai Wu; Roger Zimmermann; Yang Wang (2024). Modeling spatio-temporal dynamical systems with neural discrete learning and levels-of-experts. IEEE Transactions on Knowledge and Data Engineering
[b31] Xudong Wang; Long Lian; Zhongqi Miao; Ziwei Liu; Stella X Yu (2021). Long-tailed recognition by routing diverse distribution-aware experts. 
[b32] Tong Wei; Zhen Mao; Zi-Hao Zhou; Yuanyu Wan; Min-Ling Zhang (2023). Learning label shift correction for test-agnostic long-tailed recognition. 
[b33] Wei Xu; Pengkun Wang; Zhe Zhao; Binwu Wang; Xu Wang; Yang Wang (2024). When imbalance meets imbalance: Structure-driven learning for imbalanced graph classification. 
[b34] Yuzhe Yang; Zhi Xu (2020). Rethinking the value of labels for improving class-imbalanced learning. 
[b35] Xi Yin; Xiang Yu; Kihyuk Sohn; Xiaoming Liu; Manmohan Chandraker (2019). Feature transfer learning for face recognition with under-represented data. 
[b36] Songyang Zhang; Zeming Li; Shipeng Yan; Xuming He; Jian Sun (2021). Distribution alignment: A unified framework for long-tail visual recognition. 
[b37] Yifan Zhang; Bryan Hooi; Lanqing Hong; Jiashi Feng (2022). Self-supervised aggregation of diverse experts for test-agnostic long-tailed recognition. Advances in Neural Information Processing Systems
[b38] Yifan Zhang; Bingyi Kang; Bryan Hooi; Shuicheng Yan; Jiashi Feng (2023). Deep long-tailed learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence
[b39] Zhe Zhao; Pengkun Wang; Haibin Wen; Wei Xu; Song Lai; Qingfu Zhang; Yang Wang (). Two fists, one heart: Multi-objective optimization based strategy fusion for long-tailed learning. 
[b40] Zhisheng Zhong; Jiequan Cui; Shu Liu; Jiaya Jia (2021). Improving calibration for long-tailed recognition. 
[b41] Bolei Zhou; Agata Lapedriza; Aditya Khosla; Aude Oliva; Antonio Torralba (2017). Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence
[b42] Boyan Zhou; Quan Cui; Xiu-Shen Wei; Zhao-Min Chen (2020). Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. 

Figures:
Figure fig_0: 
Type: figure
Caption: (a) Traditional long-tailed learning method (b) Multi-expert long-tailed learning method (c) Our proposed method
Data: 

Figure fig_1: 1
Type: figure
Caption: Figure 1: Illustration of our method: (a) Existing methods train for a specific long-tailed distribution but may fail on arbitrarily skewed test distributions. (b) Multi-expert methods learn different experts for different distributions from one training set but lack flexibility for arbitrary distributions and preferences. (c) Our method samples preference vectors during training to simulate distributions, and can adjust the preference vector during testing for flexible long-tailed classification.
Data: 

Figure fig_2: 11
Type: figure
Caption: Definition 1 (Distribution Discrepancy across Environments). Given M training environments E_1 , \ldots , E_M with class prior probability vectors \pi_1 , \ldots , \pi_M , respectively, where \pi_m = (\pi_{m_1} , \ldots , \pi_{m_K}), \pi_{m_k} denotes the probability of the k-th class appearing in environment E_m , and \sum_{k=1}^K \pi_{m_k} = 1. If there exist i, j, k, l such that …, these M environments are said to have distribution discrepancy.
Data: 

Figure fig_3: 2
Type: figure
Caption: Figure 2: Mapping from preference to model properties.
Data: 

Figure fig_4: 3
Type: figure
Caption: Figure 3: Analysis of the preference control for the trade-off between head and tail class performance. We present the results on three distributions. The vertical axis represents accuracy. The horizontal axis shows the classes grouped by frequency, from head classes to tail classes (left to right).
Data: 

Figure fig_5: 5
Type: figure
Caption: Figure 5: Ablation analysis, including the ablation of the hypernetwork and Chebyshev polynomials.
Data: 

Figure fig_6: 3
Type: figure
Caption: 3. Theorem 2 provides a tighter upper bound on the risk of the diversity-aware ensemble classifier f in the test environment. Compared to the single-environment ERM, the average empirical risk term \frac{1}{N} \sum_{m=1}^N R_m (f_m) in the upper bound indicates that the ensemble classifier can reduce empirical risk, while the presence of \frac{1}{N} \sum_{m=1}^N δ(E_m , E_{test}) shows that the diversity-aware expert method, by learning a set of experts to capture the distribution characteristics of different environments, can narrow the distribution gap between the training and test environments, thereby achieving better generalization performance.
Data: 

Figure fig_7: 7
Type: figure
Caption: Algorithm 1 (continued)
7: for each batch B ⊆ D_train do
8: f = F(x) {Shared feature extraction}
9:
Data: 

Figure tab_0: 1
Type: table
Caption: Top-1 accuracy on CIFAR100-LT, Places-LT, iNaturalist 2018, and ImageNet-LT, where the test class distribution is uniform.
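Data:
| Method | CIFAR100-LT (IR=10) | CIFAR100-LT (IR=50) | CIFAR100-LT (IR=100) | Places-LT | iNaturalist 2018 | ImageNet-LT |
|---|---|---|---|---|---|---|
| Softmax | 59.1 | 45.6 | 41.4 | 31.4 | 64.7 | 48.0 |
| Causal [28] | 59.4 | 48.8 | 45.0 | 32.2 | 64.4 | 50.3 |
| Balanced Softmax [15] | 61.0 | 50.9 | 46.1 | 39.4 | 70.6 | 52.3 |
| MiSLAS [41] | 62.5 | 51.5 | 46.8 | 38.3 | 70.7 | 51.4 |
| LADE [12] | 61.6 | 50.1 | 45.6 | 39.2 | 69.3 | 52.3 |
| RIDE [32] | 61.8 | 51.7 | 48.0 | 40.3 | 71.8 | 56.3 |
| SADE [38] | 63.6 | 53.8 | 48.8 | 40.9 | 72.7 | 58.8 |
| LSC [33] | 65.0 | 56.5 | 51.8 | 41.3 | 73.9 | 60.2 |
| BalPoE [1] | 64.8 | 56.3 | 52.0 | 40.8 | 75.0 | 59.3 |
| PRL (ours) | 65.6 | 57.3 | 52.8 | 41.6 | 75.1 | 60.8 |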

Figure tab_1: 2
Type: table
Caption: Top-1 accuracy on CIFAR100-LT (IR100) with various unknown test class distributions.
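Data:
| Method | Prior | Forward-LT 50 | Forward-LT 25 | Forward-LT 10 | Forward-LT 5 | Forward-LT 2 | Uni. 1 | Backward-LT 2 | Backward-LT 5 | Backward-LT 10 | Backward-LT 25 | Backward-LT 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Softmax | ✗ | 63.3 | 62.0 | 56.2 | 52.5 | 46.4 | 41.4 | 36.5 | 30.5 | 25.8 | 21.7 | 17.5 |
| BS | ✗ | 57.8 | 55.5 | 54.2 | 52.0 | 48.7 | 46.1 | 43.6 | 40.8 | 38.4 | 36.3 | 33.7 |
| MiSLAS | ✗ | 58.8 | 57.2 | 55.2 | 53.0 | 49.6 | 46.8 | 43.6 | 40.1 | 37.7 | 33.9 | 32.1 |
| LADE | ✗ | 56.0 | 55.5 | 52.8 | 51.0 | 48.0 | 45.6 | 43.2 | 40.0 | 38.3 | 35.5 | 34.0 |
| LADE | ✓ | 62.6 | 60.2 | 55.6 | 52.7 | 48.2 | 45.6 | 43.8 | 41.1 | 41.5 | 40.7 | 41.6 |
| RIDE | ✗ | 63.0 | 59.9 | 57.0 | 53.6 | 49.4 | 48.0 | 42.5 | 38.1 | 35.4 | 31.6 | 29.2 |
| SADE | ✗ | 65.2 | 62.5 | 58.8 | 55.4 | 51.2 | 48.8 | 43.0 | 43.9 | 42.4 | 42.2 | 42.0 |
| LSC | ✗ | 67.8 | 64.2 | 60.2 | 58.1 | 53.2 | 51.6 | 44.7 | 45.7 | 44.2 | 44.7 | 48.0 |
| BalPoE | ✗ | 69.0 | 65.2 | 61.2 | 59.0 | 54.2 | 51.7 | 45.7 | 46.6 | 45.2 | 45.2 | 45.8 |
| PRL (ours) | ✗ | 69.5 | 65.7 | 61.7 | 59.5 | 54.7 | 52.2 | 46.2 | 47.1 | 45.7 | 45.7 | 48.5 |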

Figure tab_2: 3
Type: table
Caption: Top-1 accuracy on ImageNet-LT with various unknown test class distributions.
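Data:
| Method | Prior | Forward-LT 50 | Forward-LT 25 | Forward-LT 10 | Forward-LT 5 | Forward-LT 2 | Uni. 1 | Backward-LT 2 | Backward-LT 5 | Backward-LT 10 | Backward-LT 25 | Backward-LT 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Softmax | ✗ | 66.1 | 63.8 | 60.3 | 56.6 | 52.0 | 48.0 | 43.9 | 38.6 | 34.9 | 30.9 | 27.6 |
| BS | ✗ | 63.2 | 61.9 | 59.5 | 57.2 | 54.4 | 52.3 | 50.0 | 47.0 | 45.0 | 42.3 | 40.8 |
| MiSLAS | ✗ | 61.6 | 60.4 | 58.0 | 56.3 | 53.7 | 51.4 | 49.2 | 46.1 | 44.0 | 41.5 | 39.5 |
| LADE | ✗ | 63.4 | 62.1 | 59.9 | 57.4 | 54.6 | 52.3 | 49.9 | 46.8 | 44.9 | 42.7 | 40.7 |
| LADE | ✓ | 65.8 | 63.8 | 60.6 | 57.5 | 54.5 | 52.3 | 50.4 | 48.8 | 48.6 | 49.0 | 49.2 |
| RIDE | ✗ | 67.6 | 66.3 | 64.0 | 61.7 | 58.9 | 56.3 | 54.0 | 51.0 | 48.7 | 46.2 | 44.0 |
| SADE | ✗ | 69.7 | 67.5 | 65.4 | 62.3 | 60.3 | 58.3 | 56.7 | 54.9 | 54.3 | 53.1 | 52.6 |
| LSC | ✗ | 72.0 | 69.7 | 67.5 | 65.3 | 62.7 | 60.2 | 59.2 | 58.5 | 57.9 | 57.5 | 57.0 |
| BalPoE | ✗ | 72.2 | 69.7 | 67.2 | 64.3 | 62.2 | 59.5 | 58.5 | 57.7 | 56.9 | 56.7 | 56.6 |
| PRL (ours) | ✗ | 72.7 | 70.2 | 68.0 | 65.8 | 63.2 | 60.7 | 59.7 | 59.0 | 58.4 | 58.0 | 57.5 |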

Figure tab_3: 4
Type: table
Caption: Control of the trade-off preference for long-tailed classes under different preference vectors R. Bold text, underlined text, and dashed underline respectively indicate the highest performance of the head, middle, and tail classes in each row.
Data:
| Dist. | IR | R=(1.0, 2.7) Many | R=(1.0, 2.7) Middle | R=(1.0, 2.7) Few | R=(0.5, 2.5) Many | R=(0.5, 2.5) Middle | R=(0.5, 2.5) Few | R=(1.9, 1.1) Many | R=(1.9, 1.1) Middle | R=(1.9, 1.1) Few |
|---|---|---|---|---|---|---|---|---|---|---|
| Forward | 50 | 61.4 | 50.4 | 36.5 | 61.0 | 52.6 | 31.5 | 61.1 | 48.9 | 40.3 |
| Forward | 25 | 61.6 | 48.3 | 28.4 | 60.6 | 49.6 | 31.5 | 59.7 | 49.4 | 33.1 |
| Uni | 1 | 61.6 | 51.4 | 33.2 | 61.6 | 51.5 | 33.2 | 61.6 | 51.4 | 33.2 |
| Backward | 25 | 63.8 | 49.4 | 31.1 | 60.2 | 48.2 | 32.1 | 63.2 | 48.2 | 32.2 |
| Backward | 50 | 66.6 | 47.1 | 30.6 | 66.1 | 48.9 | 30.9 | 64.6 | 47.8 | 31.7 |

Figure tab_4: 
Type: table
Caption: 1. Theorem 1 states that for a classifier f learned via ERM on a single environment E_m , its risk on the test environment E_{test} is influenced not only by the distribution discrepancy between the training and test environments, but also by the distribution discrepancy among the training environments (i.e., ETVD). This reveals the limitation of the single-environment ERM method. 2. Corollary 1 further quantifies an upper bound on the risk of the ERM-learned classifier in the test environment. This upper bound consists of the training risk R_m (f ), the TVD δ(E_m , E_{test}) between the training and test environments, and the ETVD ∆(E_1 , \ldots , E_M ) among the training environments. To overcome distribution shift, we need to learn a set of diverse expert models that can capture the distribution characteristics of different environments.
Data: 

Figure tab_5: 5
Type: table
Caption: Statistics of the long-tailed datasets.
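Data:
| Dataset | # Classes | # Train | # Test | Imbalance Ratio |
|---|---|---|---|---|
| CIFAR100-LT | 100 | 50,000 | 10,000 | {10, 50, 100} |
| ImageNet-LT | 1,000 | 115,846 | 50,000 | 256 |
| iNaturalist 2018 | 8,142 | 437,513 | 24,426 | 500 |
| Places365-LT | 365 | 1,803,460 | 36,500 | ~50 |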

Figure tab_6: 
Type: table
Caption: Algorithm 1 Diverse Expert Learning with Hypernetworks
1: Input: Training data with long-tailed distribution D_train
2: Output: Ensemble of expert models E = {E_1 , E_2 , \ldots , E_K }
3: Initialize shared feature extractor F
4: Initialize hypernetwork H_θ with weights θ
5: Initialize expert loss function L (e.g., DiverseExpertLoss)
Data: 

Figure tab_7: 6
Type: table
Caption: List of Key Symbols in PseudoCode
Data:
| Symbol | Description |
|---|---|
| D_train | Training data with long-tailed distribution |
| E = {E_1 , E_2 , \ldots , E_K } | Ensemble of K expert models |
| F | Shared feature extractor |
| H_θ | Hypernetwork with weights θ |
| L | Expert loss function (e.g., DiverseExpertLoss) |
| z_k | Input to hypernetwork for generating weights of expert E_k |
| ϕ_k | Weights of expert E_k generated by hypernetwork |
| L_k | Loss of expert E_k |
| L_div | Diversity loss to encourage expert diversity |
| L_total | |

Figure tab_8: 7
Type: table
Caption: Top-1 accuracy on Places-LT with various unknown test class distributions.
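Data:
| Method | Prior | Forward-LT 50 | Forward-LT 25 | Forward-LT 10 | Forward-LT 5 | Forward-LT 2 | Uni. 1 | Backward-LT 2 | Backward-LT 5 | Backward-LT 10 | Backward-LT 25 | Backward-LT 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Softmax | ✗ | 45.6 | 42.7 | 40.2 | 38.0 | 34.1 | 31.4 | 28.4 | 25.4 | 23.4 | 20.8 | 19.4 |
| BS | ✗ | 42.7 | 41.7 | 41.3 | 41.0 | 40.0 | 39.4 | 38.5 | 37.8 | 37.1 | 36.2 | 35.6 |
| MiSLAS | ✗ | 40.9 | 39.7 | 39.5 | 39.6 | 38.8 | 38.3 | 37.3 | 36.7 | 35.8 | 34.7 | 34.4 |
| LADE | ✗ | 42.8 | 41.5 | 41.2 | 40.8 | 39.8 | 39.2 | 38.1 | 37.6 | 36.9 | 36.0 | 35.7 |
| LADE | ✓ | 46.3 | 44.2 | 42.2 | 41.2 | 39.7 | 39.4 | 39.2 | 39.9 | 40.9 | 42.4 | 43.0 |
| RIDE | ✗ | 43.1 | 41.8 | 41.6 | 42.0 | 41.0 | 40.3 | 39.6 | 38.7 | 38.2 | 37.0 | 36.9 |
| SADE | ✗ | 46.2 | 44.8 | 42.8 | 42.7 | 41.1 | 40.4 | 40.2 | 40.9 | 41.2 | 41.4 | 41.6 |
| LSC | ✗ | 47.5 | 46.1 | 44.5 | 43.7 | 42.1 | 41.4 | 41.2 | 41.5 | 42.1 | 43.2 | 43.4 |
| BalPoE | ✗ | - | - | - | - | - | - | - | - | - | - | - |
| PRL (ours) | ✓ | 47.9 | 47.0 | 45.3 | 44.4 | 42.8 | 41.9 | 41.7 | 42.1 | 42.6 | 43.7 | 44.1 |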

Figure tab_9: 8
Type: table
Caption: Top-1 accuracy on iNaturalist 2018 with various unknown test class distributions.
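Data:
| Method | Prior | Forward-LT 3 | Forward-LT 2 | Uni. 1 | Backward-LT 2 | Backward-LT 3 |
|---|---|---|---|---|---|---|
| Softmax | ✗ | 65.4 | 65.5 | 64.7 | 64.0 | 63.4 |
| BS | ✗ | 70.3 | 70.5 | 70.6 | 70.6 | 70.8 |
| MiSLAS | ✗ | 70.8 | 70.8 | 70.7 | 70.7 | 70.2 |
| LADE | ✗ | 68.4 | 69.0 | 69.3 | 69.6 | 69.5 |
| LADE | ✓ | - | 69.1 | 69.3 | 70.2 | - |
| RIDE | ✗ | 71.5 | 71.9 | 71.8 | 71.9 | 71.8 |
| SADE | ✗ | 72.3 | 72.6 | 72.7 | 73.0 | 73.2 |
| LSC | ✓ | - | - | - | - | - |
| BalPoE | ✗ | 73.1 | 73.5 | 73.8 | 73.6 | 73.5 |
| PRL (ours) | ✓ | 73.7 | 73.8 | 74.3 | 74.0 | 73.9 |

G Complexity Analysis
Although the introduction of hypernetworks increases the total number of parameters in the model, as shown in the table below, in most cases this does not lead to a significant increase in computational complexity.

| Model | Hypernetwork | Params (MB) | GFLOPs |
|---|---|---|---|
| ResNet-32 | ✗ | 0.8 | 13.1 |
| ResNet-32 | ✓ | 2.6 | 13.1 |
| ResNeXt-50 | ✗ | 38.2 | 391.8 |
| ResNeXt-50 | ✓ | 632.7 | 393.5 |
| ResNet-50 | ✗ | 39.1 | 2982.0 |
| ResNet-50 | ✓ | 159.5 | 12982.0 |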


Formulas:
Formula formula_0: R_{test} (f ) = R_m (f ) + \sum_{i=1}^K (\pi_{test_i} - \pi_{m_i}) • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] (1)

Formula formula_1: δ(E_i , E_j ) = \frac{1}{2} \sum_{k=1}^K |\pi_{i_k} - \pi_{j_k}|
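To make the total variation distance between two environments concrete, the following minimal sketch evaluates δ for two small class-prior vectors. The function name `tvd` and the three-class priors are illustrative assumptions; they do not come from the paper or its released code.

```python
def tvd(prior_a, prior_b):
    """Total variation distance between two class-prior vectors:
    half the sum of absolute per-class differences."""
    if len(prior_a) != len(prior_b):
        raise ValueError("prior vectors must cover the same classes")
    return 0.5 * sum(abs(a - b) for a, b in zip(prior_a, prior_b))


# Hypothetical 3-class priors: a head-heavy training environment
# and a uniform test environment.
train_prior = [0.7, 0.2, 0.1]
uniform_prior = [1 / 3, 1 / 3, 1 / 3]

delta = tvd(train_prior, uniform_prior)
print(f"delta(E_train, E_uniform) = {delta:.4f}")
```

The distance is 0 for identical priors and at most 1 for priors with disjoint support, so it can be read as the fraction of probability mass that has to move between classes.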

Formula formula_2: R_{test} (f ) ≤ R_m (f ) + 2M • (δ(E_m , E_{test}) + ∆(E_1 , \ldots , E_M )) (2)
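The right-hand side of the bound in Eq. (2) can be evaluated numerically once the class priors are known. The sketch below is only illustrative: it assumes that the ETVD term ∆ is the largest pairwise TVD among the training environments, and the names `erm_risk_bound`, `max_loss`, and the example priors are hypothetical.

```python
from itertools import combinations


def tvd(p, q):
    """Total variation distance between two class-prior vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def erm_risk_bound(train_risk, max_loss, train_priors, m, test_prior):
    """Evaluate R_m(f) + 2M * (delta(E_m, E_test) + Delta) for environment m.

    Assumption: Delta is taken as the maximum pairwise TVD among the
    training environments. max_loss plays the role of the loss bound M.
    """
    delta_test = tvd(train_priors[m], test_prior)
    etvd = max(
        (tvd(p, q) for p, q in combinations(train_priors, 2)),
        default=0.0,
    )
    return train_risk + 2.0 * max_loss * (delta_test + etvd)


# Two hypothetical training environments (head-heavy and tail-heavy).
train_priors = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
test_prior = [0.4, 0.2, 0.4]

bound = erm_risk_bound(train_risk=0.5, max_loss=1.0,
                       train_priors=train_priors, m=0,
                       test_prior=test_prior)
print(f"upper bound on test risk: {bound:.2f}")
```

With a loss bounded by 1 the bound quickly exceeds 1 and becomes vacuous, which illustrates why the discrepancy terms, rather than the training risk, dominate under strong prior shift.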

Formula formula_3: R_m ( f ) = \frac{1}{N} \sum_{i=1}^N R_m (f_i) (3)

Formula formula_4: R_{test} ( f ) ≤ \frac{1}{N} \sum_{m=1}^N R_m (f_m) + 2M • \frac{1}{N} \sum_{m=1}^N δ(E_m , E_{test}) + \frac{N - 1}{N} ∆(E_1 , \ldots , E_N) (4)

Formula formula_5: M = \max_{i,x} ℓ( f (x), i).

Formula formula_6: ∆_M := \{\alpha ∈ R^M_+ | \sum_{i=1}^M \alpha_i = 1\} (5)

Formula formula_7: P_α (x, y) := \sum_{k=1}^K \alpha_k • P_k (x | y) • P_k (y) (6)

Formula formula_8: \min_F \{R_{Pα_1} (F), \ldots , R_{Pα_M} (F)\} (7)

Formula formula_9: R_{Pα} (F) := E_{(x,y)∼Pα} [\frac{1}{M} \sum_{i=1}^M ℓ(f^{(i)} (x), y)]

Formula formula_10: f_i (x) = g_{w_i} (ϕ_θ (x)), \text{ for } i = 1, \ldots, T (8)

Formula formula_11: w_i = h_ψ (z_i), \text{ where } z_i \sim \text{Dir}(\alpha), \text{ for } i = 1, \ldots, T (9)
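Eqs. (8) and (9) can be read as a two-step recipe: sample a preference vector from a Dirichlet distribution, then let a hypernetwork map it to the weights of an expert classifier head. The sketch below illustrates this with a deliberately simple linear hypernetwork in pure Python. The linear form of h_ψ, the dimensions, and all function names are assumptions made for illustration; the paper's actual hypernetwork architecture is not specified in this section.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw z ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def make_linear_hypernet(pref_dim, feat_dim, num_classes, rng):
    """Return h(z): an assumed linear map from a preference vector to a
    (num_classes x feat_dim) classifier weight matrix."""
    n_out = num_classes * feat_dim
    psi = [[rng.gauss(0.0, 0.1) for _ in range(pref_dim)] for _ in range(n_out)]

    def hypernet(z):
        flat = [sum(p * zj for p, zj in zip(row, z)) for row in psi]
        return [flat[c * feat_dim:(c + 1) * feat_dim] for c in range(num_classes)]

    return hypernet


def expert_logits(weights, features):
    """g_w applied to shared features: one logit per class."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


rng = random.Random(0)
T, pref_dim, feat_dim, num_classes = 3, 3, 4, 5
hypernet = make_linear_hypernet(pref_dim, feat_dim, num_classes, rng)

features = [0.5, -1.0, 0.25, 2.0]  # stands in for phi_theta(x)
preferences = [sample_dirichlet([1.0] * pref_dim, rng) for _ in range(T)]
experts = [hypernet(z) for z in preferences]        # w_i = h_psi(z_i)
logits = [expert_logits(w, features) for w in experts]  # f_i(x)
```

Because every expert shares the same hypernetwork parameters, the T heads differ only through their preference vectors, which is what allows a new preference to be plugged in at test time without retraining.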

Formula formula_12: L = \sum_{i=1}^T L_i (f_i) (10)

Formula formula_13: \min_Θ \sum_{i=1}^T L_i (Θ, D) (11)

Formula formula_14: \min_Θ \max_{p∈∆_T} \sum_{i=1}^T p_i L_i (Θ, D) (12)

Formula formula_15: ∆_T := \{p ∈ R^T_+ | \sum_{i=1}^T p_i = 1\} \text{ is the T-dimensional simplex.}

Formula formula_16: \min_Θ [\sum_{i=1}^T L_i (Θ, D) + λ • \log (\sum_{i=1}^T \exp(\frac{1}{λ} L_i (Θ, D)))] (13)
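The log-sum-exp term in Eq. (13) acts as a smooth surrogate for the inner maximum of Eq. (12): as λ shrinks, λ · log Σ exp(L_i / λ) approaches max_i L_i from above. The following sketch evaluates the objective as printed in Eq. (13) for fixed loss values; the function names and the example losses are assumptions for illustration only.

```python
import math


def smooth_max(losses, lam):
    """lam * log(sum_i exp(L_i / lam)), computed stably by shifting by the max."""
    m = max(losses)
    return m + lam * math.log(sum(math.exp((l - m) / lam) for l in losses))


def smoothed_objective(losses, lam):
    """Objective of Eq. (13) as printed: sum of expert losses plus the
    log-sum-exp smoothing of their maximum."""
    return sum(losses) + smooth_max(losses, lam)


# Hypothetical per-expert losses.
losses = [0.2, 1.5, 0.7]
for lam in (1.0, 0.1, 0.01):
    print(lam, round(smooth_max(losses, lam), 4), round(smoothed_objective(losses, lam), 4))
```

The smooth maximum always lies between max_i L_i and max_i L_i + λ · log T, so λ controls how closely the differentiable objective tracks the worst-performing expert.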

Formula formula_17: \alpha^* = (\alpha^*_1 , \alpha^*_2 , \alpha^*_3 )^⊤ ∈ ∆_3

Formula formula_18: r' = r ⊙ \alpha^* / (r^⊤ \alpha^*) (14)
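Eq. (14) rescales a user preference r element-wise by α* and renormalizes by the inner product, so the adjusted preference r' again sums to one. A minimal sketch follows; the function name and the numeric values of r and α* are hypothetical.

```python
def adjust_preference(r, alpha_star):
    """Compute r' = (r ⊙ alpha*) / (r^T alpha*)."""
    if len(r) != len(alpha_star):
        raise ValueError("r and alpha_star must have the same length")
    inner = sum(ri * ai for ri, ai in zip(r, alpha_star))
    if inner <= 0.0:
        raise ValueError("r^T alpha* must be positive")
    return [ri * ai / inner for ri, ai in zip(r, alpha_star)]


# Hypothetical user preference and reference point on the 3-simplex.
r = [1.0, 2.0, 1.0]
alpha_star = [0.5, 0.3, 0.2]

r_prime = adjust_preference(r, alpha_star)
print([round(v, 4) for v in r_prime])
```

A uniform r leaves α* unchanged, so r can be interpreted as a multiplicative tilt of the reference preference toward the components the user weights more heavily.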

Formula formula_20: \hat{W}_i = h_ψ (r'), \text{ for } i = 1, \ldots, T (15)

Formula formula_21: \hat{y}_i = \begin{cases} x^⊤ / \|x\|_2 • \hat{W}_i / \| \hat{W}_i \|_F , & \text{if normalized feature} \\ x^⊤ \hat{W}_i + b^⊤_i , & \text{otherwise} \end{cases} (16)
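The two branches of Eq. (16) differ only in whether the feature vector and the generated weight matrix are normalized before the product. The sketch below implements both branches in pure Python, assuming a weight matrix of shape (d, C) stored as a list of rows; the function name and the toy inputs are illustrative assumptions.

```python
import math


def predict(x, W, b=None, normalized=False):
    """Compute expert outputs following Eq. (16).

    x: feature vector of length d.
    W: weight matrix of shape (d, C), given as d rows of C entries.
    normalized=True uses x / ||x||_2 and W / ||W||_F; otherwise x^T W + b.
    """
    d, num_classes = len(W), len(W[0])
    raw = [sum(x[i] * W[i][c] for i in range(d)) for c in range(num_classes)]
    if normalized:
        x_norm = math.sqrt(sum(v * v for v in x))
        w_norm = math.sqrt(sum(v * v for row in W for v in row))  # Frobenius norm
        return [v / (x_norm * w_norm) for v in raw]
    bias = b if b is not None else [0.0] * num_classes
    return [v + bc for v, bc in zip(raw, bias)]


# Toy inputs: 2-dimensional feature, 2 classes.
x = [3.0, 4.0]
W = [[1.0, 0.0], [0.0, 1.0]]

cosine_like = predict(x, W, normalized=True)
linear = predict(x, W, b=[0.5, -0.5])
print(cosine_like, linear)
```

In the normalized branch the outputs are invariant to the scale of both the features and the generated weights, which removes one source of head-class bias that comes from larger weight norms.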

Formula formula_22: \min_{x∈X} \{f_1 (x), \ldots , f_m (x)\} (17)

Formula formula_23: \min_{x∈X} \max_{1≤i≤m} w_i (f_i (x) - z^*_i) (18)
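The Tchebycheff scalarization in Eq. (18) turns a vector of objectives into a single score by taking the largest weighted deviation from the ideal point z*. The sketch below applies it to three hypothetical candidates with two objectives (for instance, head-class and tail-class loss) and shows how changing the weights changes the selected candidate. All names and numbers are illustrative assumptions.

```python
def tchebycheff(objectives, weights, ideal):
    """max_i w_i * (f_i - z*_i) for one candidate solution."""
    return max(w * (f - z) for f, w, z in zip(objectives, weights, ideal))


def select(candidates, weights, ideal):
    """Return the name of the candidate with the smallest scalarized value."""
    return min(candidates, key=lambda name: tchebycheff(candidates[name], weights, ideal))


# Hypothetical (head loss, tail loss) pairs for three models.
candidates = {
    "head_favoring": [0.2, 0.9],
    "balanced": [0.5, 0.5],
    "tail_favoring": [0.9, 0.2],
}
ideal = [0.0, 0.0]

pick_head = select(candidates, [0.8, 0.2], ideal)
pick_balanced = select(candidates, [0.5, 0.5], ideal)
pick_tail = select(candidates, [0.2, 0.8], ideal)
print(pick_head, pick_balanced, pick_tail)
```

Placing a larger weight on an objective penalizes its deviation more strongly, so the minimizer moves toward solutions that do well on that objective; this is the mechanism that makes the trade-off controllable through the weight vector.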

Formula formula_24: \min_θ \{L_1 (f_θ ), \ldots , L_m (f_θ )\} (19)

Formula formula_25: \min_ϕ E_{w∼p(w)} [\max_{1≤i≤m} w_i (L_i (f_{h_ϕ (z)}) - z^*_i)] (20)

Formula formula_27: R_{test} (f ) = E_{(x,y)∼Ptest(x,y)} [ℓ(f (x; θ), y)] (21)

Formula formula_28: R_{test} (f ) = \sum_{i=1}^K \pi_{test_i} • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] (22)

Formula formula_29: R_m (f ) = \sum_{i=1}^K \pi_{m_i} • E_{x∼Pm(x|y=i)} [ℓ(f (x; θ), i)] (23)

Formula formula_30: R_{test} (f ) = \sum_{i=1}^K \pi_{test_i} • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] (24)

Formula formula_31: = \sum_{i=1}^K \pi_{m_i} • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] + \sum_{i=1}^K (\pi_{test_i} -\pi_{m_i}) • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] = R_m (f ) + \sum_{i=1}^K (\pi_{test_i} -\pi_{m_i}) • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)]

Formula formula_32: R_{test} (f ) = R_m (f ) + \sum_{i=1}^K (\pi_{test_i} -\pi_{m_i}) • E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] (25)
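The decomposition in Eq. (25) can be checked numerically. The sketch below assumes, as the derivation does, that the per-class expected loss is the same in the training and test environments, so that only the class priors differ. The per-class losses and priors are made-up numbers, and the function names are hypothetical.

```python
def risk(prior, per_class_loss):
    """Risk as a prior-weighted sum of per-class expected losses."""
    return sum(p * e for p, e in zip(prior, per_class_loss))


def prior_shift_term(test_prior, train_prior, per_class_loss):
    """Second term of Eq. (25): sum_i (pi_test_i - pi_m_i) * E[loss | y = i]."""
    return sum((pt - pm) * e
               for pt, pm, e in zip(test_prior, train_prior, per_class_loss))


# Hypothetical per-class expected losses: low on the head class, high on the tail class.
per_class_loss = [0.1, 0.4, 0.9]
train_prior = [0.7, 0.2, 0.1]         # long-tailed training environment
test_prior = [1 / 3, 1 / 3, 1 / 3]    # uniform test environment

r_train = risk(train_prior, per_class_loss)
r_test = risk(test_prior, per_class_loss)
shift = prior_shift_term(test_prior, train_prior, per_class_loss)
print(round(r_train, 4), round(shift, 4), round(r_test, 4))
```

In this example the shift term is positive because probability mass moves from the well-fit head class to the poorly fit tail class, which is exactly the situation in which a model trained on long-tailed data degrades on a uniform test set.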

Formula formula_33: E_{x∼Ptest(x|y=i)} [ℓ(f (x; θ), i)] ≤ M (26). Therefore, R_{test} (f ) ≤ R_m (f ) + \sum_{i=1}^K |\pi_{test_i} - \pi_{m_i}| • M = R_m (f ) + M • \sum_{i=1}^K |\pi_{test_i} - \pi_{m_i}| = R_m (f ) + 2M • δ(E_m , E_{test})

Formula formula_34: where δ(E_m , E_{test}) = \frac{1}{2} \sum_{i=1}^K |\pi_{m_i} - \pi_{test_i}|. Next, we use the triangle inequality to bound δ(E_m , E_{test}): δ(E_m , E_{test}) ≤ δ(E_m , E_j ) + δ(E_j , E_{test}) ≤ \max_{i,j∈\{1,...,M \}} δ(E_i , E_j ) + δ(E_j , E_{test}) ≤ ∆(E_1 , \ldots , E_M ) + δ(E_j , E_{test})

Formula formula_35: δ(E_m , E_{test}) ≤ ∆(E_1 , \ldots , E_M ) + (1/M) * \sum_{j=1}^M δ(E_j , E_{test}) (27)

Formula formula_36: R_{test} (f ) ≤ R_m (f ) + 2M • (∆(E_1 , \ldots , E_M ) + (1/M) * \sum_{j=1}^M δ(E_j , E_{test})) (28)

Formula formula_37: R_{test} ( f ) = E_{(x,y)∼Ptest} [ℓ( f (x), y)] = (1/N) * \sum_{i=1}^N E_{(x,y)∼Ptest} [ℓ(f_i (x), y)] = (1/N) * \sum_{i=1}^N R_{test} (f_i)

Formula formula_38: R_{test} ( f ) ≤ (1/N) * \sum_{i=1}^N [R_{m(i)} (f_i) + 2M • (δ(E_{m(i)} , E_{test}) + ∆(E_1 , \ldots , E_N))] = (1/N) * \sum_{i=1}^N R_{m(i)} (f_i) + (2M/N) * \sum_{i=1}^N δ(E_{m(i)} , E_{test}) + 2M * ∆(E_1 , \ldots , E_N)

Formula formula_39: (1/N) * \sum_{i=1}^N δ(E_{m(i)} , E_{test}) = (1/N) * \sum_{m=1}^N (\sum_{i:m(i)=m} 1) * δ(E_m , E_{test}) ≤ (1/N) * \sum_{m=1}^N N_m • δ(E_m , E_{test})

Formula formula_40: R_{test} ( f ) ≤ (1/N) * \sum_{m=1}^N R_m (f_m) + 2M • (1/N) * \sum_{m=1}^N δ(E_m , E_{test}) + ((N-1)/N) * ∆(E_1 , \ldots , E_N)

Formula formula_41: ϕ_k = H_θ (z_k) \text{ \{Generate expert weights ϕ_k from hypernetwork\}}

Formula formula_42: E_k = G_{ϕ_k} (f) \text{ \{Obtain expert predictions using ϕ_k, where G is the classifier head\}}

Formula formula_43: L_{total} = \sum_k L_k + λ * L_{div}
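The loss assembly in the last step of the algorithm, L_total = Σ_k L_k + λ · L_div, can be sketched for a single training sample as follows. This is not the authors' implementation: the per-expert loss is assumed to be cross-entropy, and the diversity loss is assumed to be the negative mean pairwise squared distance between expert probability vectors, since this section does not define L_div. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Assumed per-expert loss L_k."""
    return -math.log(max(probs[label], 1e-12))


def diversity_loss(expert_probs):
    """Assumed L_div: negative mean pairwise squared distance between
    expert outputs, so minimizing it pushes experts apart."""
    k = len(expert_probs)
    dists = [
        sum((a - b) ** 2 for a, b in zip(expert_probs[i], expert_probs[j]))
        for i in range(k) for j in range(i + 1, k)
    ]
    return -sum(dists) / len(dists) if dists else 0.0


def total_loss(expert_logits, label, lam):
    """L_total = sum_k L_k + lam * L_div for one sample."""
    probs = [softmax(l) for l in expert_logits]
    expert_losses = [cross_entropy(p, label) for p in probs]
    l_div = diversity_loss(probs)
    return sum(expert_losses) + lam * l_div, expert_losses, l_div


# Two hypothetical experts scoring one sample with true label 0.
expert_logits = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]
l_total, l_experts, l_div = total_loss(expert_logits, label=0, lam=0.1)
print(round(l_total, 4), [round(v, 4) for v in l_experts], round(l_div, 4))
```

In a full training loop these values would be computed on batches with an automatic-differentiation framework and backpropagated into the hypernetwork and the shared feature extractor; the sketch only shows how the terms combine.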

Formula formula_44: O(D) + O(M × N ) = O(M × N ).
