[
  [
    {
      "role": "user",
      "content": "We have a question or proposition, and we would like to know your belief in it being resolved as \"ACCEPTED\" (as opposed to \"REJECTED\"), represented by a floating-point probability in [0,1]. As an independent observer, you will be shown a list of consecutive reasoning steps in JSON format, each step with a belief field that's currently set to None. You are to complete the reasoning steps by filling your beliefs into the belief fields while keeping all other fields unchanged.\n\nThe proposition or question: You are an area chair of the venue ICLR_cc_2022_Conference. You are given the following information about a submission in your cohort.\n{\"title\": \"Is Importance Weighting Incompatible with Interpolating Classifiers?\", \"abstract\": \"Importance weighting is a classic technique to handle distribution shifts. However, prior work has presented strong empirical and theoretical evidence demonstrating that importance weights can have little to no effect on overparameterized neural networks. \\\\emph{Is importance weighting truly incompatible with the training of overparameterized neural networks?} Our paper answers this in the negative. We show that importance weighting fails not because of the overparameterization, but instead, as a result of using exponentially-tailed losses like the logistic or cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in overparameterized models. We characterize the behavior of gradient descent on importance weighted polynomially-tailed losses with overparameterized linear models, and theoretically demonstrate the advantage of using polynomially-tailed losses in a label shift setting. Surprisingly, our theory shows that using weights that are obtained by exponentiating the classical unbiased importance weights can improve performance. Finally, we demonstrate the practical value of our analysis with neural network experiments on a subpopulation shift and a label shift dataset. When reweighted, our loss function can outperform reweighted cross-entropy by as much as 9\\\\% in test accuracy. Our loss function also gives test accuracies comparable to, or even exceeding, well-tuned state-of-the-art methods for correcting distribution shifts.\", \"authors\": [{\"affiliation\": {\"institution\": \"Stanford University\", \"laboratory\": \"\", \"location\": {\"addrLine\": null, \"country\": null, \"postBox\": null, \"postCode\": null, \"region\": null, \"settlement\": null}}, \"name\": \"ke alexander wang\"}, {\"affiliation\": {\"institution\": \"Stanford University\", \"laboratory\": \"\", \"location\": {\"addrLine\": null, \"country\": null, \"postBox\": null, \"postCode\": null, \"region\": null, \"settlement\": null}}, \"name\": \"niladri s chatterji\"}, {\"affiliation\": {\"institution\": \"Stanford University\", \"laboratory\": \"\", \"location\": {\"addrLine\": null, \"country\": null, \"postBox\": null, \"postCode\": null, \"region\": null, \"settlement\": null}}, \"name\": \"saminul  haque\"}, {\"affiliation\": {\"institution\": \"Stanford University\", \"laboratory\": \"\", \"location\": {\"addrLine\": null, \"country\": null, \"postBox\": null, \"postCode\": null, \"region\": null, \"settlement\": null}}, \"name\": \"tatsunori  hashimoto\"}], \"summary\": null, \"field_of_study\": [\"Computer Science\", \"Mathematics\"], \"venue\": \"ICLR_cc_2022_Conference\", \"publication_date\": \"2021-12-24\", \"n_references\": 23, \"n_citations\": 20, \"introduction\": null, \"background\": null, \"methodology\": null, \"experiments_results\": null, \"conclusion\": null, \"full_text\": null, \"reviews\": [{\"clarity\": null, \"confidence\": 0.5, \"correctness\": 0.6666666666666666, \"ethics\": null, \"impact\": null, \"novelty\": 0.75, \"reproducibility\": null, \"review\": {\"limitations\": null, \"main_review\": \"main_review: [Strengths]\\n- Clear presentation. Figure 1 gives an intuitive example illustrating why importance weighting could fail on exponentially-tailed losses.\\n- The proposed polynomially-tailed loss could be a very practical tool for importance weighting.\\n- Theoretical results look solid and well justify the polynomially-tailed loss.\\n\\n[Weaknesses]\\n- The theoretical results are based on the assumption of sub-Gaussian data which could be unrealistic.  \\n- The polynomiallhy-tailed loss does not seem to admit a probabilistic interpretation as the distance between two distributions. How would it perform on a softmax layer?\\n- The proposed loss has a parameter alpha. However, the impact of the parameter is not discussed in either the theoretical or empirical section. Does the choice of alpha significantly affect the result? \\n\\n\", \"paper_summary\": \"paper_summary: In this paper, the author investigates whether importance weighting is incompatible with the training of overparameterized neural networks. In contrast to the recent observation that  importance weighting is ineffective in current deep learning paradigm (Byrd & Lipton 2019), the authors show that it could actually be helpful using polynomially-tailed losses. Both theoretical justifications and empirical evidences are provided to support the claim.\", \"questions\": null, \"review_summary\": \"review_summary: Overall, I think the paper provides answers to an important question. Both theoretical and empirical results are provided to corroborate the claim.\", \"strength_weakness\": null, \"title\": null}, \"review_id\": \"rKZXby1L_r\", \"score\": 0.8}, {\"clarity\": null, \"confidence\": 0.75, \"correctness\": 1.0, \"ethics\": null, \"impact\": null, \"novelty\": 0.75, \"reproducibility\": null, \"review\": {\"limitations\": null, \"main_review\": \"main_review: #### Strengths\\n\\n1. This paper points out a potential way to make training overparameterized neural nets compatible with importance sampling, which is to change the loss from cross-entropy to a polynomially-tailed loss.\\n2. A full characterization for the implicit bias of GD on polynomially-tailed loss is given by (2) in Section 4.1 (although it is an easy corollary of Ji et al. 2020). It is clear from (2) that the implicit bias changes with the importance weights.\\n3. The separation between polynomially-tailed loss and exponentially-tailed loss in Section 4.2 is simple and interesting.\\n\\n#### Weaknesses\\n\\n1. The theoretical analysis is only for linear classification. As for the generalization aspect, the analysis is even more restricted as it only holds for the dataset with two sub-Gaussian clusters. So the theoretical result is not very strong. Although we can certainly draw some insights from this simple setting, it is still unclear what generalization guarantees polynomially-tailed loss has in general.\\n2. The paper claims that the implicit bias of GD changes with importance weights, but such change could be minor when the degree of the polynomial is high, i.e., $\\\\alpha$ is large. In the equation (2) in Section 4.2, if $\\\\alpha \\\\to \\\\infty$, then $\\\\hat{\\\\theta}_{\\\\alpha}$ is just the max-margin solution. Thus the choice of $\\\\alpha$ cannot be arbitrary, and a polynomially-tailed loss with a high degree can perform as badly as the exponential loss. In the later sections, the authors always use $\\\\alpha = 1$. It would be nice if the authors can carefully discuss more the choice of $\\\\alpha$ and provide a recommendation for the choice of $\\\\alpha$. If $\\\\alpha = 1$ is the best in the theoretical setting in Section 4.2 and in the experiments, then the paper should highlight linearly-tailed loss rather than polynomially-tailed loss.\\n3. The experiment results are not compared against many baselines. It is unclear whether the 2%-3% test accuracy difference can be considered to be a big gain, and whether other methods (e.g., undersampling) can obtain the same gain in terms of the test accuracy. Although this paper is mainly a theoretical paper, the authors do claim that polynomially-tailed loss has practical value, and the current experiment results are not strong enough to support this claim.\\n\\n#### Minor Comments\\n\\n* Page 20 looks wried.\\n* The current proof for Theorem 4.3 is trajectory-based, i.e., it analyzes the trajectory. However, Soudry et al. (2018) have already proved that the convergent classifier is the max-margin classifier, and one can easily show that the max-margin classifier is unique. So I'm wondering whether Theorem 4.3 can be proved by just analyzing the properties of the max-margin classifier, without analyzing the trajectory. It is not a weakness of this paper, but if this paper can provide an argument directly for the max-margin classifier, then the readers will better understand why the max-margin classifier is bad in this setting. After skimming the proof, I think it might be possible to translate Lemma C.3 into a lemma on the upper bounds for the ratio between dual variables (instead of loss values), and it sounds promising to use this upper bound to bound the influence of the samples from the minority class, and hence the max-margin classifier generalizes poorly.\\n* Section 4.1 should clarify that the optimal solution for (2) is unique. Although it is proved in the appendix that the solution is unique, but clarifying this in the main body can avoid confusion.\", \"paper_summary\": \"paper_summary: Previous works pointed out an interesting incompatibility between importance sampling and training overparameterized neural nets. In the view of margin, Soudry et al. (2018) showed that GD on exponentially-tailed loss converges to the max-margin classifier, and Xu et al. (2020) pointed out that this margin maximization is unaffected by importance sampling. This paper then turns to studying polynomially-tailed loss.\\n\\n* This paper establishes the implicit bias by formulating an optimization problem, which can change according to the importance weights. The convergent classifier is shown to generalize better than the exponentially-tailed loss in a simple linear classification setting, where the dataset contains two sub-Gaussian clusters.\\n* Experiments on Imbalanced Binary CIFAR10 and CelebA show that the polynomially-tailed loss performs better on imbalanced datasets.\", \"questions\": null, \"review_summary\": \"review_summary: This paper provides interesting theoretical results for using polynomially-tailed loss to resolve the incompatibility between importance sampling and training overparameterized neural nets. However, the theoretical setting is quite restrictive. I think this paper is on the borderline, but it is ok to accept this paper.\\n\\n------------------\\nPost-rebuttal update:\\n\\nI am glad to see that the authors have conducted more experiments to compare their method with baselines. I'm not an expert in the empirical methods for dealing with distribution shift, but I think the current empirical results they have are great additions to their theoretical analysis. I have increased my score for the empirical novelty and significance accordingly.\\nAs for the theoretical analysis, which is the main selling point of this paper, I still believe the current setting is restrictive, although I know some previous works are studying it. Therefore, I would like to continue to vote for weak acceptance.\", \"strength_weakness\": null, \"title\": null}, \"review_id\": \"Vsir-xRW3Kj\", \"score\": 0.6000000000000001}, {\"clarity\": null, \"confidence\": 0.75, \"correctness\": 1.0, \"ethics\": null, \"impact\": null, \"novelty\": 0.5, \"reproducibility\": null, \"review\": {\"limitations\": null, \"main_review\": \"main_review: Strengths:\\n- The paper is well-written and enjoyable to read\\n- The theoretical claims are sound and well presented\\n- Theorem 4.2 is interesting: It extends the analysis by Chatterji & Long to imbalanced setting and the finding about the scaling of the weight being cubic in \\\\tau is interesting.\\n\\nWeaknesses:\\n- The study is limited to binary classification. It is not clear how to extend the loss to multiclass. At least when it comes to experimental results, multiclass experiments would make a stronger case. \\n- Experimental results are limited to a comparison between weighted cross entropy and the new loss. How does the new loss compare to other const-sensitive techniques, e.g. LDAM loss [1], Logit-adjusted loss [2], VS loss [3], DRO (Sagawa et al.)?\\n- More generally, while the presentation is from the viewpoint of distribution shift, all study cases (both theoretical and experimental) are with respect to imbalanced data problems. In this setting there has been a lot of recent work, which appears closely related but is not discussed (eg. [1-4]). Specifically, [3] appears to be very closely related (for example, see their Theorem 1; how does your program in (2) compare to the cost-sensitive svm?). I suggest that the authors try to better place their work within this literature.\\n\\n[1] Cao, Kaidi, et al. \\\"Learning imbalanced datasets with label-distribution-aware margin loss.\\\" arXiv preprint arXiv:1906.07413 (2019).\\n[2] Menon, Aditya Krishna, et al. \\\"Long-tail learning via logit adjustment.\\\" arXiv preprint arXiv:2007.07314 (2020).\\n[3] Kini, Ganesh Ramachandra, et al. \\\"Label-Imbalanced and Group-Sensitive Classification under Overparameterization.\\\" arXiv preprint arXiv:2103.01550 (2021).\\n[4] Kang, Bingyi, et al. \\\"Decoupling representation and classifier for long-tailed recognition.\\\" arXiv preprint arXiv:1910.09217 (2019).\", \"paper_summary\": \"paper_summary: The paper shows both theoretically and experimentally that importance weighting is not incompatible  with the training of overparameterized models provided that the training loss is appropriately modified so that it does not have an exponential tail. Specifically, for binary classification the authors propose a new loss function with polynomial tail decay. Theoretically, the new loss is shown to outperform weighted cross entropy for linear models and an imbalanced mixtures of gaussian model. Empirically, the new loss outperforms weighted cross entropy on imbalanced CIFAR 10 and the CelebA dataset. \", \"questions\": null, \"review_summary\": \"review_summary: While I appreciate the author's clarity of presentingtheir ideas/theories/proofs/experiments and the paper's finding is quite interesting, it is perhaps not so surprising especially given the recent works and findings on mitigating the effect of weighted CE failure for imbalanced data. Also, the implications of the finding are not made clear. How does the proposed loss compare to state-of-the-art loss adjustments? Does it apply to multiclass settings? \\n\", \"strength_weakness\": null, \"title\": null}, \"review_id\": \"KsW1C6lYwr\", \"score\": 0.4}, {\"clarity\": null, \"confidence\": 0.75, \"correctness\": 1.0, \"ethics\": null, \"impact\": null, \"novelty\": 0.75, \"reproducibility\": null, \"review\": {\"limitations\": null, \"main_review\": \"main_review: This paper studies the impact of importance-weighting in interpolating classifiers. The authors showed that in contrast to the prior works that showed importance-weighting is ineffective due to implicit bias of converging to max-margin, switching to poly-tailed loss circumvents this issue, both theoretically and empirically.\\n\\nPros:\\nOverall, this paper is very well-written and the theoretical results are sound. I really like the idea of using a poly-tailed loss - it is simple, effective, and practical - but by far not an obvious one. This idea circumvents the previous negative results in an elegant manner. \\n\\nI also like the analysis of poly-tailed loss under Gaussian mixture setting: there are some surprising takeaways, e..g. inverse-frequency weighting being sub-optimal and the optimal re-weighting being roughly cubic in class imbalance. I have never seen these types of results in the prior works.\\n\\nCons:\\n\\nThere are some obvious limitations, for example, the gap between linear classifiers and neural networks used in practice, which is a minor issue due to the intractability of analysis. \\n\\nThere are some more interesting possible extensions, though: for example, a theoretical analysis of (and comparing against) early stopping would be nice.\\n\\nThere is limited discussion about how to pick the hyperparameter alpha, which can be useful for practitioners.\\n\\nThe performance boost (2%-3%) is relatively minor. I am not sure about how it compares to stronger baselines.\", \"paper_summary\": \"paper_summary: This paper studies the impact of importance-weighting in interpolating classifiers. The authors showed that in contrast to the prior works that showed importance-weighting is ineffective due to implicit bias of converging to max-margin, switching to poly-tailed loss circumvents this issue, both theoretically and empirically.\", \"questions\": null, \"review_summary\": \"review_summary: I like the paper due to its simplicity, effectiveness, and interesting take-aways from theory. There is some room for improvement in the experiments.\", \"strength_weakness\": null, \"title\": null}, \"review_id\": \"18izF6aYe9r\", \"score\": 0.6000000000000001}], \"comments\": [{\"comment\": \"We wanted to reach out once again and ask whether our rebuttal addressed your concerns. It felt to us that your main criticism of the main paper was the lack of detailed empirical evaluation. We have thoroughly addressed this in our rebuttal (see our top-level comment in OpenReview) and we hope that you can raise your score as a result.\", \"title\": \"We hope our rebuttal addressed your concerns\"}, {\"comment\": \"I thank the authors for their responses. I appreciate their effort and time on the additional experimental results. The additional discussion and comparisons to [1-5] are indeed strengthening the paper. I am increasing my score (with my main concern making it borderline the lack of a multiclass version). Overall, I am ok with the paper being accepted once the proposed edits are made.\\n\", \"title\": \"Thank you for the responses\"}, {\"comment\": \"> The theoretical analysis is only for linear classification. As for the generalization aspect, the analysis is even more restricted as it only holds for the dataset with two sub-Gaussian clusters. \\n\\nWe agree that extending these results to non-linear models and richer data distributions would indeed be very interesting. However, even in the absence of any distribution shift the generalization analysis of overparameterized linear classifiers is an active area of research recently: (Chatterji & Long, 2021; Wang & Thrampoulidis, 2021; Liang & Recht, 2021; Cao et al., 2021). We therefore study linear models to perform a fine-grained analysis to garner intuition that transfers over to non-linear models.\\n\\n> It would be nice if the authors can carefully discuss more the choice of alpha\\n\\nWe agree, please see our top level response on choosing $\\\\alpha$.\\n\\n> The experiment results are not compared against many baselines\\n\\nWe have now trained over 450 models to compare to many strong baselines, including undersampling, as you suggested. Our method performs as well or better than prior distribution shift methods, while undersampling remains a very strong baseline on CelebA (results are presented in our top level OpenReview response)\\n\\n> I'm wondering whether Theorem 4.3 can be proved by just analyzing the properties of the max-margin classifier, without analyzing the trajectory\\n\\nThis is an interesting idea, and perhaps it is indeed possible to study the test error of the maximum-margin classifier without analyzing the trajectory. Here, since we do not have a closed-form expression for the maximum-margin classifier, directly analyzing the iterates of gradient descent allowed us to derive tight lower bounds on the test error.\\n\", \"title\": \"Addressing your concerns\"}, {\"comment\": \"We agree that placing our method in the context of prior distribution shift methods would greatly strengthen our paper. We *extensively* tune and compare to all methods suggested by the reviewers: undersampling as reviewer DWSu suggested, and the cost-sensitive methods that reviewer AUFR listed [1-4], which we will also cite. We also compare each loss function with and without DRO [5]. For each method, we fix the seed and grid search thoroughly on the validation set. We then train the best setting, averaging over 10 model initializations. We set the training epochs such that all methods attained 100% train accuracy. Previously, we did not tune $\\\\alpha$, but instead always set it equal to 1. **After tuning all methods, our new experiments show that poly-tailed loss with exponentiated weights gives accuracies comparable to or even exceeding those by recently proposed methods.**\\n\\n## CelebA\\n\\n| Name                                   | Summary of method                                            | Best settings                | Mean test acc     | Std err   | Worst group test acc  | Std err   |\\n|------                                  |-------------------                                           |---------------               |------             |-----      |------                 |-----      |\\n| Cross Entropy                          |                                                              | N/A                          | 79.6              | 1.2       | 43.1                  | 2.9       |\\n| Cross Entropy + Group undersampling    | Balance the groups by throwing away overrepresented data     | N/A                          | 82.3              | 3.2       | **72.1**              | 8.6       |\\n| Cross Entropy + IW                     | Classical importance weighting                               | N/A                          | 79.3              | 0.4       | 43.1                  | 2.9       |\\n| VS Loss [3]                            | Multiply by and add class dependent constants to logits      | $\\\\gamma=0.4$                 | 82.6              | 0.7       | 51.3                  | 1.5       |\\n| Poly-tailed Loss                       | Change loss function and exponentiate importance weights     | $\\\\alpha=2, \\\\rho=2.5$         | 82.4              | 0.8       | 53.3                  | 1.5       |\\n| -----                                  | -----                                                        |-----                         |-----              |-----      |-----                  |-----      |\\n| Cross Entropy + DRO [5]                |                                                              | N/A                          | 84.8              | 0.9       | 60.4                  | 2.4       |\\n| VS Loss + DRO                          | Change loss function and use DRO                             | $\\\\gamma=0.4$                 | 83.2              | 0.9       | 58.2                  | 2.0       |\\n| Poly-tailed Loss + DRO                 | Change loss function and use DRO                             | $\\\\alpha=2$                   | **86.7**          | 1.1       | **66.5**              | 1.9       |\\n\\nOn CelebA, we compare mostly to VS loss since VS loss encapsulates LA loss, CDT loss, and LDAM loss. Poly-tailed loss with IW achieves comparable test accuracy to VSLoss. When we use both loss functions with DRO, poly loss + DRO is better than VSLoss + DRO by 3.5% on test accuracy and 8% better on worst group accuracy.\\nFor all methods, including ones that use DRO, we train all models to interpolation without early stopping or strong regularizations, since we are interested in interpolating classifiers. We emphasize that *we expect poly loss with IW to perform well on average test accuracy*, and that poly-tailed with IW does not target worst group accuracy directly in the training objective. However, in practice poly-tailed loss with IW gives the best worst group accuracy out of all loss functions.\\n\\n### Tuning procedure\\n\\n- Cross entropy + undersampling by group: we use the same hyperparameters as Cross entropy on the full dataset and average results over random undersamplings plus initializations.\\n\\n- VS Loss: We follow the tuning procedure in [3] for group-sensitive VS loss and grid search over $\\\\gamma \\\\in \\\\[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\\\\]$. We found $\\\\gamma=0.4$ to be best.\\n\\n- Poly Loss: We grid search over $(\\\\alpha, \\\\rho) \\\\in \\\\[1, 2, 3\\\\] \\\\times \\\\[1.0, 1.5, 2.0, 2.5, 3.0\\\\]$. We found $\\\\alpha=2, \\\\rho=2.5$ to be best.\\nCross entropy + DRO: We tune the adversarial step size $\\\\eta_q$ over $\\\\[0.0025, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1\\\\]$. We found that $\\\\eta_q=0.05$ to be best.\\n\\n- VS Loss + DRO: We used the best $\\\\gamma$ from VS loss without DRO with $\\\\eta_q=0.05$\\n\\n- Poly Loss + DRO: We used the best $\\\\alpha$ from Poly Loss without DRO with $\\\\eta_q=0.05$\", \"title\": \"Empirical comparisons to other works\"}, {\"comment\": \"[1] Cao, Kaidi, et al. \\\"Learning imbalanced datasets with label-distribution-aware margin loss.\\\" arXiv preprint arXiv:1906.07413 (2019).\\n\\n[2] Menon, Aditya Krishna, et al. \\\"Long-tail learning via logit adjustment.\\\" arXiv preprint arXiv:2007.07314 (2020).\\n\\n[3] Kini, Ganesh Ramachandra, et al. \\\"Label-Imbalanced and Group-Sensitive Classification under Overparameterization.\\\" arXiv preprint arXiv:2103.01550 (2021).\\n\\n[4] Kang, Bingyi, et al. \\\"Decoupling representation and classifier for long-tailed recognition.\\\" arXiv preprint arXiv:1910.09217 (2019).\\n\\n[5] Sagawa, et al. \\u201cDistributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization\\u201d arXiv preprint arXiv:1911.08731 (2019).\\n\", \"title\": \"References mentioned\"}, {\"comment\": \"> There is limited discussion about how to pick the hyperparameter alpha, which can be useful for practitioners.\\n\\nWe agree, please see our top level response on choosing $\\\\alpha$.\\n\\n> The performance boost (2%-3%) is relatively minor. I am not sure about how it compares to stronger baselines.\\n\\nGood point! We have trained over 450 models to compare to a variety of strong baselines, showing that our method performs as well or better than prior distribution shift methods (results are presented in our top level OpenReview response). With proper tuning, our loss function can increase performance by 10% or more compared to cross entropy.\\n\\n\\n> There are some more interesting possible extensions, though: for example, a theoretical analysis of (and comparing against) early stopping would be nice.\\n\\nThe theoretical comparison between early-stopping and poly loss is indeed very interesting and something we have been thinking about. We analyzed a simplified data model where all centered feature vectors have the same norm and are mutually orthogonal, as opposed to the approximate orthogonality obtained in Lemma B.2. We have preliminary results showing that the test error of optimally early-stopping on cross-entropy loss scales as $\\\\exp(-c\\\\lVert \\\\mu \\\\rVert^{4}/d)$, which is the same as the upper bound for test error on poly loss with cubic weights. Under this data model, we can also further refine the upper bound for the importance weighted poly loss classifier to $\\\\exp(-c(\\\\lVert \\\\mu \\\\rVert^2 n) / (d \\\\cdot \\\\tau))$, which is tighter than the current bound in Theorem 4.2. While this analysis is on a much simplified data model, we believe that it is a reasonable approximation of the overparameterized regime with sub-Gaussian data, so we expect to be able to generalize the results. If these claims hold in general, then it will be shown that cubic weighted poly loss outperforms early-stopped cross-entropy when the imbalance $\\\\tau=o(n)$.\", \"title\": \"Addressing your concerns\"}, {\"comment\": \"While the multiplicative logit-adjustment losses (VS loss and CDT loss [3]) do affect the implicit bias of the final learned classifier, they do not demonstrate that reweighted losses can solve dataset imbalance. Even if existing results address this, the theory in our work provides stronger results in several ways:\\n- VS loss theory [3] guarantees balanced test error across groups asymptotically in $n$ and $d$ for some optimal hyperparameter.\\n- This leaves open several important questions about the optimality of distribution shift correction which we address in our work.\\n    - First, can we achieve learning at the information theoretic limit of $\\\\lVert \\\\mu \\\\rVert^2 / d \\\\to \\\\infty$? Our work shows poly loss attains this (Thm 4.2).\\n    - Second, even if group errors are balanced, the errors for the groups may still be high. Our work shows poly loss attains low per-group error, not just balanced error and that this exceeds the performance of max-margin predictors (Thms 4.2, 4.3)\\n    - Finally, the result for VS loss relies on the existence of some, unknown hyperparameters which may vary greatly across different problem instances (as shown in our CIFAR-10 experiments), we show that for poly-loss importance weighting in the linear setting, a single fixed universal hyperparameter can perform well across instances ($w=\\\\tau^3$, Thm 4.2).\\n\", \"title\": \"Theoretical comparisons to other works\"}, {\"comment\": \"> The theoretical results are based on the assumption of sub-Gaussian data which could be unrealistic.\\n\\nWe agree that extending our results to richer data distributions would indeed be very interesting. However, even in the absence of any distribution shift the generalization analysis of overparameterized linear classifiers in this sub-Gaussian cluster setting is an active area of research recently: (Chatterji & Long, 2021; Wang & Thrampoulidis, 2021; Liang & Recht, 2021; Cao et al., 2021). We therefore study linear models to perform a fine-grained analysis to garner intuition that transfers over to non-linear models.\\n\\n> The polynomially-tailed loss does not seem to admit a probabilistic interpretation\\u2026\\n\\nWe chose the poly-tailed losses since they admit the right implicit bias when trained with gradient descent. However, one can also easily derive the log-likelihood corresponding to the loss by using the relation in Buet-Golfouse (2021), Section 4.1 (http://proceedings.mlr.press/v139/buet-golfouse21a/buet-golfouse21a.pdf). One can also see that the examples in Section 5.3 of this paper shows that there are other polynomially tailed losses fulfilling our condition that follow from taking a latent variable interpretation with Cauchy or Student-t distributions.\\n\\n> Does the choice of alpha significantly affect the result?\\n\\nThough we did not tune $\\\\alpha$ in our original experiments, we do so in our new experiments. Tuning alpha can indeed improve performance significantly. Please see our top level response on choosing $\\\\alpha$.\\nWe analyze the theory for $\\\\alpha=1$, similar analyses can be done for other values of $\\\\alpha$. \", \"title\": \"Addressing your concerns\"}, {\"comment\": \"## Imbalanced Binary CIFAR10\\n| Name                                                       | Summary of method                                        | Best settings          | Mean test acc      | Std err   |\\n|-----                                                       |-------------------                                       |---------------         |------              |------     |\\n| Cross Entropy                                              |                                                          | N/A                    | 60.5               | 0.8       |\\n| Cross Entropy + IW                                         | Classical importance weighting                           | N/A                    | 57.8               | 0.4       |\\n| Cross Entropy + Class undersampling                        | Balance the classes by throwing away majority data       | N/A                    | 58.5               | 1.4       |\\n| LDAM [1]                                                   | Add class dependent constants to logits                  | largest margin=1.0     | 58.7               | 1.6       |\\n| CDT [4]                                                    | Multiply class dependent constants by logits             | $\\\\gamma=0.5$           | 60.2               | 1.6       |\\n| LA Loss [2]                                                | Add class dependent constants to logits                  | $\\\\tau=3$               | 66.9               | 2.1       |\\n| VS Loss (best setting around range suggested by [3])       | Multiply by and add class dependent constants to logits  | $\\\\tau=1, \\\\gamma=0.05$  | **58.6**           | 1.4       |\\n| VS Loss (our own tuning)                                   | Multiply by and add class dependent constants to logits  | $\\\\tau=3, \\\\gamma=0.3$   | **69.3**           | 0.9       |\\n| Poly-tailed Loss                                           | Change loss function and exponentiate importance weights | $\\\\alpha=2,\\\\rho=3$      | 67.3               | 0.8       |\\n\\nOn binary CIFAR10, our poly-tailed loss performs comparably to LA Loss and is only 2% worse than VS loss in the mean test accuracy with overlapping confidence intervals. Note that we grid searched across the published hyperparameter ranges from [3] when we first tuned VS loss, and it performed poorly (58.6% test accuracy using the best hyperparameters $\\\\tau=1, \\\\gamma=0.05$). After a more thorough grid search for VS loss, we found optimal hyperparameters that were similar to those of LA loss, increasing test accuracy by 10%. \\nIn sum, our poly-tailed loss matched or exceeded existing data imbalance adjustment methods (under published hyperparameters) and was beaten by 2% only after more extensive hyperparameter tuning on our part.\\n\\n### Tuning procedure\\n- Cross entropy + undersampling by class: we use the same hyperparameters as Cross entropy on the full dataset and average results over random undersamplings plus initializations\\n- LDAM: Following the original paper [1], we grid search the margin hyperparameter over $\\\\[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\\\\]$.\\n- CDT Loss, LA Loss, VS Loss:  Following [3], we tune $\\\\tau$ and $\\\\gamma$ with $\\\\tau=0$ being CDT loss and $\\\\gamma = 0$ being LA Loss. We grid search across $(\\\\tau, \\\\gamma) \\\\in \\\\[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0\\\\] \\\\times \\\\[0.0, 0.1, 0.2, 0.3, 0.4, 0.5\\\\]$. Unlike the original paper [3], we found that the best hyperparameters of VS loss are similar to those of LA Loss. We initially followed the recommendations of [3] and grid searched over $(\\\\tau, \\\\gamma) \\\\in \\\\[0.5, 0.75, 1.0, 1.25, 1.5\\\\] \\\\times \\\\[0.05, 0.1, 0.15, 0.2\\\\]$, but it gave VS loss poor performance (mean test acc = 58.6).\\n- Poly-tailed Loss: We grid search over $(\\\\alpha, \\\\rho) \\\\in \\\\[0.25, 0.5, 1.0, 2.0, 4.0\\\\] \\\\times \\\\[1.0, 1.5, 2.0, 2.5, 3.0\\\\]$.\\n\", \"title\": \"Empirical comparisons to other works (cont'd)\"}, {\"comment\": \"- Though we analyze the theory for $\\\\alpha=1$, similar analyses can be done for other values of $\\\\alpha$. \\n- We did not empirically tune $\\\\alpha$ in our original experiments, using $\\\\alpha=1$ always. However, in the new results above, we tune over $\\\\alpha$ and the weight exponent to match the fact that LA/CDT/VS baselines tune hyperparameters. Our tuning procedures described above can provide some additional guidelines on reasonable ranges for $\\\\alpha$ empirically. We found training to be less stable for $\\\\alpha < 1$.\\n\", \"title\": \"Choice for $\\\\alpha$\"}, {\"comment\": \"We thank all reviewers for their feedback, which has helped us improve our paper. We emphasize that our paper seeks to *understand* the failure of importance weighting for interpolating models, as observed by prior work, and provide a remedy for the procedure. Our goal was not to claim our method exceeds state of the art on imbalance data correction. However, at the reviewers' suggestion, we have performed comparisons to existing SoTA methods, and we show that our method (as well as simple baselines such as undersampling, early stopping) generally match the performance of these methods (see tables below).\\n\\nThere are many imbalanced data correction methods that modify the logits (e.g. [1-4]), but they do not address our movitivating scientific question: \\n>*Can weights on training losses viably correct distribution shifts for interpolating networks?*\\n\\nWe answer this question affirmatively with clear theoretical evidence and competitive empirical results.\\n\", \"title\": \"Summary comment\"}, {\"comment\": \"> Experimental results are limited to a comparison between weighted cross entropy and the new loss. How does the new loss compare to other const-sensitive techniques, e.g. LDAM loss [1], Logit-adjusted loss [2], VS loss [3], DRO (Sagawa et al.)?\\n\\nThank you for pointing us to [3], which we missed due to its recency. We will be sure to include comparisons and citations to  [1-3] in the revised draft. We did not compare to the other references originally since we were focused on showing our method fixes importance weighting for interpolating models. We have trained over 450 models to compare to all your suggested baselines, showing that our method performs comparably to or sometimes even exceeding recent distribution shift methods (results are presented in our top level OpenReview response).\\n\\n> \\u2026 while the presentation is from the viewpoint of distribution shift, all study cases (both theoretical and experimental) are with respect to imbalanced data problems\\n\\n- In our experiments we consider two types of distribution shift: label shift (CIFAR10) and sub-population shift dataset (CelebA).\\n- We believe that our theoretical analysis of generalization (Theorem 4.2 and 4.3) can potentially be extended to analyze the spurious correlations setting studied by Sagawa et al. (2020) (http://proceedings.mlr.press/v119/sagawa20a.html).\\n\\n\\n\\n> The study is limited to binary classification. It is not clear how to extend the loss to multiclass\\n\\nExtensions to multiclass classification would indeed be interesting. We note that a polynomially tailed loss can be obtained by suitably modifying the cross-entropy loss.\\n\\n\\n> [the paper's finding] is perhaps not so surprising especially given the recent works and findings on mitigating the effect of weighted CE failure for imbalanced data\\n\\nOur motivation aims to directly revive importance weighting as an effective technique for training interpolating classifiers under distribution shift. Previous techniques that multiply terms to the logits do not constitute importance weighting methods, whereas techniques that add terms to the logits do not change the asymptotic behavior of gradient descent. Our proposed loss is a fundamentally new loss function with new surprising properties, e.g. the optimal weights is the imbalance ratio raised to a power.\\n\", \"title\": \"Addressing your conerns\"}], \"references\": [{\"abstract\": \"Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this\\\"benign overfitting\\\"phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.\", \"arxiv_id\": \"2104.13628\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Yuan Cao\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Quanquan Gu\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"M. Belkin\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2104.13628\", \"CorpusId\": 233423158, \"DBLP\": \"journals/corr/abs-2104-13628\", \"DOI\": null, \"MAG\": null, \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"cao|risk_bounds_for_over-parameterized_maximum_margin_classification_on_sub-gaussian_mixtures\", \"s2_corpus_id\": \"233423158\", \"title\": \"Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures\"}, {\"abstract\": \"The recent paper by Byrd&Lipton (2019), based on empirical observations, raises a major concern on the impact of importance weighting for the over-parameterized deep learning models. They observe that as long as the model can separate the training data, the impact of importance weighting diminishes as the training proceeds. Nevertheless, there lacks a rigorous characterization of this phenomenon. In this paper, we provide formal characterizations and theoretical justifications on the role of importance weighting with respect to the implicit bias of gradient descent and margin-based learning theory. We reveal both the optimization dynamics and generalization performance under deep learning models. Our work not only explains the various novel phenomenons observed for importance weighting in deep learning, but also extends to the studies where the weights are being optimized as part of the model, which applies to a number of topics under active research.\", \"arxiv_id\": \"2103.15209\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Da Xu\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Yuting Ye\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Chuanwei Ruan\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2103.15209\", \"CorpusId\": 231807280, \"DBLP\": \"conf/iclr/XuYR21\", \"DOI\": null, \"MAG\": null, \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"xu|understanding_the_role_of_importance_weighting_for_deep_learning\", \"s2_corpus_id\": \"231807280\", \"title\": \"Understanding the role of importance weighting for deep learning\"}, {\"abstract\": \"The goal in label-imbalanced and group-sensitive classification is to optimize relevant metrics such as balanced error and equal opportunity. Classical methods, such as weighted cross-entropy, fail when training deep nets to the terminal phase of training (TPT), that is training beyond zero training error. This observation has motivated recent flurry of activity in developing heuristic alternatives following the intuitive mechanism of promoting larger margin for minorities. In contrast to previous heuristics, we follow a principled analysis explaining how different loss adjustments affect margins. First, we prove that for all linear classifiers trained in TPT, it is necessary to introduce multiplicative, rather than additive, logit adjustments so that the interclass margins change appropriately. To show this, we discover a connection of the multiplicative CE modification to the cost-sensitive support-vector machines. Perhaps counterintuitively, we also find that, at the start of training, the same multiplicative weights can actually harm the minority classes. Thus, while additive adjustments are ineffective in the TPT, we show that they can speed up convergence by countering the initial negative effect of the multiplicative weights. Motivated by these findings, we formulate the vector-scaling (VS) loss, that captures existing techniques as special cases. Moreover, we introduce a natural extension of the VS-loss to group-sensitive classification, thus treating the two common types of imbalances (label/group) in a unifying way. Importantly, our experiments on state-of-the-art datasets are fully consistent with our theoretical insights and confirm the superior performance of our algorithms. Finally, for imbalanced Gaussian-mixtures data, we perform a generalization analysis, revealing tradeoffs between balanced / standard error and equal opportunity.\", \"arxiv_id\": \"2103.01550\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Ganesh Ramachandra Kini\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Orestis Paraskevas\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Samet Oymak\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Christos Thrampoulidis\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2103.01550\", \"CorpusId\": 232092414, \"DBLP\": \"journals/corr/abs-2103-01550\", \"DOI\": null, \"MAG\": null, \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"kini|label-imbalanced_and_group-sensitive_classification_under_overparameterization\", \"s2_corpus_id\": \"232092414\", \"title\": \"Label-Imbalanced and Group-Sensitive Classification under Overparameterization\"}, {\"abstract\": \"This paper provides elementary analyses of the regret and generalization of minimum-norm interpolating classifiers (MNIC). The MNIC is the function of smallest Reproducing Kernel Hilbert Space norm that perfectly interpolates a label pattern on a finite data set. We derive a mistake bound for MNIC and a regularized variant that holds for all data sets. This bound follows from elementary properties of matrix inverses. Under the assumption that the data is independently and identically distributed, the mistake bound implies that MNIC generalizes at a rate proportional to the norm of the interpolating solution and inversely proportional to the number of data points. This rate matches similar rates derived for margin classifiers and perceptrons. We derive several plausible generative models where the norm of the interpolating classifier is bounded or grows at a rate sublinear in $n$. We also show that as long as the population class conditional distributions are sufficiently separable in total variation, then MNIC generalizes with a fast rate.\", \"arxiv_id\": \"2101.11815\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Tengyuan Liang\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"B. Recht\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2101.11815\", \"CorpusId\": 231719406, \"DBLP\": \"journals/jmlr/LiangR23\", \"DOI\": null, \"MAG\": null, \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"liang|interpolating_classifiers_make_few_mistakes\", \"s2_corpus_id\": \"231719406\", \"title\": \"Interpolating Classifiers Make Few Mistakes\"}, {\"abstract\": \"Deep neural networks generalize well despite being exceedingly overparameterized, but understanding the statistical principles behind this so called benign-overfitting phenomenon is not yet well understood. Recently there has been remarkable progress towards understanding benign-overfitting in simpler models, such as linear regression and, even more recently, linear classification. This paper studies benign-overfitting for data generated from a popular binary Gaussian mixtures model (GMM) and classifiers trained by support-vector machines (SVM). Our approach has two steps. First, we leverage an idea introduced in [2] to relate the SVM solution to the least-squares (LS) solution. Second, we derive novel non-asymptotic bounds on the classification error of LS solution. Combining the two gives sufficient conditions on the overparameterization ratio and the signal-to-noise ratio that lead to benign overfitting. We corroborate our theoretical findings with numerical simulations.\", \"arxiv_id\": null, \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Ke Wang\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Christos Thrampoulidis\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": null, \"CorpusId\": 227011925, \"DBLP\": \"journals/corr/abs-2011-09148\", \"DOI\": \"10.1109/ICASSP39728.2021.9413946\", \"MAG\": \"3099518927\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"wang|benign_overfitting_in_binary_classification_of_gaussian_mixtures\", \"s2_corpus_id\": \"227011925\", \"title\": \"Benign Overfitting in Binary Classification of Gaussian Mixtures\"}, {\"abstract\": \"\\u00a9 2018, Cambridge University Press Let us summarize our findings. A random projection of a set T in R n onto an m-dimensional subspace approximately preserves the geometry of T if m \\u2a86 d ( T ) . For...\", \"arxiv_id\": null, \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"O. Papaspiliopoulos\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": null, \"CorpusId\": 221910705, \"DBLP\": null, \"DOI\": \"10.1080/14697688.2020.1813475\", \"MAG\": \"3088300790\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"papaspiliopoulos|high-dimensional_probability:_an_introduction_with_applications_in_data_science\", \"s2_corpus_id\": \"221910705\", \"title\": \"High-Dimensional Probability: An Introduction with Applications in Data Science\"}, {\"abstract\": \"Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naive learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques revisit the classic idea of logit adjustment based on the label frequencies, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a large relative margin between logits of rare versus dominant labels. These techniques unify and generalise several recent proposals in the literature, while possessing firmer statistical grounding and empirical performance.\", \"arxiv_id\": \"2007.07314\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"A. Menon\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Sadeep Jayasumana\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"A. Rawat\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Himanshu Jain\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Andreas Veit\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Sanjiv Kumar\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2007.07314\", \"CorpusId\": 220525799, \"DBLP\": \"journals/corr/abs-2007-07314\", \"DOI\": null, \"MAG\": \"3042496707\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"menon|long-tail_learning_via_logit_adjustment\", \"s2_corpus_id\": \"220525799\", \"title\": \"Long-tail learning via logit adjustment\"}, {\"abstract\": \"Recent work across many machine learning disciplines has highlighted that standard descent methods, even without explicit regularization, do not merely minimize the training error, but also exhibit an implicit bias. This bias is typically towards a certain regularized solution, and relies upon the details of the learning process, for instance the use of the cross-entropy loss. In this work, we show that for empirical risk minimization over linear predictors with arbitrary convex, strictly decreasing losses, if the risk does not attain its infimum, then the gradient-descent path and the algorithm-independent regularization path converge to the same direction (whenever either converges to a direction). Using this result, we provide a justification for the widely-used exponentially-tailed losses (such as the exponential loss or the logistic loss): while this convergence to a direction for exponentially-tailed losses is necessarily to the maximum-margin direction, other losses such as polynomially-tailed losses may induce convergence to a direction with a poor margin.\", \"arxiv_id\": \"2006.11226\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Ziwei Ji\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Miroslav Dud'ik\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"R. Schapire\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Matus Telgarsky\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2006.11226\", \"CorpusId\": 219956727, \"DBLP\": \"conf/colt/JiDST20\", \"DOI\": null, \"MAG\": \"3036470948\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"ji|gradient_descent_follows_the_regularization_path_for_general_losses\", \"s2_corpus_id\": \"219956727\", \"title\": \"Gradient descent follows the regularization path for general losses\"}, {\"abstract\": \"In this paper, we show that although the minimizers of cross-entropy and related classification losses are off at infinity, network weights learned by gradient flow converge in direction, with an immediate corollary that network predictions, training errors, and the margin distribution also converge. This proof holds for deep homogeneous networks -- a broad class of networks allowing for ReLU, max pooling, linear, and convolutional layers -- and we additionally provide empirical support not just close to the theory (e.g., the AlexNet), but also on non-homogeneous networks (e.g., the ResNet). If the network further has locally Lipschitz gradients, we show that these gradients converge in direction, and asymptotically align with the gradient flow path, with consequences on margin maximization. Our analysis complements and is distinct from the well-known neural tangent and mean-field theories, and in particular makes no requirements on network width and initialization, instead merely requiring perfect classification accuracy. The proof proceeds by developing a theory of unbounded nonsmooth Kurdyka-Lojasiewicz inequalities for functions definable in an o-minimal structure, and is also applicable outside deep learning.\", \"arxiv_id\": \"2006.06657\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Ziwei Ji\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Matus Telgarsky\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2006.06657\", \"CorpusId\": 219573767, \"DBLP\": \"journals/corr/abs-2006-06657\", \"DOI\": null, \"MAG\": \"3100332668\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"ji|directional_convergence_and_alignment_in_deep_learning\", \"s2_corpus_id\": \"219573767\", \"title\": \"Directional convergence and alignment in deep learning\"}, {\"abstract\": \"We compare classification and regression tasks in the overparameterized linear model with Gaussian features. On the one hand, we show that with sufficient overparameterization all training points are support vectors: solutions obtained by least-squares minimum-norm interpolation, typically used for regression, are identical to those produced by the hard-margin support vector machine (SVM) that minimizes the hinge loss, typically used for training classifiers. On the other hand, we show that there exist regimes where these solutions are near-optimal when evaluated by the 0-1 test loss function, but do not generalize if evaluated by the square loss function, i.e. they achieve the null risk. Our results demonstrate the very different roles and properties of loss functions used at the training phase (optimization) and the testing phase (generalization).\", \"arxiv_id\": \"2005.08054\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Vidya Muthukumar\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Adhyyan Narang\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Vignesh Subramanian\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"M. Belkin\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Daniel J. Hsu\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"A. Sahai\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2005.08054\", \"CorpusId\": 218673892, \"DBLP\": \"journals/corr/abs-2005-08054\", \"DOI\": null, \"MAG\": \"3025991882\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"muthukumar|classification_vs_regression_in_overparameterized_regimes:_does_the_loss_function_matter?\", \"s2_corpus_id\": \"218673892\", \"title\": \"Classification vs regression in overparameterized regimes: Does the loss function matter?\"}, {\"abstract\": \"We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.\", \"arxiv_id\": \"2004.12019\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Niladri S. Chatterji\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Philip M. Long\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2004.12019\", \"CorpusId\": 216553485, \"DBLP\": \"journals/corr/abs-2004-12019\", \"DOI\": null, \"MAG\": \"3020361330\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"chatterji|finite-sample_analysis_of_interpolating_linear_classifiers_in_the_overparameterized_regime\", \"s2_corpus_id\": \"216553485\", \"title\": \"Finite-sample analysis of interpolating linear classifiers in the overparameterized regime\"}, {\"abstract\": \"We investigate learning a ConvNet classifier with class-imbalanced data. We found that a ConvNet over-fits significantly to the minor classes that do not have sufficient training instances, even if it is trained using vanilla empirical risk minimization (ERM). We conduct a series of analysis and argue that feature deviation between the training and test instances serves as the main cause. We propose to incorporate class-dependent temperatures (CDT) in learning a ConvNet: CDT forces the minor-class instances to have larger decision values in training, so as to compensate for the effect of feature deviation in testing. We validate our approach on several benchmark datasets and achieve promising results. Our studies further suggest that class-imbalanced data affects traditional machine learning and recent deep learning in very different ways. We hope that our insights can inspire new ways of thinking in resolving class-imbalanced deep learning.\", \"arxiv_id\": \"2001.01385\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Han-Jia Ye\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Hong-You Chen\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"De-chuan Zhan\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Wei-Lun Chao\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"2001.01385\", \"CorpusId\": 209862849, \"DBLP\": \"journals/corr/abs-2001-01385\", \"DOI\": null, \"MAG\": \"2997580736\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"ye|identifying_and_compensating_for_feature_deviation_in_imbalanced_deep_learning\", \"s2_corpus_id\": \"209862849\", \"title\": \"Identifying and Compensating for Feature Deviation in Imbalanced Deep Learning\"}, {\"abstract\": \"Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization---a stronger-than-typical L2 penalty or early stopping---we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.\", \"arxiv_id\": \"1911.08731\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Shiori Sagawa\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Pang Wei Koh\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Tatsunori B. Hashimoto\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Percy Liang\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"1911.08731\", \"CorpusId\": 208176471, \"DBLP\": \"journals/corr/abs-1911-08731\", \"DOI\": null, \"MAG\": \"2991497298\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"sagawa|distributionally_robust_neural_networks_for_group_shifts:_on_the_importance_of_regularization_for_worst-case_generalization\", \"s2_corpus_id\": \"208176471\", \"title\": \"Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization\"}, {\"abstract\": \"Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray de\\ufb01ned by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate O ( lnln t / ln t ) . The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate O ( (ln t ) 2 / \\u221a t ) .\", \"arxiv_id\": null, \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Ziwei Ji\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Matus Telgarsky\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": null, \"CorpusId\": 195769437, \"DBLP\": \"conf/colt/JiT19\", \"DOI\": null, \"MAG\": \"2965772785\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"ji|the_implicit_bias_of_gradient_descent_on_nonseparable_data\", \"s2_corpus_id\": \"195769437\", \"title\": \"The implicit bias of gradient descent on nonseparable data\"}, {\"abstract\": \"Deep learning algorithms can fare poorly when the training dataset suffers from heavy class-imbalance but the testing criterion requires good generalization on less frequent classes. We design two novel methods to improve performance in such scenarios. First, we propose a theoretically-principled label-distribution-aware margin (LDAM) loss motivated by minimizing a margin-based generalization bound. This loss replaces the standard cross-entropy objective during training and can be applied with prior strategies for training with class-imbalance such as re-weighting or re-sampling. Second, we propose a simple, yet effective, training schedule that defers re-weighting until after the initial stage, allowing the model to learn an initial representation while avoiding some of the complications associated with re-weighting or re-sampling. We test our methods on several benchmark vision tasks including the real-world imbalanced dataset iNaturalist 2018. Our experiments show that either of these methods alone can already improve over existing techniques and their combination achieves even better performance gains.\", \"arxiv_id\": \"1906.07413\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Kaidi Cao\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Colin Wei\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Adrien Gaidon\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"N. Ar\\u00e9chiga\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Tengyu Ma\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"1906.07413\", \"CorpusId\": 189998981, \"DBLP\": \"journals/corr/abs-1906-07413\", \"DOI\": null, \"MAG\": \"2970941190\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"cao|learning_imbalanced_datasets_with_label-distribution-aware_margin_loss\", \"s2_corpus_id\": \"189998981\", \"title\": \"Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss\"}, {\"abstract\": \"In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study the gradient descent or gradient flow (i.e., gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed version of the normalized margin which increases over time. We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem. Our results generalize the previous results for logistic regression with one-layer or multi-layer linear networks, and provide more quantitative convergence results with weaker assumptions than previous results for homogeneous smooth neural networks. We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. Finally, as margin is closely related to robustness, we discuss potential benefits of training longer for improving the robustness of the model.\", \"arxiv_id\": \"1906.05890\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Kaifeng Lyu\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Jian Li\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"1906.05890\", \"CorpusId\": 189898036, \"DBLP\": \"conf/iclr/LyuL20\", \"DOI\": null, \"MAG\": \"2952126211\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"lyu|gradient_descent_maximizes_the_margin_of_homogeneous_neural_networks\", \"s2_corpus_id\": \"189898036\", \"title\": \"Gradient Descent Maximizes the Margin of Homogeneous Neural Networks\"}, {\"abstract\": \"Importance-weighted risk minimization is a key ingredient in many machine learning algorithms for causal inference, domain adaptation, class imbalance, and off-policy reinforcement learning. While the effect of importance weighting is well-characterized for low-capacity misspecified models, little is known about how it impacts over-parameterized, deep neural networks. This work is inspired by recent theoretical results showing that on (linearly) separable data, deep linear networks optimized by SGD learn weight-agnostic solutions, prompting us to ask, for realistic deep networks, for which many practical datasets are separable, what is the effect of importance weighting? We present the surprising finding that while importance weighting impacts models early in training, its effect diminishes over successive epochs. Moreover, while L2 regularization and batch normalization (but not dropout), restore some of the impact of importance weighting, they express the effect via (seemingly) the wrong abstraction: why should practitioners tweak the L2 regularization, and by how much, to produce the correct weighting effect? Our experiments confirm these findings across a range of architectures and datasets.\", \"arxiv_id\": \"1812.03372\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Jonathon Byrd\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Zachary Chase Lipton\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"1812.03372\", \"CorpusId\": 83458523, \"DBLP\": \"conf/icml/ByrdL19\", \"DOI\": null, \"MAG\": \"2952421377\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"byrd|what_is_the_effect_of_importance_weighting_in_deep_learning?\", \"s2_corpus_id\": \"83458523\", \"title\": \"What is the Effect of Importance Weighting in Deep Learning?\"}, {\"abstract\": \"We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the conditions on the tail of the loss function under which gradient descent converges in the direction of the $L_2$ maximum-margin separator? (b) how does the rate of margin convergence depend on the tail of the loss function and the choice of the step size? We show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails. Within this family, for simple linear models we show that the optimal rates with fixed step size is indeed obtained for the commonly used exponentially tailed losses such as logistic loss. However, with a fixed step size the optimal convergence rate is extremely slow as $1/\\\\log(t)$, as also proved in Soudry et al. (2018). For linear models with exponential loss, we further prove that the convergence rate could be improved to $\\\\log (t) /\\\\sqrt{t}$ by using aggressive step sizes that compensates for the rapidly vanishing gradients. Numerical results suggest this method might be useful for deep networks.\", \"arxiv_id\": \"1803.01905\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"M. S. Nacson\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"J. Lee\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Suriya Gunasekar\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"N. Srebro\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Daniel Soudry\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"1803.01905\", \"CorpusId\": 3692345, \"DBLP\": \"journals/corr/abs-1803-01905\", \"DOI\": null, \"MAG\": \"2951192614\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"nacson|convergence_of_gradient_descent_on_separable_data\", \"s2_corpus_id\": \"3692345\", \"title\": \"Convergence of Gradient Descent on Separable Data\"}, {\"abstract\": \"We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization n more complex models and with other optimization methods.\", \"arxiv_id\": \"1710.10345\", \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Daniel Soudry\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Elad Hoffer\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Suriya Gunasekar\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"N. Srebro\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": \"1710.10345\", \"CorpusId\": 3994909, \"DBLP\": \"journals/corr/abs-1710-10345\", \"DOI\": \"10.5555/3291125.3309632\", \"MAG\": \"2964084001\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"soudry|the_implicit_bias_of_gradient_descent_on_separable_data\", \"s2_corpus_id\": \"3994909\", \"title\": \"The Implicit Bias of Gradient Descent on Separable Data\"}, {\"abstract\": \"Many learning situations involve learning the conditional distribution p(y|x) when the training instances are drawn from the training distribution ptr(x), even though it will later be used to predict for instances drawn from a different test distribution pte(x). Most current approaches focus on learning how to reweigh the training examples, to make them resemble the test distribution. However, reweighing does not always help, because (we show that) the test error also depends on the correctness of the underlying model class. This paper analyses this situation by viewing the problem of learning under changing distributions as a game between a learner and an adversary. We characterize when such reweighing is needed, and also provide an algorithm, robust covariate shift adjustment (RCSA), that provides relevant weights. Our empirical studies, on UCI datasets and a real-world cancer prognostic prediction dataset, show that our analysis applies, and that our RCSA works effectively.\", \"arxiv_id\": null, \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Junfeng Wen\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Chun-Nam Yu\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"R. Greiner\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": null, \"CorpusId\": 992849, \"DBLP\": \"conf/icml/WenYG14\", \"DOI\": null, \"MAG\": \"2149754870\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"wen|robust_learning_under_uncertain_test_distributions:_relating_covariate_shift_to_model_misspecification\", \"s2_corpus_id\": \"992849\", \"title\": \"Robust Learning under Uncertain Test Distributions: Relating Covariate Shift to Model Misspecification\"}, {\"abstract\": \"We study a two-class classification problem with a large number of features, out of which many are useless and only a few are useful, but we do not know which ones they are. The number of features is large compared with the number of training observations. Calibrating the model with 4 key parameters\\u2014the number of features, the size of the training sample, the fraction, and strength of useful features\\u2014we identify a region in parameter space where no trained classifier can reliably separate the two classes on fresh data. The complement of this region\\u2014where successful classification is possible\\u2014is also briefly discussed.\", \"arxiv_id\": null, \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"Jiashun Jin\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": null, \"CorpusId\": 11232293, \"DBLP\": null, \"DOI\": \"10.1073/pnas.0903931106\", \"MAG\": \"2029394273\", \"PubMed\": \"19447927\", \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"jin|impossibility_of_successful_classification_when_useful_features_are_rare_and_weak\", \"s2_corpus_id\": \"11232293\", \"title\": \"Impossibility of successful classification when useful features are rare and weak\"}, {\"abstract\": \"Preface 1. Convexity 2. Static optimization 3. Equilibruim mathematics 4. Comparative statics and duality 5. Dynamics and stability 6. Introduction to dynamic optimization and the calculus of variations 7. Optimal control theory Appendices Glossary Bibliography Author index Subject index.\", \"arxiv_id\": null, \"authors\": [{\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"B. Beavis\"}, {\"affiliation\": {\"institution\": null, \"laboratory\": null, \"location\": null}, \"email\": null, \"name\": \"I. Dobbs\"}], \"external_ids\": {\"ACL\": null, \"ArXiv\": null, \"CorpusId\": 124883997, \"DBLP\": null, \"DOI\": \"10.1017/CBO9780511559402.004\", \"MAG\": \"2187866478\", \"PubMed\": null, \"PubMedCentral\": null}, \"intents\": [], \"isInfluential\": false, \"paperhash\": \"beavis|optimization_and_stability_theory_for_economic_analysis:_equilibrium_mathematics\", \"s2_corpus_id\": \"124883997\", \"title\": \"Optimization and Stability Theory for Economic Analysis: EQUILIBRIUM MATHEMATICS\"}], \"hypothesis\": \"Problem: Investigating the effectiveness of importance weighting in addressing distribution shift in overparameterized interpolating models in deep learning.\\nSolution: Altering the structure of the loss function to have polynomially decaying tails and using appropriately reweighted importance weights can effectively correct distribution shift in overparameterized classifiers, outperforming traditional methods like reweighted cross-entropy loss in scenarios of label shift and subpopulation shift.\", \"mean_score\": 0.6000000000000001, \"mean_confidence\": 0.6875, \"mean_novelty\": 0.6875, \"mean_correctness\": 0.9166666666666666, \"mean_clarity\": null, \"mean_impact\": null, \"mean_reproducibility\": null, \"submission_date\": \"2021-12-24\"}\nBased on the information above and what you know about the bar of ICLR_cc_2022_Conference, do you think it should be ACCEPTED or REJECTED?\n\nHere are the reasoning steps:\n[\n  {\n    \"reasoning_content\": \"\",\n    \"belief\": null\n  },\n  {\n    \"reasoning_content\": \"Debater A (ACCEPTED): The paper should be accepted because it addresses a critical issue in the field of machine learning: the effectiveness of importance weighting in overparameterized models. The authors provide a novel solution by introducing polynomially-tailed loss functions, which effectively restore the impact of importance weighting. This approach not only challenges the prevailing notion that importance weighting is ineffective in deep learning but also offers a practical tool that outperforms traditional methods like reweighted cross-entropy loss. The paper is well-supported by both theoretical analysis and empirical evidence, demonstrating significant improvements in test accuracy under distribution shifts. This contribution is both timely and valuable, offering insights that could influence future research and applications in handling distribution shifts in deep learning.\\n\\nDebater B (REJECTED): The paper should be rejected because it fails to provide a comprehensive empirical evaluation against a wide range of state-of-the-art methods for handling distribution shifts in overparameterized models. While the authors claim that their proposed polynomially-tailed loss function with exponentiated importance weights can outperform traditional reweighted cross-entropy loss, they do not adequately compare their method to other contemporary techniques such as LDAM, logit-adjusted loss, or vector-scaling loss, which are well-regarded in the field. Without these comparisons, it is difficult to ascertain the true practical value and superiority of their approach. Furthermore, the theoretical analysis is limited to linear models and sub-Gaussian data, which may not generalize well to more complex, real-world scenarios involving deep neural networks and diverse data distributions. This lack of generalizability and empirical rigor undermines the paper's contributions and warrants its rejection.\",\n    \"belief\": null\n  },\n  {\n    \"reasoning_content\": \"Debater A (ACCEPTED): The paper should still be accepted because it makes a significant theoretical contribution by addressing the limitations of importance weighting in overparameterized models, which is a fundamental issue in machine learning. While the empirical comparisons to other state-of-the-art methods could be expanded, the authors have already demonstrated that their approach outperforms traditional reweighted cross-entropy loss, which is a widely used baseline. The theoretical insights provided are robust and offer a new perspective on the role of loss functions in correcting distribution shifts. Moreover, the paper's focus on polynomially-tailed losses opens up new avenues for research and application, potentially influencing future work in the field. The limitations in empirical evaluation do not overshadow the paper's innovative approach and theoretical advancements, which are valuable contributions to the ongoing discourse on distribution shifts in deep learning.\\n\\nDebater B (REJECTED): The paper should still be rejected because its contributions, while theoretically intriguing, lack the necessary empirical validation to substantiate their claims of practical superiority. The authors' focus on polynomially-tailed loss functions is indeed novel, but without rigorous comparisons to a broader set of existing methods, the real-world applicability and effectiveness of their approach remain questionable. The empirical results presented are limited and do not convincingly demonstrate that the proposed method consistently outperforms other state-of-the-art techniques across diverse datasets and scenarios. Additionally, the theoretical analysis is constrained to linear models, which does not adequately address the complexities of modern deep learning architectures. This gap between theory and practice is a significant shortcoming that undermines the paper's potential impact and relevance, justifying its rejection.\",\n    \"belief\": null\n  }\n]\n\nTips:\n- The reasoning steps are in the order of the reasoning process. The belief field is the probability of the option being the correct answer, **after observing the current and all previous reasoning steps**, and **according to your assessment as an independent third-party observer**. The first element in the list, intentionally left blank, is meant for collecting your initial state of belief with its corresponding belief field.\n- Remember that you are estimating the probability of \"ACCEPTED\" being the correct answer, not \"REJECTED\".\n- Remember to exactly repeat the reasoning steps (in their original format and order) in your response, but with the belief fields filled in as floating-point numbers between 0 and 1. Include no other text."
    }
  ],
  "[\n  {\n    \"reasoning_content\": \"\",\n    \"belief\": 0.5\n  },\n  {\n    \"reasoning_content\": \"Debater A (ACCEPTED): The paper should be accepted because it addresses a critical issue in the field of machine learning: the effectiveness of importance weighting in overparameterized models. The authors provide a novel solution by introducing polynomially-tailed loss functions, which effectively restore the impact of importance weighting. This approach not only challenges the prevailing notion that importance weighting is ineffective in deep learning but also offers a practical tool that outperforms traditional methods like reweighted cross-entropy loss. The paper is well-supported by both theoretical analysis and empirical evidence, demonstrating significant improvements in test accuracy under distribution shifts. This contribution is both timely and valuable, offering insights that could influence future research and applications in handling distribution shifts in deep learning.\",\n    \"belief\": 0.7\n  },\n  {\n    \"reasoning_content\": \"Debater B (REJECTED): The paper should be rejected because it fails to provide a comprehensive empirical evaluation against a wide range of state-of-the-art methods for handling distribution shifts in overparameterized models. While the authors claim that their proposed polynomially-tailed loss function with exponentiated importance weights can outperform traditional reweighted cross-entropy loss, they do not adequately compare their method to other contemporary techniques such as LDAM, logit-adjusted loss, or vector-scaling loss, which are well-regarded in the field. Without these comparisons, it is difficult to ascertain the true practical value and superiority of their approach. Furthermore, the theoretical analysis is limited to linear models and sub-Gaussian data, which may not generalize well to more complex, real-world scenarios involving deep neural networks and diverse data distributions. This lack of generalizability and empirical rigor undermines the paper's contributions and warrants its rejection.\",\n    \"belief\": 0.6\n  },\n  {\n    \"reasoning_content\": \"Debater A (ACCEPTED): The paper should still be accepted because it makes a significant theoretical contribution by addressing the limitations of importance weighting in overparameterized models, which is a fundamental issue in machine learning. While the empirical comparisons to other state-of-the-art methods could be expanded, the authors have already demonstrated that their approach outperforms traditional reweighted cross-entropy loss, which is a widely used baseline. The theoretical insights provided are robust and offer a new perspective on the role of loss functions in correcting distribution shifts. Moreover, the paper's focus on polynomially-tailed losses opens up new avenues for research and application, potentially influencing future work in the field. The limitations in empirical evaluation do not overshadow the paper's innovative approach and theoretical advancements, which are valuable contributions to the ongoing discourse on distribution shifts in deep learning.\",\n    \"belief\": 0.65\n  },\n  {\n    \"reasoning_content\": \"Debater B (REJECTED): The paper should still be rejected because its contributions, while theoretically intriguing, lack the necessary empirical validation to substantiate their claims of practical superiority. The authors' focus on polynomially-tailed loss functions is indeed novel, but without rigorous comparisons to a broader set of existing methods, the real-world applicability and effectiveness of their approach remain questionable. The empirical results presented are limited and do not convincingly demonstrate that the proposed method consistently outperforms other state-of-the-art techniques across diverse datasets and scenarios. Additionally, the theoretical analysis is constrained to linear models, which does not adequately address the complexities of modern deep learning architectures. This gap between theory and practice is a significant shortcoming that undermines the paper's potential impact and relevance, justifying its rejection.\",\n    \"belief\": 0.6\n  }\n]"
]