{
  "metadata": {
    "forum_id": "ryGWhJBtDB",
    "review_id": "r1l1CEFwKr",
    "rebuttal_id": "ryltuFZ7sr",
    "title": "Hyperparameter Tuning and Implicit Regularization in Minibatch SGD",
    "reviewer": "AnonReviewer1",
    "rating": 3,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=ryGWhJBtDB&noteId=ryltuFZ7sr",
    "annotator": "anno10"
  },
  "review_sentences": [
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 0,
      "text": "The paper attempts to clarify the debate on large-batch neural network training, particularly on the relationship between learning rate, batch sizes and test performance.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 1,
      "text": "The authors claim two contributions towards understanding how the hyper-parameters of SGD affect final training and test performance: (1) SGD exhibits two regimes with different behaviours and (2) large-batch training leads to degradation of test performance even with same step budgets.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 2,
      "text": "Overall, the authors did a comprehensive study on large-batch training with the support of extensive experiments.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 3,
      "text": "But I'm concerned with the novelty and contributions of this paper.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 4,
      "text": "I tend to reject this paper because (1) the first contribution of the paper is not new as it has already been recognized by a few paper that SGD exhibits two different regimes; (2) this paper makes the debate of large-batch training even muddier.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 5,
      "text": "Main argument:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 6,
      "text": "The paper does not do a great job in clarify the debate.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 7,
      "text": "Particularly, the authors mixed their observations up with the results of published works, making it hard to identify the contributions of this paper.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 8,
      "text": "For example, the two regimes mentioned in the paper has been identified by a few other works and the contribution of this paper is just to verify them again.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 9,
      "text": "Also, I find the experiments done in section 3 and 4 are similar to previous works and even the conclusions are similar.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 10,
      "text": "The only new observation I'm aware of in these two sections is that the training loss and test accuracy are independent of batch size in the noise dominated regime.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 11,
      "text": "Back to introduction section, the goal of this paper (as claimed in the beginning of second paragraph) is to clarify the debate.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 12,
      "text": "But does this paper really achieves this goal?",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 13,
      "text": "In terms of learning rate scaling, this paper gets similar conclusions as Shallue et al. (2018).",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 14,
      "text": "In terms of the difference between vanilla SGD and SGD with momentum, Zhang et al. (2019) already argued that the difference depends on specific batch sizes and SGD with momentum only outperforms SGD in the curvature dominated regime.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 15,
      "text": "I think the authors should instead focus on the discussion of generalization performance and the observation that training loss and test accuracy are independent of batch size in noise dominated regime.",
      "suffix": "",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 16,
      "text": "To my knowledge, this part is novel and interesting.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 17,
      "text": "In summary, I'm inclined to reject this paper given the current version.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "arg_other",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1l1CEFwKr",
      "sentence_index": 18,
      "text": "However, I think the paper is still worth reading if the authors can reorganize the paper and I might increase my score if my concerns get resolved.",
      "suffix": "",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 0,
      "text": "We thank the reviewer for their helpful comments.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 1,
      "text": "Please could the reviewer clarify why they felt our work muddies the debate regarding large-batch training?",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_followup",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 2,
      "text": "We demonstrate that one can initially increase the batch size with no loss in test accuracy by simultaneously increasing the learning rate.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 3,
      "text": "However for very large batch sizes the test accuracy degrades under both constant epoch and constant step budgets.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 4,
      "text": "We agree that some of our observations under constant epoch budgets in sections 2 and 3 have been made in previous work.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          3,
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 5,
      "text": "However there are also several important differences:",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          3,
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 6,
      "text": "1.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 7,
      "text": "Our paper is the first to relate the two regimes of SGD to the popular analogy between SGD and stochastic differential equations (SDEs).",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 8,
      "text": "As we show in sections 4 and 5, this perspective is crucial to understanding the influence of batch size and learning rate on test accuracy.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 9,
      "text": "A common criticism of this analogy is that SGD noise is not Gaussian when the batch size is small.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 10,
      "text": "To our knowledge, we are the first to show that the analogy between SGD and SDEs holds for non-Gaussian short-tailed noise (appendix B).",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 11,
      "text": "2. Zhang et al. argued that Momentum only helps in the large batch limit.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 12,
      "text": "However, their analysis is based on the noisy quadratic model, which cannot explain the results we observed on the test set in sections 4 and 5.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 13,
      "text": "These experiments clearly demonstrate that, unlike the SDE perspective, the noisy quadratic model is not an appropriate model for predicting test set performance in deep learning.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 14,
      "text": "Their work also does not clarify the assumptions under which linear scaling of the learning rate should arise.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 15,
      "text": "3. Our empirical results in section 3 are similar to Shallue et al., however their work argues that there is no reliable relationship between learning rate and batch size.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 16,
      "text": "We draw a very different conclusion: the learning rate usually obeys linear scaling, but linear scaling only holds theoretically when the assumptions we specify are satisfied.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 17,
      "text": "Linear scaling may not hold in cases where these assumptions break down (e.g., language modelling).",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 18,
      "text": "4. The observation that the test accuracy is independent of batch size in the noise dominated regime is a natural consequence of the SDE analogy, since any two training runs which integrate the same SDE should sample final parameters from the same probability distribution.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 19,
      "text": "We will clarify this in the updated text.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {
        "manuscript_change": true
      }
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 20,
      "text": "Two reviewers complained that it was difficult to tell from the text which contributions are novel and which also appear in previous works.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 21,
      "text": "We apologise for this.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 22,
      "text": "It was not our intention and we will edit sections 1 and 2 to ensure that this is resolved and that the above points are reflected in the text.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_global",
        null
      ],
      "details": {
        "manuscript_change": true
      }
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 23,
      "text": "Turning to our generalization experiments in sections 4 and 5.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 24,
      "text": "It is true that a number of papers in recent years have claimed that SGD noise enhances generalization.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 25,
      "text": "However Shallue et al. recently argued no previous work had provided convincing empirical evidence for this claim.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 26,
      "text": "Indeed in their abstract, they state \u2018We find no evidence that larger batch sizes degrade out-of-sample performance\u2019.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 27,
      "text": "In another recent paper, Zhang et al. argued that optimization in deep learning is well described by a noisy quadratic model which predicts that increasing the batch size should always enhance performance under constant step budgets.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 28,
      "text": "Crucially, to establish that SGD noise enhances generalization, one must show that small batch sizes generalize better than large batch sizes under constant step budgets, with realistic learning rate decay schedules, and one must independently tune the learning rate at each batch size.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 29,
      "text": "In section 4, we are the first authors to perform this experiment and confirm that the final test accuracy of SGD does degrade for very large batch sizes under both constant epoch and constant step budgets, contradicting the claims of both Shallue et al and Zhang et al.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 30,
      "text": "Furthermore, we show in section 5 that the optimal SGD temperature which maximizes the test accuracy is almost independent of the epoch budget.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 31,
      "text": "These results provide the first convincing empirical evidence that SGD noise does enhance generalization in well-tuned networks with learning rate decay schedules.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1l1CEFwKr",
      "rebuttal_id": "ryltuFZ7sr",
      "sentence_index": 32,
      "text": "We believe this is an important contribution.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    }
  ]
}