{
  "metadata": {
    "forum_id": "ryGWhJBtDB",
    "review_id": "rJxkq6waYr",
    "rebuttal_id": "HkemB2WXiB",
    "title": "Hyperparameter Tuning and Implicit Regularization in Minibatch SGD",
    "reviewer": "AnonReviewer2",
    "rating": 3,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=ryGWhJBtDB&noteId=HkemB2WXiB",
    "annotator": "anno13"
  },
  "review_sentences": [
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 0,
      "text": "This paper studies the properties of SGD as a function of batch size and learning rate.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 1,
      "text": "Authors argue that SGD has two regimes:  a noise dominated regime (small batch size) and curvature dominated regime (large batch size).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 2,
      "text": "Authors conduct through numerical experiments highlighting how learning rate changes as a function of batch size (initially linear growth and then saturates).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 3,
      "text": "The critical contribution of this work appears to be the observation that large batch size can be worse than small under same number of steps demonstrating implicit regularization of small batch size.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 4,
      "text": "The two regime claim of the paper is not really novel.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 5,
      "text": "These regimes are fairly well covered by previous works (e.g. Belkin et al as well as others).",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 6,
      "text": "When it comes to experiments, constant epoch budget is also fairly well understood and the behavior in Figure 1 is not really surprising (as the eventual training performance gets worse with large batches).",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 7,
      "text": "The interesting part in my opinion is the experiments on constant steps.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 8,
      "text": "Authors verify large batch size reduces test accuracy while improving train.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 9,
      "text": "I believe these experiments are novel and the results are interesting.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 10,
      "text": "Besides CIFAR 10, authors test this hypothesis in two other datasets while tuning the learning rate.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 11,
      "text": "On the other hand, contribution is somewhat incremental given observations made by related literature (Keskar et al and others).",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 12,
      "text": "Some remarks:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 13,
      "text": "1) In Table 1, batch size 16k has effective LR of 32.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 14,
      "text": "However in Figure 1c SGD with momentum at batch size 8k uses an effective LR of 4.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 15,
      "text": "Can you explain this inconsistency i.e. why is there such a huge jump from 4 to 32 (in reality we expect the effective LR to stay constant in the curvature regime).",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 16,
      "text": "I also understand that one is constant epoch and other is constant step.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 17,
      "text": "However 4 to 32 seems a bit inconsistent.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 18,
      "text": "2) Does momentum help in constant step budget (with sufficiently large steps so that training loss is small)?",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "rJxkq6waYr",
      "sentence_index": 19,
      "text": "3) Readability: Consider explaining what is meant by \"warm-up\", \"epoch budget\", \"step budget\" clearly and upfront.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_clarity",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 0,
      "text": "We thank the reviewer for their helpful comments.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 1,
      "text": "We agree that our most surprising results are for SGD under constant step budgets or unlimited epoch budgets.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_accept-praise",
      "alignment": [
        "context_sentences",
        [
          7,
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 2,
      "text": "However the behaviour of SGD under constant epoch budgets has generated a lot of debate in the literature in recent years, and we felt it was important to address this simple case first.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_accept-praise",
      "alignment": [
        "context_sentences",
        [
          7,
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 3,
      "text": "We agree that some of the observations in sections 2 and 3 have already been made in previous work, however there are also several important differences:",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 4,
      "text": "1. Ma, Bassily and Belkin also introduced the notion of two regimes, however their theory holds for convex losses in the interpolating regime.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 5,
      "text": "We will discuss their contribution explicitly in the updated text.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 6,
      "text": "Our discussion in section 2 clarifies why the two regimes arise in practical deep learning models for which these conditions may not hold.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 7,
      "text": "2.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 8,
      "text": "Our paper is the first to relate the two regimes of SGD to the popular analogy between SGD and stochastic differential equations (SDEs).",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 9,
      "text": "As we show in later sections, this perspective is crucial to understanding the influence of batch size and learning rate on test accuracy.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 10,
      "text": "A common criticism of this analogy is that SGD noise is not Gaussian when the batch size is small.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 11,
      "text": "To our knowledge, we are the first to show that the analogy between SGD and SDEs holds for non-Gaussian short-tailed noise (appendix B).",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 12,
      "text": "3. We clarify the differences to some other recent papers in our reply to reviewer 1.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 13,
      "text": "Two reviewers complained that it was difficult to tell from the text which contributions are novel and which also appear in previous works.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 14,
      "text": "We apologise for this.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 15,
      "text": "It was not our intention and we will edit sections 1 and 2 to ensure that this is resolved and that the above points are reflected in the text.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6,
          7,
          8,
          9,
          10,
          11
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 16,
      "text": "Turning to our generalization experiments in sections 4 and 5.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          4,
          5,
          6,
          7,
          8,
          9,
          10,
          11
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 17,
      "text": "We agree that many authors have proposed that SGD noise enhances generalization.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 18,
      "text": "Most notably, Keskar et al. argued that large minibatches perform worse than small minibatches on the test set, even when both achieve similar performance on the training set.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 19,
      "text": "However their experiments do not provide convincing evidence for this claim, because they tuned the learning rate with small batches and then used the same learning rate value with large batches.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 20,
      "text": "A convincing experiment should independently tune the learning rate at all batch sizes under a constant step budget, and it should use a realistic learning rate decay schedule.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 21,
      "text": "Indeed, Shallue et al. recently argued that no existing paper has provided convincing evidence that small batch sizes generalize better than large batch sizes under constant step budgets, and they state in their abstract \u2018We find no evidence that larger batch sizes degrade out-of-sample performance\u2019.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 22,
      "text": "Meanwhile, Zhang et al. argued that optimization in deep learning is well described by a noisy quadratic model which predicts that increasing the batch size should always enhance performance under constant step budgets.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 23,
      "text": "To our knowledge, our experimental results in section 4 are the first to provide convincing evidence that very large minibatches do perform worse than small batch sizes on the test set, even under constant step budgets and when the learning rate is independently tuned.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 24,
      "text": "We believe this is an important contribution.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 25,
      "text": "Meanwhile, our results in section 5 suggest that SGD has an optimal temperature early in training which promotes generalization and is independent of the epoch budget.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 26,
      "text": "In response to the reviewer\u2019s specific comments:",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 27,
      "text": "1) Looking at Figure 1c, while the optimal learning rate at 8k with Momentum is 4, the error bars at this batch size range from 4 to 32.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          13,
          14,
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 28,
      "text": "These error bars can be very large in the curvature regime, precisely because the optimal learning rate is close to instability.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          13,
          14,
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 29,
      "text": "2) Yes, Momentum will help under constant step budgets if the batch size is large, since it enables us to achieve larger effective learning rates which are beneficial for generalization.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 30,
      "text": "We will add additional experiments to the text to clarify this.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "rJxkq6waYr",
      "rebuttal_id": "HkemB2WXiB",
      "sentence_index": 31,
      "text": "3) We will clarify the meaning of warm up, epoch budget and step budget as requested.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_sentences",
        [
          19
        ]
      ],
      "details": {
        "manuscript_change": true
      }
    }
  ]
}