{
  "metadata": {
    "forum_id": "H1gEP6NFwr",
    "review_id": "SJx0zByAYr",
    "rebuttal_id": "B1iEvPOjB",
    "title": "On the Tunability of Optimizers in Deep Learning",
    "reviewer": "AnonReviewer1",
    "rating": 3,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=H1gEP6NFwr&noteId=B1iEvPOjB",
    "annotator": "anno10"
  },
  "review_sentences": [
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 0,
      "text": "The main contributions of the submission are:",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 1,
      "text": "1. A comprehensive empirical comparison of deep learning optimizers, with their performance compared under different amount of hyper-parameter tuning (they perform hyper-parameter tuning using random search).",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 2,
      "text": "2. The introduction of a novel metric that tries to capture the \"tunability\" of an optimizer.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 3,
      "text": "This metric attempts to trade off the performance of an optimizer when tuned only with a small number of hyper-parameter trials, and its performance when carefully tuned.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 4,
      "text": "The metric is defined as a weighted average of the performance after tuning with i random trials, with i that goes from 1 to K.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 5,
      "text": "The weights of this weighted average and K are \"hyper-parameters\" of the metric itself.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 6,
      "text": "They use K=100 and suggest 3 possible choices of weights.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 7,
      "text": "The paper appears to treat 2. as the main contribution.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 8,
      "text": "However, I do not think the metric they introduce is good enough to be recommended in future work, when comparing tunability of optimizers (or other algorithms with hyperparameters).",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 9,
      "text": "The reason is that simpler methods provide just as much information, and do not rely on the need of interpreting the choice of the weights and K. This point is proven in the paper itself, where for example Figure 2 provides a more concrete and easier to interpret information than the tunability metric, similar graphs could be easily provided per dataset.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 10,
      "text": "Similarly, figure 3 as well as figures 5-7 and 8 in the appendix provide very good information about the tunability of the various optimizers without using the introduced metrics.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 11,
      "text": "Information similar (although not identical) to that summarized in table 5 could be captured by substituting the 3 metrics with the best performance after tuning for 4, 16 and 64 iterations respectively (just as examples).",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 12,
      "text": "A stronger contribution is 1., which however is somewhat incremental compared to similar comparisons made in the past.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 13,
      "text": "Comparisons which, while mentioned, should perhaps have been discussed and compared more in detail in this work.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 14,
      "text": "Overall, I do not feel the comparisons dramatically change the qualitative understanding the field has of the different optimizers and their tunability.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 15,
      "text": "They also suggest that when the tuning budget is low, using Adam but tuning only the learning rate is beneficial, which could be a valuable and practical suggestion.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 16,
      "text": "I enjoyed reading the submission, which is very clearly written, but due to the relatively limited value of the contributions, and excessive focus on the tunability metric which I do not feel is giustified, I slightly lean against acceptance here at ICLR. I do think, however, that it would make a great submission to a smaller venue or workshop.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 17,
      "text": "Other comments/notes:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 18,
      "text": "* One aspects that is mostly left out of the discussion (except from one side comment) is the wallclock time, as some optimizers might be on average quicker to train (for example due to quicker convergence), this can easily lead it to be quicker to tune even though it requires a higher budget of trials.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 19,
      "text": "I think it would be worth discussing this more.",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 20,
      "text": "* minor: in figure 8 in the appendix, the results after 100 iterations is, as far as I understand, over a single replication, so is not particularly reliable (and will always be 100% of a single optimizer)",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJx0zByAYr",
      "sentence_index": 21,
      "text": "* similarly to the above, if the configurations are always sampled from the same 100, confidence intervals in the graphs become less reliable as the budget increases.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 0,
      "text": "Thank you for your review of our work.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 1,
      "text": "The following are your concerns of our work:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 2,
      "text": "a. Limited contribution to the qualitative understanding of the optimizers",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 3,
      "text": "a.i. Informativeness of the proposed w-tunability metric",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 4,
      "text": "b. Using wall-clock time instead of number of HPO oracle calls",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 5,
      "text": "We address these concerns one-by-one.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 6,
      "text": "a. Limited contribution to the qualitative understanding of the optimizers:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 7,
      "text": "We consider the three main contributions of our work to be 1) a systematic evaluation protocol of optimizers, with off-the-shelf HPO to account for the cost of tuning of hyperparameters.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 8,
      "text": "This is missing in existing papers, which consider best attained performance alone.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 9,
      "text": "The importance of a proper hyperparameter search protocol is emphasized by Choi et al., 2019 (published after our submission and under review at ICLR).",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 10,
      "text": "2) a \u201cw-tunability\u201d measure of the cost of hyperparameter optimization, and 3) under the experiments considered we find that Adam (with default beta and epsilon values) is the most tunable.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 11,
      "text": "a.i. Informativeness of the tunability metric:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 12,
      "text": "We propose w-tunability as a metric to incorporate the HPO tuning too in reporting the performance of an optimizer, and compute it as a linear combination of the incumbents of the HPO algorithm, though one can use an arbitrarily complex function trading off interpretability.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 13,
      "text": "It is true that Figures 4-7 essentially contain all the information needed to judge about the optimizer\u2019s tunability.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 14,
      "text": "However, a metric that is easy to compute, interpret and compare optimizers across tasks is crucial, for which we propose w-tunability.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 15,
      "text": "This is analogous to computing specific quantities like accuracy, FPR, TPR from the confusion matrix, even though a confusion matrix contains all the information (and is quite cumbersome to compare).",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 16,
      "text": "The summary metric in Figure 2 provides a different interpretation: It reports a normalized performance i.e., the  normalized incumbent performance at iteration $k$. This doesn\u2019t explicitly include information about the previous $k-1$ iterations (which is our central argument).",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 17,
      "text": "Thus our proposed tunability metric provides more information than the summary statistics plot (figure 2).",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 18,
      "text": "Due to this novelty, we argue that our setup does contribute to the qualitative understanding of the optimizers.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 19,
      "text": "In fact, it yields to a drastically different valuation of adaptive gradient methods than popular previous work (Wilson et al, Shah et al).",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 20,
      "text": "You mention that our work is incremental to the work on benchmarking of optimizers.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 21,
      "text": "Can you please provide respective references?",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_followup",
      "alignment": [
        "context_sentences",
        [
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 22,
      "text": "We have modified parts of our paper to reflect these arguments better.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          9,
          10,
          11
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 23,
      "text": "b. Using wall-clock time instead of HPO oracle calls:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 24,
      "text": "Our reason for using a number of configuration trials instead of a time budget is that measuring number of hyperparameter configuration searches required is more relevant to understand the optimizers\u2019 dependence on the hyperparameters.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 25,
      "text": "However, we completely agree with you that computational budget is a relevant factor from the practitioner\u2019s point of view, and added a discussion of this in Appendix E of the paper.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 26,
      "text": "As you rightfully point out, the adaptive optimizers tend to converge in fewer number of epochs, amplifying the results that favor Adam over the variants of SGD.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 27,
      "text": "References:",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 28,
      "text": "Wilson, Ashia C., et al. \"The marginal value of adaptive gradient methods in machine learning.\" Advances in Neural Information Processing Systems. 2017.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 29,
      "text": "Shah, Vatsal, Anastasios Kyrillidis, and Sujay Sanghavi. \"Minimum norm solutions do not always generalize well for over-parameterized problems.\" arXiv preprint arXiv:1811.07055 (2018).",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "SJx0zByAYr",
      "rebuttal_id": "B1iEvPOjB",
      "sentence_index": 30,
      "text": "Choi, Dami, et al. \"On Empirical Comparisons of Optimizers for Deep Learning.\" arXiv preprint arXiv:1910.05446 (2019).",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    }
  ]
}