{
  "metadata": {
    "forum_id": "rJlqoTEtDB",
    "review_id": "S1gTSur19S",
    "rebuttal_id": "BJgxTZgSir",
    "title": "PowerSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization",
    "reviewer": "AnonReviewer3",
    "rating": 8,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=rJlqoTEtDB&noteId=BJgxTZgSir",
    "annotator": "anno10"
  },
  "review_sentences": [
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 0,
      "text": "This paper proposes, analyzes, and empirically evaluates PowerSGD (and a version with momentum), a simple adjustment to standard SGD algorithms that alleviates issues caused by poorly scaled gradients in SGD.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 1,
      "text": "The rates in the theoretical analysis are competitive with those for standard SGD, and the empirical results argue that PowerSGD algorithms are competitive with widely used adaptive methods such as Adam and RMSProp, suggesting that PowerSGD may be a useful addition to the armory of adaptive SGD algorithms.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 2,
      "text": "Overall I recommend acceptance of this paper, although I think there may be a couple of places where the authors overclaim a bit on the theoretical side.",
      "suffix": "",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 3,
      "text": "Specifically:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 4,
      "text": "\u2022",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 5,
      "text": "The convergence analysis assumes a batch size equal to T, the number of steps of PowerSGD.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 6,
      "text": "This implies that the amount of work (in FLOPs) done by the algorithm (at least the version being analyzed) is quadratic in T, which makes the convergence rates a bit misleading.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 7,
      "text": "If one reframes the convergence rate in terms of FLOPs U=T^2 instead of iterations, then the convergence rate drops from 1/T to 1/sqrt(U), which undermines the claim in remark 3.4 that this analysis is superior to that of Yan et al. (2018).",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 8,
      "text": "\u2022",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 9,
      "text": "In Remark 3.4.3, the authors claim that another point of difference between their results and Yan et al.'s (2018) is that Yan et al. assume bounded gradients, an assumption that is not satisfied for e.g., mean squared error (MSE).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_quote",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 10,
      "text": "But a very similar assumption is hidden in the bounded-gradient-variance assumption (Assumption 3.2); for example, Assumption 3.2 is clearly not satisfied by the least-squares regression problem",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 11,
      "text": "min_\u03b2 (1/N)\u03a3_n (y_n \u2013 x_n \u2022\u00a0\u03b2)^2",
      "suffix": "\n",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 12,
      "text": "with the minibatch gradient estimator computed over randomly chosen minibatches B:",
      "suffix": "\n",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 13,
      "text": "\\hat g = (1/|B|) \u03a3_{n \\in B} x_n (y_n \u2013 x_n \u2022\u00a0\u03b2).",
      "suffix": "\n",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 14,
      "text": "As the norm of \u03b2 goes to infinity, so does the expected norm of the error of \\hat g. I'm not saying this is a particularly big",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 15,
      "text": "deal, just that it's not an improvement over Yan et al.'s result.",
      "suffix": "\n\n",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 16,
      "text": "That aside, this seems like good work that could have a significant impact on practice.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 17,
      "text": "A couple of other minor points:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 18,
      "text": "\u2022",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 19,
      "text": "It looks like neither the experiments nor Theorem 3.2 show any benefit to PowerSGDM over PowerSGD.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 20,
      "text": "It would be nice to see some discussion (or at least speculation) on why that is.",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 21,
      "text": "\u2022\u00a0Not all of the arrows in Figure 1 are pointing to the right lines.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "S1gTSur19S",
      "sentence_index": 22,
      "text": "\u2022\u00a0In the abstract, it might be good to clarify that the exponentiation is elementwise.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 0,
      "text": "We thank the reviewer for the positive assessment of our work.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 1,
      "text": "We would like to start by stating that we did not mean to claim that the rate of convergence proved in this paper is better than that of Yan et al.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 2,
      "text": "We have modified the Remarks to clarify the statements.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 3,
      "text": "In the stochastic gradient setting, the number of gradient evaluations is indeed $T^2$. This is consistent with the result in Bernstein et al. (2018).",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 4,
      "text": "The main point we would like to make is that the bounds are very concise and exactly reduce to that of gradient descent/stochastic gradient descent in the special cases.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 5,
      "text": "We thank you for pointing out that the bounded variance assumption may also be restrictive and only satisfied on bounded domains.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 6,
      "text": "It is nonetheless a standard assumption made in the literature.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 7,
      "text": "We have modified Remark 3.4 (and added Remark 3.5) to make this clear in the updated version.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 8,
      "text": "Response to other minor points:",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          17,
          19,
          20,
          21,
          22
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 9,
      "text": "Our convergence analysis is done for non-convex objective functions (similar to that of Yan et al. and Bernstein et al.).",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 10,
      "text": "In the non-convex setting, to the best of our knowledge, there are no theoretical results that show benefits of momentum methods over SGD.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 11,
      "text": "For experiments, we speculate that the reason is that the batch size used is too small for (Powered)SGDM to gain an advantage over (Powered)SGD.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 12,
      "text": "We plan to add SGD as a reference algorithm (as suggested by another reviewer).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {
        "manuscript_change": true
      }
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 13,
      "text": "Once the experiments are complete, we should be able to see how SGDM compares with SGD in the experiments.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 14,
      "text": "This may take a while for the ImageNet experiments, but we promise to do so in the final version.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {
        "manuscript_change": true
      }
    },
    {
      "review_id": "S1gTSur19S",
      "rebuttal_id": "BJgxTZgSir",
      "sentence_index": 15,
      "text": "We have fixed the other issues you raised in your other minor comments. If you have any further comments, please let us know.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_sentences",
        [
          21,
          22
        ]
      ],
      "details": {
        "manuscript_change": true
      }
    }
  ]
}