{
  "metadata": {
    "forum_id": "rkxd2oR9Y7",
    "review_id": "SJgzxEO5hQ",
    "rebuttal_id": "S1xoNIkzAQ",
    "title": "The Case for Full-Matrix Adaptive Regularization",
    "reviewer": "AnonReviewer3",
    "rating": 6,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=rkxd2oR9Y7&noteId=S1xoNIkzAQ",
    "annotator": "anno2"
  },
  "review_sentences": [
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 0,
      "text": "adaptive versions of sgd are commonly used in machine learning.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 1,
      "text": "adagrad, adadelta are both popular adaptive variations of sgd.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 2,
      "text": "These algorithms can be seen as preconditioned versions of gradient descent where the preconditioner applied is a matrix of second-order moments of the gradients.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 3,
      "text": "However, because this matrix turns out to be a pxp matrix where p is the number of parameters in the model, maintaining and performing linear algebra with this pxp matrix is computationally intensive.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 4,
      "text": "In this paper, the authors show how to maintain and update this pxp matrix by storing only smaller matrices of size pxr and rxr, and performing 1. an SVD of a small matrix of size rxr",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 5,
      "text": "2. matrix-vector multiplication between a pxr matrix and rx1 vector.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 6,
      "text": "Given that rxr is a small constant sized matrix and that matrix-vector multiplication can be efficiently computed on GPUs, this matrix adapted SGD can be made scalable.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 7,
      "text": "The authors also discuss how to adapt the proposed algorithm with Adam style updates that incorporate momentum.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 8,
      "text": "Experiments are shown on various architectures (CNN, RNN) and comparisons are made against SGD, ADAM.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 9,
      "text": "General comments: THe appendix has some good discussion and it would be great if some of that discussion was moved to the main paper.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 10,
      "text": "Pros:  Shows how to make full matrix preconditioning efficient, via the use of clever linear algebra, and GPU computations.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 11,
      "text": "Shows improvements on LSTM tasks, and is comparable with SGD, matching accuracy with time.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_positive"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 12,
      "text": "Cons: While doing this leads to better convergence, each update is still very expensive compared to standard SGD, and for instance on vision tasks the algorithm needs to run for almost double the time to get similar accuracies as an SGD, adam solver.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 13,
      "text": "This means that it is not apriori clear if using this solver instead of standard SGD, ADAM is any good.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 14,
      "text": "It might be possible that if one performs few steps of GGT optimizer in the initial stages and then switches to SGD/ADAM in the later stages, then some of the computational concerns that arise are eliminated.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "SJgzxEO5hQ",
      "sentence_index": 15,
      "text": "Have the authors tried out such techniques?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_clarity",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "SJgzxEO5hQ",
      "rebuttal_id": "S1xoNIkzAQ",
      "sentence_index": 0,
      "text": "Thanks for the review.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "SJgzxEO5hQ",
      "rebuttal_id": "S1xoNIkzAQ",
      "sentence_index": 1,
      "text": "@Update overhead: We argue that per-iteration performance is a worthwhile objective in itself, which is less significant in some scenarios (e.g. costly function evaluation, like in RL, or expensive backprops, like in RNNs).",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJgzxEO5hQ",
      "rebuttal_id": "S1xoNIkzAQ",
      "sentence_index": 2,
      "text": "That said, we were indeed not able to demonstrate end-to-end gains in vision.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJgzxEO5hQ",
      "rebuttal_id": "S1xoNIkzAQ",
      "sentence_index": 3,
      "text": "Please note that in the NLP benchmark our algorithm finds a better solution and wins in wall-clock time.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_sentences",
        [
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "SJgzxEO5hQ",
      "rebuttal_id": "S1xoNIkzAQ",
      "sentence_index": 4,
      "text": "@Switching: This is a good suggestion, and we indeed do cite one of the papers attempting to approach optimizer-switching in a principled way.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14,
          15
        ]
      ],
      "details": {
        "request_out_of_scope": false
      }
    },
    {
      "review_id": "SJgzxEO5hQ",
      "rebuttal_id": "S1xoNIkzAQ",
      "sentence_index": 5,
      "text": "We found that we could squeeze out some wall-clock gains by applying the expensive update more sparingly, but the value of including this in the paper was unclear (effectively adding a host of hyperparameters orthogonal to the central idea).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          14,
          15
        ]
      ],
      "details": {}
    }
  ]
}