{
  "metadata": {
    "forum_id": "rkxd2oR9Y7",
    "review_id": "BJgICDN92m",
    "rebuttal_id": "r1e2sLyfRm",
    "title": "The Case for Full-Matrix Adaptive Regularization",
    "reviewer": "AnonReviewer2",
    "rating": 5,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=rkxd2oR9Y7&noteId=r1e2sLyfRm",
    "annotator": "anno2"
  },
  "review_sentences": [
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 0,
      "text": "The paper considers adaptive regularization, which has been popular in neural network learning.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 1,
      "text": "Rather than adapting diagonal elements of the adaptivity matrix, the paper proposes to consider a low-rank approximation to the Gram/correlation matrix.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 2,
      "text": "When you say that full-matrix computation \"requires taking the inverse square root\", I assume you know that is not really correct?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 3,
      "text": "As a matter of good implementation, one never takes the inverse of anything.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 4,
      "text": "Instead, on solves a linear system, via other means.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 5,
      "text": "Of course, approximate linear system solvers then permit a wide tradeoff space to speed things up.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 6,
      "text": "There are several issues convolved here: one is ``full-matrix,'' another is that this is really a low-rank approximation to a matrix and so not full matrix, another is that this may or may not be implementable on GPUs.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 7,
      "text": "The latter may be important in practice, but it is orthogonal to the full matrix theory.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 8,
      "text": "There is a great deal of discussion about full-matrix preconditioning, but there is no full matrix here.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 9,
      "text": "Instead, it is a low-rank approximation to the full matrix.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 10,
      "text": "If there were theory to be had here, then I would guess that the low-rank approximation may work even when full matrix did not, e.g., since the full matrix case would involve too may parameters.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 11,
      "text": "The discussion of convergence to first order critical points is straightforward.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_quote",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 12,
      "text": "Adaptivity ratio is mentioned in the intro but not defined there.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 13,
      "text": "Why mention it here, if it's not being defined.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 14,
      "text": "You say that second order methods are outside the scope, but you say that your method is particularly relevant for ill-conditioned problems.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_quote",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 15,
      "text": "It would help to clarify the connection between the Gram/correlation matrix of gradients and the Hessian and what is being done to ill-conditioning, since second order methods are basically designed for ill-conditioned problems..",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 16,
      "text": "It is difficult to know what the theory says about the empirical results, given the tweaks discussed in Sec 2.2, and so it is difficult to know what is the benefit of the method versus the tweaks.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 17,
      "text": "The results shown in Figure 4 are much more interesting than the usual training curves which are shown in the other figures.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 18,
      "text": "If this method is to be useful, understanding how these spectral properties change during training for different types of networks is essential.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_soundness-correctness",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 19,
      "text": "More papers should present this, and those that do should do it more systematically.",
      "suffix": "\n\n",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJgICDN92m",
      "sentence_index": 20,
      "text": "You say that you \"informally state the main theorem.\"  The level of formality/informality makes it hard to know what is really being said.  You should remove it if it is not worth stating precisely, or state it precisely.  (It's fair to modularize the proof, but as it is it's hard to know what it's saying, except that your method comes with some guarantee that isn't stated.)",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 0,
      "text": "Thanks for the review.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 1,
      "text": "There are two significant inaccuracies:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 2,
      "text": "1. GGT does not take the view of a low-rank *approximation*. This is a central point of the paper.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6,
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 3,
      "text": "2. Re: iterative methods: the preconditioner is a -1/2 power of the Gram matrix, not the inverse.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          2,
          3,
          4,
          5
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 4,
      "text": "More details below:",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 5,
      "text": "@Inverse square root: We are fully aware of the distinction.",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          2,
          3,
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 6,
      "text": "- Note that iterative solvers like conjugate gradient do not immediately apply here, as we are solving a linear system in M^{1/2}, not M.",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          2,
          3,
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 7,
      "text": "- Krylov subspace iterative solvers suffer from a condition number dependence, incurring a hard tradeoff between iteration complexity and \\eps. [1]",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          2,
          3,
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 8,
      "text": "- We actually *did* try polynomial approximations to M^{-1/2} as an alternative to our proposed small-SVD step.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          2,
          3,
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 9,
      "text": "We saw worse approximation (the condition number dependence kicks in) and worse GPU performance (parallel computation time scales with polynomial degree).",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          2,
          3,
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 10,
      "text": "@Full-matrix terminology: The use of \u201cfull-matrix\u201d to distinguish from \u201cdiagonal-matrix\u201d is standard, and taken directly from [2].",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6,
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 11,
      "text": "@Full-matrix vs. full-rank: Note that we do not consider the windowed Gram matrix to be an \u201capproximation\u201d of the \u201cfull\u201d gram matrix.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 12,
      "text": "The window is for the purpose of forgetting gradients from the distant past, motivated by (1) our theory, (2) the small-scale synthetic experiments, and (3) the extreme ubiquity of Adam and RMSprop, which do the same.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 13,
      "text": "Note that we do no approximation on the windowed Gram matrix, the fact that it is low rank is a feature.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 14,
      "text": "@Location of \\mu definition: Is the reviewer\u2019s suggestion simply to move this definition into the intro?",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_followup",
      "alignment": [
        "context_sentences",
        [
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 15,
      "text": "@Comparison with second-order methods",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14,
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 16,
      "text": ": Please refer to our response to Reviewer 1 for some additional comments.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14,
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 17,
      "text": "@Tweaks: We don\u2019t believe that any of the tweaks should be so controversial.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 18,
      "text": "- The \\eps parameters are present in *every* adaptive optimizer, for stability.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 19,
      "text": "The interpolation with SGD is just another take on this.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 20,
      "text": "- The exponential smoothing of the first moment estimator is a subtler point.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 21,
      "text": "As we point out in Appendix A.2, in the theory for Adam/AMSgrad [3,4], \\beta_1 *degrades* the moment estimation, yet everyone uses momentum in practice.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 22,
      "text": "Even if this is unconvincing, the performance gap upon removing this tweak is minor, and our empirical results hold without this tweak.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 23,
      "text": "We are simply offering a heuristic that we have observed to help training unconditionally, just like momentum in Adam.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 24,
      "text": "@Informal main theorem: By \u201cinformal\u201d we truly mean that we are suppressing the smoothness constants (L, M) for readability and space constraints. We are simply adopting the widespread practice of deferring the non-asymptotic mathematical statement to the appendix.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 25,
      "text": "[1] Tight complexity bounds for optimizing composite objectives. Blake E Woodworth, Nati Srebro. NIPS 2016.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 26,
      "text": "[2] Adaptive subgradient methods for online learning and stochastic optimization. J Duchi, E Hazan, Y Singer. JMLR 2012.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 27,
      "text": "[3] Adam: A Method for Stochastic Optimization. D.Kingma,J. Ba. ICLR 2015.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJgICDN92m",
      "rebuttal_id": "r1e2sLyfRm",
      "sentence_index": 28,
      "text": "[4] On the Convergence of Adam and Beyond. S. Reddi, S. Kale, S. Kumar. ICLR 2018.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    }
  ]
}