{
  "metadata": {
    "forum_id": "rJl6M2C5Y7",
    "review_id": "r1labrqphm",
    "rebuttal_id": "rkxxb72KCX",
    "title": "Online Hyperparameter Adaptation via Amortized Proximal Optimization",
    "reviewer": "AnonReviewer1",
    "rating": 6,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=rJl6M2C5Y7&noteId=rkxxb72KCX",
    "annotator": "anno14"
  },
  "review_sentences": [
    {
      "review_id": "r1labrqphm",
      "sentence_index": 0,
      "text": "Summary:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 1,
      "text": "This paper introduces Amortized Proximal Optimization (APO) that optimizes a proximal objective at each optimization step.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 2,
      "text": "The optimization hyperparameters are optimized to best minimize the proximal objective.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 3,
      "text": "The objective is represented using a regularization style parameter lambda and a distance metric D that, depending on its definition, reduces the optimization procedure to Gauss-Newton, General Gauss Newton or Natural Gradient Descent.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 4,
      "text": "There are two key convergence results which are dependent on the meta-objective being optimized directly which, while not practical, gives some insight into the inner workings of the algorithm.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 5,
      "text": "The first result indicates strong convergence when using the Euclidean distance as the distance measure D.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 6,
      "text": "The second result shows strong convergence when D is set as the Bregman divergence.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 7,
      "text": "The algorithm optimizes the base optimizer on a number of domains and shows state-of-the-art results over a grid search of the hyperparameters on the same optimizer.",
      "suffix": "\n\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 8,
      "text": "Clarity and Quality: The paper is well written.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 9,
      "text": "Originality: It appears to be a novel application of meta-learning.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 10,
      "text": "I wonder why the authors didn\u2019t compare or mention optimizers such as ADAM and ADAGRAD which adapt their parameters on-the-fly as well.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 11,
      "text": "Also how does this compare to adaptive hyperparameter training techniques such as population based training?",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 12,
      "text": "Significance:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 13,
      "text": "Overall it appears to be a novel and interesting contribution.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 14,
      "text": "I am concerned though why the authors didn\u2019t compare to adaptive optimizers such as ADAM and ADAGRAD and how the performance compares with population based training techniques.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 15,
      "text": "Also, your convergence results appear to rely on strong convexity of the loss.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 16,
      "text": "How is this a reasonable assumption?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 17,
      "text": "These are my major concerns.",
      "suffix": "\n\n",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1labrqphm",
      "sentence_index": 18,
      "text": "Question: In your experiments, you set the learning rate to be really low. What happens if you set it to be arbitrarily high? Can you algorithm recover good learning rates?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 0,
      "text": "Thank you for your insightful comments. We have incorporated your suggestions into the revised version of the paper.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 1,
      "text": "Q: Relationship to optimizers with adaptive learning rates, and comparison between Adam and Adam-APO.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          10,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 2,
      "text": "While Adam and Adagrad are often described as having \u201cadaptive learning rates,\u201d they still have a global learning rate that is just as critical to tune as for SGD.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_refute-question",
      "alignment": [
        "context_sentences",
        [
          10,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 3,
      "text": "In our experiments, we consider tuning the learning rate for RMSprop, which also maintains adaptive learning rates for each parameter, and is closely related to Adam/Adagrad.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_refute-question",
      "alignment": [
        "context_sentences",
        [
          10,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 4,
      "text": "Adam is essentially RMSprop with momentum; APO can be applied to Adam by applying momentum on top of the updates computed by APO.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_refute-question",
      "alignment": [
        "context_sentences",
        [
          10,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 5,
      "text": "To address your question about Adam, we added experiments for tuning the global learning rate of Adam with APO in appendix Section G, Figure 14, where Adam-APO achieves better performance than Adam with a fixed global learning rate, and achieves comparable performance as Adam with a manual schedule.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 6,
      "text": "Q: Comparison with population-based training (PBT)",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 7,
      "text": "We have added a comparison between APO and PBT in appendix Section H, Figure 15.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 8,
      "text": "For population-based training, one must carefully select many hyperparameters, including the size of the population, the perturbation strategy (e.g., randomly perturb the learning rate by multiplying it by 1.2 or 0.8), the exploration interval (e.g., the number of training iterations to run before exploiting other members of the population).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 9,
      "text": "We used PBT and APO to tune the learning rate of RMSprop while training a ResNet34 model on CIFAR-10.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 10,
      "text": "For PBT, we used a population of size 4, and chose to exploit/explore after each epoch of training.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 11,
      "text": "We tried multiple exploration strategies, and found that it was critical to set the probability of resampling a learning rate from an underlying distribution to be 0; otherwise, the learning rates could jump from small to large values, and yield unstable training.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 12,
      "text": "In contrast, APO only requires a simple grid search over lambda, and all other hyperparameters can be kept at their default settings.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 13,
      "text": "We found that  APO substantially outperformed PBT, achieving a lower final training loss and equal test accuracy in much less wall-clock time; this shows the advantage of gradient-based methods for tuning learning rates, such as APO, compared to evolutionary methods based on random perturbations such as PBT.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 14,
      "text": "Q: The convergence results appear to rely on strong convexity of the loss.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 15,
      "text": "How is this a reasonable assumption?",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 16,
      "text": "Note that we assume strong convexity of the loss as a function of the output units, not as a function of the weights.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 17,
      "text": "Hence, our assumption is fairly realistic in the neural net setting.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 18,
      "text": "The loss function on top of the network output is usually defined as a simple convex function; for instance, in regression, a common choice of loss function is the quadratic loss (i.e, the squared distance between the network output and the true label), which is strongly convex.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 19,
      "text": "In fact, even without assuming that the loss function is strongly convex and that the output manifold is dense, we are still able to show a fast convergence rate.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 20,
      "text": "In the updated version of the paper, we show that our algorithm with an oracle converges to stationary point globally with a fast rate, which provides insight into why APO works well.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 21,
      "text": "Q: In your experiments, you set the learning rate to be really low. What happens if you set it to be arbitrarily high? Can you algorithm recover good learning rates?",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 22,
      "text": "APO is robust to the initial learning rate of the base optimizer, using the default meta learning rate suggested in our updated paper.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 23,
      "text": "We have added a section to the appendix in which we include RMSprop-APO experiments on Rosenbrock, MNIST, and CIFAR-10 to show that the training loss, test accuracy, and learning rate trajectories are nearly identical when starting with initial learning rates {1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7}, spanning 5 orders of magnitude.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1labrqphm",
      "rebuttal_id": "rkxxb72KCX",
      "sentence_index": 24,
      "text": "Note that 1e-2 is quite a large initial learning rate for RMSprop.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          18
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    }
  ]
}