{
  "metadata": {
    "forum_id": "HJzLdjR9FX",
    "review_id": "rJgUwKHDhX",
    "rebuttal_id": "B1gCk1tNpm",
    "title": "DeepTwist: Learning Model Compression via Occasional Weight Distortion",
    "reviewer": "AnonReviewer3",
    "rating": 4,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=HJzLdjR9FX&noteId=B1gCk1tNpm",
    "annotator": "anno10"
  },
  "review_sentences": [
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 0,
      "text": "This paper proposed a general framework, DeepTwist, for model compression.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 1,
      "text": "The so-called weight distortion procedure is added into the training every several epochs.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 2,
      "text": "Three applications are shown to demonstrate the usage of the proposed approach.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 3,
      "text": "Overall, I think the novelty of the paper is very limited, as all the weight distortion algorithms in the paper can be formulated as the proximal function in proximal gradient descent.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 4,
      "text": "See http://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/08-prox-grad-scribed.pdf for a reference.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 5,
      "text": "Specifically, the proposed framework can be easily reformulated as a loss function plus a regularizer for proximal gradient.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 6,
      "text": "Using gradient descent (GD), there will be two steps: (1) finding a new solution using GD, and (2) project the new solution using proximal function.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 7,
      "text": "Now in deep learning, since SGD is used for optimization, several steps are need to locate reasonable solutions, i.e. the Distortion Step in the framework.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 8,
      "text": "Then proximal function can be applied directly after Distortion Step to project the solutions.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 9,
      "text": "In this way, we can easily see that the proposed framework is a stochastic version of proximal gradient descent.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 10,
      "text": "Since SGD is used for training, several minibatches are needed to achieve a relatively stable solution for projection using the proximal function, which is exactly the proposed framework in Fig. 1.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJgUwKHDhX",
      "sentence_index": 11,
      "text": "PS: After discussion, I think the motivation of the method is not clear to understand why the proposed method works.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 0,
      "text": "Thank you for the review.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 1,
      "text": "While formulating a proximal function for model compression might be an interesting idea (if search space is highly limited) as the reviewer suggested, we believe that our proposed method is fundamentally different from proximal gradient descent approaches due to the following reasons:",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 2,
      "text": "1) Proximal gradient descent is meant to solve a convex optimization problem while our aim is to solve a non-convex problem in which each local minimum exhibits vastly different test accuracy after compression.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 3,
      "text": "Jumping to another local minimum from a certain minimum would not be easily achieved by convex optimization methods.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 4,
      "text": "2) Finding a particular flat minimum is the key to obtaining good model compression (and good generalization as well).",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 5,
      "text": "Such an exploration, however, cannot be obtained by a proximal function since we need to investigate lots of different local minima with different amount of flatness in loss surface.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 6,
      "text": "3) While proximal gradient descent can be useful to find a certain local minimum close to the starting point given a convex constraint, wide exploration (associated with possibly transient accuracy loss in the initial training as shown in Figure 2.(b)) is necessary to escape from a point with sharp loss surface.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 7,
      "text": "Investigating many different local minima would be only available with large learning rate (as we have chosen for our experiments) and/or large amount of weight distortion.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 8,
      "text": "4) Our effort to introduce optimal distortion step size and learning rate for a given compression problems is connected to exploration, not exploitation (which potentially supported by proximal functions where convergence matters).",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 9,
      "text": "Even though proximal gradient descent selects step size only considering convergence, Figure 1 can lead to the results such as Figure 2(b) which cannot be obtained if only local exploitation is employed.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 10,
      "text": "Finding a flat minimum has been known to be a difficult work as shown in the paper \u201cOn large-batch training for deep learning: generalization gap and sharp minima\u201d, ICLR 2016.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 11,
      "text": "We firmly believe that our search space exploration method based on optimal distortion step size and amount of weight distortion enable us to produce better local minima well-suited to various model compression techniques.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 12,
      "text": "In short, unfortunately, we have failed to understand how you could connect our technique to proximal functions and proximal gradient descent.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          3
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJgUwKHDhX",
      "rebuttal_id": "B1gCk1tNpm",
      "sentence_index": 13,
      "text": "We strongly hope that you reconsider your decision.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    }
  ]
}