{
  "metadata": {
    "forum_id": "BJGVX3CqYm",
    "review_id": "H1gK_Y3J6Q",
    "rebuttal_id": "S1lyG-h7A7",
    "title": "Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search",
    "reviewer": "AnonReviewer1",
    "rating": 7,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=BJGVX3CqYm&noteId=S1lyG-h7A7",
    "annotator": "anno10"
  },
  "review_sentences": [
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 0,
      "text": "In this work the authors introduce a new method for neural architecture search (NAS) and use it in the context of network compression.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 1,
      "text": "Specifically, the NAS method is used to select the precision quantization of the weights at each layer of the neural network.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 2,
      "text": "Briefly, this is done by first defining a super network, which is a DAG where for each pair of nodes, the output node is the linear combination of the outputs of all possible operations (i.e., layers with different precision quantizations).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 3,
      "text": "Following [1], the weights of the linear combination are regarded as the probabilities of having certain operations (i.e., precision quantization), which allows for learning a probability distribution over the considered operations.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 4,
      "text": "Differently from [1], however, the authors bridge the soft sampling in [1] (where all operations are considered together but weighted accordingly to the corresponding probabilities) to a hard sampling (where a single operation is considered with the corresponding probability) through an annealing procedure based on the Gumbel Softmax technique.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 5,
      "text": "Through the proposed NAS algorithm, one can learn a probability distribution on the operations by minimizing a loss that accounts for both accuracy and model size.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 6,
      "text": "The final output of this search phase is a set of sampled architectures (containing a single operation at each connection between nodes), which are then retrained from scratch.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 7,
      "text": "In applications to CIFAR-10 and ImageNet, the authors achieve (and sometime surpass) state-of-the-art performance in model compression.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 8,
      "text": "The two contributions of this work are",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 9,
      "text": "1)\tA new approach to weight quantization using principles of NAS that is novel and promising;",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 10,
      "text": "2)\tNew insights/technical improvements in the broader field of NAS.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 11,
      "text": "While the utility of the method in the more general context of NAS has not been shown, this work will likely be of interest to the NAS community.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 12,
      "text": "I only have one major concern.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 13,
      "text": "The architectures are sampled from the learnt probability distribution every certain number of epochs while training the supernet.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 14,
      "text": "Why? If we are learning the distribution, would not it make sense to sample all architectures only after training the supernet at our best?",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 15,
      "text": "This reasoning leads me to a second question.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 16,
      "text": "In the CIFAR-10 experiments, the authors sample 5 architecture every 10 epochs, which means 45 architectures (90 epochs were considered).",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 17,
      "text": "This is a lot of architectures, which makes me wonder: how would a \u201ccost-aware\u201d random sampling perform with the same number of sampled architectures?",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 18,
      "text": "Also, I have some more questions/minor concerns:",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 19,
      "text": "1)\tThe authors say that the expectation of the loss function is not directly differentiable with respect to the architecture parameters because of the discrete random variable.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_quote",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 20,
      "text": "For this reason, they introduce a Gumbel Softmax technique, which makes the mask soft, and thus the loss becomes differentiable with respect to the architecture parameters.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 21,
      "text": "However, subsequently in the manuscript, they write that Eq 6 provides an unbiased estimate for the gradients.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 22,
      "text": "Do they here refer to the gradients with respect to the weights ONLY?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 23,
      "text": "Could we say that the advantage of the Gumbel Softmax technique is two-fold? i) make the loss differentiable with respect to the arch parameters; ii) reduce the variance of the estimate of the loss gradients with respect to the network weights.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 24,
      "text": "2)\tCan the author discuss why the soft sampling procedure in [1] is not enough? I have an intuitive understanding of this, but I think this should be clearly discussed in the manuscript as this is a central aspect of the paper.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 25,
      "text": "3)\tThe authors use a certain number of warmup steps to train the network weights without updating the architecture parameters to ensure that \u201cthe weights are sufficiently trained\u201d. Can the authors discuss the choice on the number of warmup epochs?",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 26,
      "text": "I gave this paper a 5, but I am overall supportive. Happy to change my score if the authors can address my major concern.",
      "suffix": "\n\n",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 27,
      "text": "[1] Liu H, Simonyan K, Yang Y. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055. 2018 Jun 24.",
      "suffix": "\n\n",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 28,
      "text": "-----------------------------------------------------------",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 29,
      "text": "Post-Rebuttal",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 30,
      "text": "---------------------------------------------------------",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "sentence_index": 31,
      "text": "The authors have fully addressed my concerns. I changed the rating to a 7.",
      "suffix": "",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 0,
      "text": "We want to thank the reviewer#1 for your feedback.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 1,
      "text": "Your summary correctly reflects the content of our paper.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 2,
      "text": "We hope this rebuttal can address your concerns.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 3,
      "text": "Major concern: Trained sampling vs random sampling",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          12,
          13,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 4,
      "text": "We sample architectures every a few epochs, mainly because in our experiments, we want to analyze the behavior of the architecture distribution at different super net training epochs.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          12,
          13,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 5,
      "text": "This analysis is illustrated in figure 3 of our paper.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          12,
          13,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 6,
      "text": "We can see that at epoch-0, where the architecture distribution is trained for only one epoch (close to random sampling), the sampled architectures have much lower compression rate.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          12,
          13,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 7,
      "text": "Similarly, for epoch-9, architectures also have relatively low compression rate.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          12,
          13,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 8,
      "text": "In comparison, at epoch-79 and epoch-89, architectures have higher compression rates and accuracy.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          12,
          13,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 9,
      "text": "The difference between epoch-79 vs. epoch-89 is small since the distribution has converged.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          12,
          13,
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 10,
      "text": "As the reviewer#2 suggests, we can train the super net until the last epoch, then sample and train architectures from this distribution.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 11,
      "text": "Figure 3 shows that the five architectures sampled at epoch-89 are much better than the five architectures at epoch-0, which are essentially drawn from random sampling.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 12,
      "text": "Also, note that for CIFAR10-ResNet-110 experiments, the search space contains 7^54 = 4x10^45 possible architectures, 45 sampled architectures are tiny compared with the search space.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 13,
      "text": "Reviewer #2 suggests comparing with a \u201ccost-aware\u201d random sampling policy.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 14,
      "text": "We tried a simple baseline that at each layer, we sample a conv operator with b-bit precision with probability",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {
        "request_out_of_scope": false
      }
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 15,
      "text": "prob(precision=b) ~",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 16,
      "text": "1/(1 + b)",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 17,
      "text": "The performance of this policy is much worse since for a conv operator with precision-0 (in our notation, bit-0 denotes we skip the layer), the sampling probability is 33x higher than full-precision convolution, 2x higher than 1-bit, 3x higher than 2-bit, and so on.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 18,
      "text": "Architectures sampled from this distribution are extremely small but with much worse accuracy.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 19,
      "text": "We understand this might not be the best \u201ccost-aware\u201d sampling policy.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 20,
      "text": "If reviewer#1 has better suggestions, we are happy to try.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 21,
      "text": "Minor concern #1: Value of the Gumbel Softmax function",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20,
          21,
          22,
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 22,
      "text": "Yes. We agree with the comments that the advantages of the Gumbel Softmax technique are two-fold:",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20,
          21,
          22,
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 23,
      "text": "1. It makes the loss function differentiable with respect",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20,
          21,
          22,
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 24,
      "text": "to the architecture parameter \\theta",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20,
          21,
          22,
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 25,
      "text": ".",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20,
          21,
          22,
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 26,
      "text": "2. Compared with other gradient estimation techniques such as Reinforce, Gumbel Softmax balances the variance/bias of the gradient estimation with respects to weights.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20,
          21,
          22,
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 27,
      "text": "Minor concern #2: Comparison with non-stochastic method such as DARTS",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 28,
      "text": "DARTS [1] does not really sample candidate operators during the forward pass.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 29,
      "text": "Outputs of candidate operators are multiplied with some coefficients and summed together.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 30,
      "text": "For the problem of mixed precision quantization, this can be problematic.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 31,
      "text": "Let's consider a simplified scenario",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 32,
      "text": "y = alpha_1 * y_1 + alpha_2 * y_2",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 33,
      "text": "Let's assume both y_1 and y_2 are in binary and are in {0, 1}. Assuming alpha_1=0.5 and alpha_2=0.25, then the possible values of y are {0, 0.25, 0.5, 0.75}, which essentially extend the effective bit-width to 2 bit.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 34,
      "text": "This is good for the super net's accuracy, but the performance of the super net cannot transfer to the searched architectures in which we have to pick only one operator per layer.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 35,
      "text": "Using our method, however, the sampling ensures that the super net only picks one operator at a time and the behavior can transfer to the searched architectures.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 36,
      "text": "Minor concern #3: Warmup training",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 37,
      "text": "We use warmup training since in our ImageNet experiments.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 38,
      "text": "We observe that at the beginning of the super net training, the operators are not sufficiently trained, and their contributions to the overall accuracy are not clear, but their cost differences are always significant.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 39,
      "text": "As a result, the search always picks low-cost operators.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 40,
      "text": "To prevent this, we use warmup training to ensure all the candidate operators are sufficiently trained before we optimize architecture parameters.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 41,
      "text": "In our ImageNet experiments, we found that ten warmup epochs are good enough.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1gK_Y3J6Q",
      "rebuttal_id": "S1lyG-h7A7",
      "sentence_index": 42,
      "text": "In CIFAR-10 experiments, warmup training is not needed.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          25
        ]
      ],
      "details": {}
    }
  ]
}