{
  "metadata": {
    "forum_id": "rJl2E3AcF7",
    "review_id": "B1euHOqi37",
    "rebuttal_id": "B1e8djk9Tm",
    "title": "Doubly Sparse: Sparse Mixture of Sparse Experts for Efficient Softmax Inference",
    "reviewer": "AnonReviewer2",
    "rating": 6,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=rJl2E3AcF7&noteId=B1e8djk9Tm",
    "annotator": "anno10"
  },
  "review_sentences": [
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 0,
      "text": "The present paper proposes a fast approximation to the softmax computation when the number of classes is very large.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 1,
      "text": "This is typically a bottleneck in deep learning architectures.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 2,
      "text": "The approximation is a sparse two-layer mixture of experts.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 3,
      "text": "The paper lacks rigor and the writing is of low quality, both in its clarity and its grammar.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 4,
      "text": "See a list of typos below.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 5,
      "text": "An example of lack of mathematical rigor is equation 4 in which the same variable name is used to describe the weights before and after pruning, as if it was computer code instead of an equation.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 6,
      "text": "Also pervasive is the use of the asterisk to denote multiplication, again as if it was code and not math.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 7,
      "text": "Algorithm 1 does not include mitosis, which may have an effect on the resulting approximation.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 8,
      "text": "How are the lambda and threshold parameters tuned? The authors mention a validation set, are they just exhaustively explored on a 3D grid on the validation set?",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 9,
      "text": "The results only compare with Shim et al. Why only this method? Why would it be expected to be faster than all the other alternatives? Wouldn't similar alternatives like the sparsely gated MoE, D-softmax and adaptive-softmax have chances of being faster?",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 10,
      "text": "The column \"FLOPS\" in the result seems to measure the speedup, whereas the actual FLOPS should be less when the speed increases.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 11,
      "text": "Also, a \"1x\" label seems to be missing in for the full softmax, so that the reference is clearly specified.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 12,
      "text": "All in all, the results show that the proposed method provides a significant speedup with respect to Shim et al., but it lacks comparison with other methods in the literature.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 13,
      "text": "A brief list of typos:",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 14,
      "text": "\"Sparse Mixture of Sparse of Sparse Experts\"",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 15,
      "text": "\"if we only search right answer\"",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 16,
      "text": "\"it might also like appear\"",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 17,
      "text": "\"which is to design to choose the right\"",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 18,
      "text": "sparsly",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 19,
      "text": "\"will only consists partial\"",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 20,
      "text": "\"with \u03b3 is a lasso threshold\"",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 21,
      "text": "\"an arbitrarily distance function\"",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 22,
      "text": "\"each 10 sub classes are belonged to one\"",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "B1euHOqi37",
      "sentence_index": 23,
      "text": "\"is also needed to tune to achieve\"",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 0,
      "text": "Dear Reviewer,",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 1,
      "text": "Thank you for your valuable comments.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 2,
      "text": "We have revised our writing in the revision, and will further improve its clarity.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 3,
      "text": "Please find our response as follows.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 4,
      "text": "- Algorithm 1 does not include mitosis, which may have an effect on the resulting approximation.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 5,
      "text": "Mitosis training can be considered as executing Algorithm 1 for multiple times with an increasing number of experts and inherited initialization from last round by changing W^e and W^g.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 6,
      "text": "Also, training with mitosis achieves similar performance as training without it shown in Appendix B, Figure (a).",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 7,
      "text": "- How are the lambda and threshold parameters tuned? The authors mention a validation set, are they just exhaustively explored on a 3D grid on the validation set?",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 8,
      "text": "The hyper-parameters related to DS-softmax (such as lambda) are tuned according to the performance on a validation dataset.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 9,
      "text": "Also, as we mentioned in the paper, only one hyper-parameter (group lasso lambda) needs to be tuned.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 10,
      "text": "The heuristic we use to tune group lasso lambda is to increase lambda, starting from a small value, until it hurts the performance.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 11,
      "text": "Also threshold and balancing lambda variables are kept fixed as (0.01 and 10).",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 12,
      "text": "- Why would it be expected to be faster than all the other alternatives? Wouldn't similar alternatives like the sparsely gated MoE, D-softmax and adaptive-softmax have chances of being faster?",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 13,
      "text": "In terms of baselines, SVD-softmax (NIPS\u201917) was chosen since it is a recent method that provides a significant inference speedup for softmax.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 14,
      "text": "Other alternatives, such as D-softmax and adaptive-softmax, focus on training instead of inference speedup.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 15,
      "text": "Furthermore, as claimed in their papers, they achieve limited speedup (around 5x) in language modeling, which is much worse than ours.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 16,
      "text": "With regards to Sparsely Gated MoE, it cannot speed up inference, since they select expert with full softmax.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 17,
      "text": "We would like to emphasize that most existing methods for inference speedup focus on approximating trained softmax layer, which usually suffers a loss on performance.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "B1euHOqi37",
      "rebuttal_id": "B1e8djk9Tm",
      "sentence_index": 18,
      "text": "Our model allows the adaptive adjustment of the softmax layer, achieves speedup through capturing the two-level overlapped hierarchy during training, which is novel and does not suffer from the performance loss.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    }
  ]
}