{
  "metadata": {
    "forum_id": "BkgrBgSYDS",
    "review_id": "HyeOD450tr",
    "rebuttal_id": "r1eYbUQ_oB",
    "title": "Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps",
    "reviewer": "AnonReviewer2",
    "rating": 8,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=BkgrBgSYDS&noteId=r1eYbUQ_oB",
    "annotator": "anno2"
  },
  "review_sentences": [
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 0,
      "text": "This paper introduces a structured drop-in replacement for linear layers in a neural network, referred to as Kaleidoscope matrices.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 1,
      "text": "The class of such matrices are proven to be highly expressive and includes a very general class of sparse matrices, including convolution, Fastfood, and permutation matrices.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 2,
      "text": "Experiments are carried in a variety of settings: (i) can nearly replace a series of hand-designed feature extractor, (ii) can perform better than fixed permutation matrices (though parameter count also increased by 10%), (iii) can learn permutations, and (iv) can help reduce parameter count and increase inference speed with a small performance degradation of 1.0 BLEU on machine translation.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 3,
      "text": "This appears to be a solid contribution in terms of both theory and practical use.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 4,
      "text": "As I have not thought much about expressiveness in terms of arithmetic circuits (though I was unable to fully follow or appreciate the derivations, the explanations all seem reasonable)",
      "suffix": "",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 5,
      "text": ", my main comments are regarding experiments.",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 6,
      "text": "Though there are experiments in different domains, each could benefit from some additional ablations, especially to existing parameterizations of structured matrices such as Fastfood, ACDC, and any of the multiple works on permutation matrices and/or orthogonal matrices.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 7,
      "text": "Though Kaleidoscope include these as special cases, it is not clear whether when given the same resources (either memory or computational cost), Kaleidoscope would outperform them.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 8,
      "text": "There is also a matter of ease of training compared to existing approximations or relaxations, e.g. Gumbel-Sinkhorn.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 9,
      "text": "Pros:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 10,
      "text": "- The writing is easy to follow and concise, with contributions and place in the literature clearly stated.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 11,
      "text": "- The Kaleidoscope matrix seem generally applicable, both proven theoretically and shown empirically (experiments are spread across a wide range of domains).",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 12,
      "text": "- The code includes specific C++ and CUDA kernels for computing K matrices, which will be very useful for adaptation.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_replicability",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 13,
      "text": "- The reasoning using arithmetic circuits seems interesting, and the Appendix includes a primer.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 14,
      "text": "Cons:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 15,
      "text": "- For the squeezenet and latent permutation experiments, would be nice if there is a comparison to other parameterizations of permutation matrices, e.g. gumbel-sinkhorn.",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_meaningful-comparison",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 16,
      "text": "- For the speed processing experiment, did you test what the performance would be if K matrix is replaced by a fully connected layer?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 17,
      "text": "This comparison appears in other experiments, but seems to be missing here for some reason.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_meaningful-comparison",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 18,
      "text": "It would lead to better understanding than only comparing to SincNet.",
      "suffix": "\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 19,
      "text": "- The setup for the learning to permute experiment is not as general as it would imply in the main text.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 20,
      "text": "The matrices are constrained so that an actual permutation matrix is always sampled, and the permutation is (had to be?) pretrained to reduce total variation for 100 epochs before jointly trained with the classifier.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 21,
      "text": "Though this is stated very clearly in the Appendix, I hope the authors can also communicate this clearly in the main text as it appears to be a crucial component of the experimental setup.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 22,
      "text": "Comments:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 23,
      "text": "- How easy is it to train with K matrices? Did you have to change optimizer hyperparameter compared to existing baselines?",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 24,
      "text": "- There seems to be some blurring between the meaning of structure (used to motivate K matrices in the introduction) and sparsity (used to analyze K matrices).",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 25,
      "text": "Structure might also include parameter sharing, orthogonality, and maybe other concepts.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HyeOD450tr",
      "sentence_index": 26,
      "text": "For instance, while Kaleidoscope matrices might include the subclass of circulant matrices, can they also capture the same properties or \"inductive bias\" (for lack of better word) as convolutional layers when trained?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_soundness-correctness",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 0,
      "text": "We thank the reviewer for their encouraging feedback and thoughtful comments on our work.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 1,
      "text": "Regarding the permutation learning experiment, in response to the feedback, we have revised the main text to clarify the setup.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          15,
          19,
          20,
          21
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 2,
      "text": "The core of the experiment is the ability to denoise permuted images using some representation of the permutation set.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          19,
          20,
          21
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 3,
      "text": "In order to do this successfully, it is necessary for such a representation to have certain properties such as inducing a distribution over permutations.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          19,
          20,
          21
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 4,
      "text": "We have implemented and added a comparison to the Gumbel-Sinkhorn method (Mena et al., 2018), which is a customized representation for permutations with these properties, and requires similar techniques (unsupervised objective, permutation sampling, etc.) in order to learn the latent structure.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          15,
          19,
          20,
          21
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 5,
      "text": "The ResNet classifier on top can be viewed primarily as a way to evaluate the quality of the learned permutation; both of these representations are capable of learning the right latent structure, with test accuracies of 93.6 (Kaleidoscope) and 92.9 (Gumbel-Sinkhorn) respectively.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          19,
          20,
          21
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 6,
      "text": "The highlight of this experiment is that the K-matrix representation also comes with the requisite properties for this learning pipeline, despite not being explicitly designed for permutation learning.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          19,
          20,
          21
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 7,
      "text": "Regarding comparison to a dense matrix for the speech experiment, in Table 5 (Appendix B.1.2), we compare the use of K-matrices in the raw-features speech model with several other classes of matrices, including dense matrices.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 8,
      "text": "For instance, we find that, while using a trainable dense matrix slightly outperforms just using the fixed FFT (0.3% drop in test phoneme error rate), using a K-matrix instead of a dense matrix yields a further improvement of 0.8% in the phoneme error rate.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 9,
      "text": "Regarding ease of training and hyperparameter tuning, we would like to re-emphasize that for all experiments, all hyperparameters for training were kept the same as those for training the default model architecture, other than those we explicitly mentioned as being tuned.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 10,
      "text": "In particular, we did not modify any hyperparameters (such as number of epochs, optimizer, or learning rate) for the ShuffleNet and DynamicConv experiments.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 11,
      "text": "For the TIMIT speech experiment, we tune only the \u201cpreprocessing layer\u201d learning rate.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 12,
      "text": "This is because the default speech pipeline already uses different learning rates for different portions of the network, so there is no clear choice a priori for the learning rate of the \u201cpreprocessing layer\u201d (note that most methods, including K-matrices, do not seem to be overly sensitive to the choice of this learning rate).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 13,
      "text": "Thus, in these experiments, K-matrices can be used as a drop-in replacement for linear layers without significant tuning effort.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          23
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 14,
      "text": "Regarding structure and sparsity: We use \u201cstructure\u201d in the context of structured matrices to mean matrices with a fast (subquadratic) multiplication algorithm.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24,
          25,
          26
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 15,
      "text": "Structured matrices have a sparse factorization with total NNZ on the order of the number of operations required in the multiplication.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24,
          25,
          26
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 16,
      "text": "This connection was known in the algebraic complexity community, and formalized by De Sa et al. (2018).",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24,
          25,
          26
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 17,
      "text": "Regarding the inductive bias encoded by K-matrices: the building block of K-matrices is a butterfly matrix, which encodes the recursive divide-and-conquer structure of many fast algorithms such as the FFT.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          24,
          25,
          26
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HyeOD450tr",
      "rebuttal_id": "r1eYbUQ_oB",
      "sentence_index": 18,
      "text": "Analyzing the precise effects of the inductive bias imposed by K-matrices is an exciting question for future work.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          24,
          25,
          26
        ]
      ],
      "details": {}
    }
  ]
}