{
  "metadata": {
    "forum_id": "BkgrBgSYDS",
    "review_id": "S1g74Sfq9H",
    "rebuttal_id": "rJl51UQ_iH",
    "title": "Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps",
    "reviewer": "AnonReviewer4",
    "rating": 8,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=BkgrBgSYDS&noteId=rJl51UQ_iH",
    "annotator": "anno2"
  },
  "review_sentences": [
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 0,
      "text": "Summary",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 1,
      "text": "The authors introduce kaleidoscope matrices (K-matrices) and propose to use them as a substitute for structured matrices arising in ML applications (e.g. circulant matrix used for the convolution operation).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 2,
      "text": "The authors prove that K-matrices are expressive enough to capture any structured matrix with near-optimal space and matvec time complexity.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 3,
      "text": "The authors demonstrate that learnable K-matrices achieve similar metrics compared to hand-crafted features on speech processing and computer vision tasks, can learn from permuted images, achieve performance close to a CNN trained on unpermuted images and demonstrate the improvement of inference speed of a transformer-based architecture for a machine translation task.",
      "suffix": "\n\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 4,
      "text": "Review",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 5,
      "text": "The overall quality of the paper is high.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_positive"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 6,
      "text": "The main contribution of the paper is the introduction of a family of matrices called kaleidoscope matrices (or K-matrices) which can be represented as a product of block-diagonal matrices of a special structure.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 7,
      "text": "Because of the special structure, the family allows near-optimal time matvec operations with near-optimal space complexity for structured matrices which are commonly used in deep architectures.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 8,
      "text": "The proposed approach is novel.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 9,
      "text": "It gives a new characterization of sparse matrices with optimal space complexity up to a logarithmic term.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 10,
      "text": "Moreover, the proposed characterization is able to learn any structured matrix and matvec time complexity of the K-matrix representation is near-optimal matvec time complexity of the structured matrix.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 11,
      "text": "Even though in the worst-case complexity is not optimal, the authors argue that for matrices that are commonly used in machine learning architectures (e.g. circulant matrix in a convolution layer) the characterization is optimal.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 12,
      "text": "This results in a new differentiable layer based on a K-matrix that can be trained with the rest of an architecture using standard stochastic gradient methods.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 13,
      "text": "However, it is worth noting that the reviewer is not an expert in the field, and it is hard for him to compare the proposed approach with previous work.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 14,
      "text": "The paper is generally easy to follow.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 15,
      "text": "Even though the introduction of K-matrices requires a lot of definitions, they are presented clearly and Figure 1 helps to understand the concept of K-matrices.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 16,
      "text": "The experimental pipeline is also clear.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 17,
      "text": "Given the special structure of the family, the reviewer might guess that having K-matrices can slow down the training, i.e. it might require more epochs to achieve the reported results compared to baselines.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 18,
      "text": "Providing training plots might increase the quality of the paper.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 19,
      "text": "The experimental results are convincing.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_positive"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 20,
      "text": "First, the authors show that K-matrices can be used instead of a handcrafted MFSC featurization in an LSTM-based architecture on the TIMIT speech recognition benchmark with only a 0.4% loss of phoneme error rate.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 21,
      "text": "Then, the authors evaluate K-matrices on ImageNet dataset.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 22,
      "text": "In order to do so, they compare a lightweight ShuffleNet architecture which uses a handcrafted permutation layer to the same architecture but with a learnable K-matrix instead of the permutation layer.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 23,
      "text": "The authors demonstrate the 5% improvement of accuracy over the ShuffleNet with 0.46M parameters with only 0.05M additional parameters of the K-matrix and the 1.2% improvement of accuracy over the ShuffleNet with 2.5M parameters with only 0.2M additional parameters of the K-matrix.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 24,
      "text": "Next, the authors show that K-matrices can be used to train permutations in image classification domains.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 25,
      "text": "In order to demonstrate so, they take the Permuted CIFAR-10 dataset and ResNet-18 architecture, insert a trainable K-matrix at the beginning of the architecture and compare against ResNet-18 with an inserted FC-layer (attempting to learn the permutation as well) and ResNet-18 trained on the original, unpermuted CIFAR-10 dataset.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 26,
      "text": "With K-matrix, the authors achieve a 7.9% accuracy improvement over FC+ResNet-18 and only a 2.4% accuracy drop compared to ResNet-18 trained on the original CIFAR-10.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 27,
      "text": "Finally, the authors demonstrate that K-matrices can be used instead of the decoder\u2019s linear layers in a Transformer-based architecture on the IWSLT-14 German-English translation benchmark which allows obtaining 30% speedup of the inference using a model with 25% fewer parameters with 1.0 drop of BLEU score.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 28,
      "text": "Overall, the analysis and the empirical evaluations suggest that K-matrices can be a practical tool in modern deep architectures with a variety of potential benefits and tradeoffs between a number of parameters, inference speed and accuracy, and ability to learn complex structures (e.g. permutations).",
      "suffix": "\n\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 29,
      "text": "Improvements",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 30,
      "text": "1. Even though K-matrices are aimed at structured matrices, it would be curious either to empirically compare K-matrices to linear transformations in fully-connected networks (i.e. dense matrices) or to provide some theoretical analysis.",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "S1g74Sfq9H",
      "sentence_index": 31,
      "text": "2. Section 3.3 argues that K-matrices allow to obtain an improvement of inference speed, however, providing the results of convergence speed (e.g. training plots with a number of epochs) will allow a better understanding of the proposed approach and will improve the quality of the paper.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 0,
      "text": "We appreciate the reviewer\u2019s positive comments about our work.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 1,
      "text": "Regarding the convergence and speed of training, we would like to stress that all hyperparameters for training were kept the same as those for training the default model architecture, other than those we explicitly mentioned as being tuned (e.g. learning rate for the speech experiment).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          17,
          18,
          31
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 2,
      "text": "In particular, for all experiments, the number of epochs is the same for both the baseline approach and the K-matrix approach.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          17,
          18,
          31
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 3,
      "text": "Additionally, for the speech preprocessing and ShuffleNet experiments, we compare the total wall-clock training time of our K-matrix approach to that of the baseline approach, in both cases finding that the training time required by our approach is at most 20% longer than that of the baseline approach.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          17,
          18,
          31
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 4,
      "text": "In our updated revision, we also include the training time comparison for the DynamicConv model in Appendix B.4.2 (in this case, the modified model with K-matrices actually trains slightly faster than the baseline).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          17,
          18,
          31
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 5,
      "text": "We agree with the reviewer that a training plot can help provide a better understanding of how our proposed approach performs, and therefore have included an example plot (for the ShuffleNet experiment) in our updated revision (in Appendix B.2.3).",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          17,
          18,
          31
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 6,
      "text": "Regarding empirical comparisons to dense matrices, in Table 5 (Appendix B.1.2), we compare the use of K-matrices in the raw-features speech model with several other classes of matrices, including dense matrices.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          30
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 7,
      "text": "We find that, while using a trainable dense matrix slightly outperforms just using the fixed FFT (0.3% drop in test phoneme error rate), using a K-matrix instead of a dense matrix yields a further improvement of 0.8% in the phoneme error rate.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          30
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 8,
      "text": "Another empirical comparison of K-matrices and dense matrices is in Section 3.3, in which we replace the linear layers in the decoder of a DynamicConv model with K-matrices; these linear layers are by default dense (fully-connected) matrices.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          30
        ]
      ],
      "details": {}
    },
    {
      "review_id": "S1g74Sfq9H",
      "rebuttal_id": "rJl51UQ_iH",
      "sentence_index": 9,
      "text": "Theoretically, in Lemma E.3 we show that arbitrary dense matrices are contained in the BB* hierarchy \u2013 in particular, that any n x n matrix is in (BB*)^{2n-2}, which implies that its K-matrix representation requires at most (4n log n)*(2n-2) = O(n^2 log n) parameters and thus is tight up to a logarithmic factor in n.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          30
        ]
      ],
      "details": {}
    }
  ]
}