{
  "metadata": {
    "forum_id": "rylwJxrYDS",
    "review_id": "rJxLRp2s9H",
    "rebuttal_id": "ByeQLeYhsH",
    "title": "vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations",
    "reviewer": "AnonReviewer4",
    "rating": 8,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=rylwJxrYDS&noteId=ByeQLeYhsH",
    "annotator": "anno2"
  },
  "review_sentences": [
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 0,
      "text": "This paper presents a method for unsupervised representation learning of speech.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 1,
      "text": "The idea is to first learn discrete representation (vector quantization is done by Gumbel softmax or k-means) from audio samples with contrastive prediction coding type objective, and then perform BERT-style pre-training (borrowed from NLP).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 2,
      "text": "The BERT features are used as inputs to ASR systems, rather than the usual log-mel features.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 3,
      "text": "The idea, which combines those of previous work (wav2vec and BERT) synergetically, is intuitive and clearly presented, significant improvements over log-mel and wav2vec were achieved on ASR benchmarks WSJ and TIMIT.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 4,
      "text": "Based on these merits, I suggest this paper to be accepted.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_positive"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 5,
      "text": "On the other hand, I would suggest directions for investigation and improvements as follows.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 6,
      "text": "1. While I understand that vector quantization makes the use of NLP-style BERT-training possible (as the inputs to NLP models are discrete tokens),  there are potential disadvantages as well.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 7,
      "text": "One observation from the submission is that the token set may need to very large (from tens of thousands to millions) for the system to work well, making the BERT training computationally expensive (I noticed that the BERT model is trained on 128 GPUs)",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 8,
      "text": ".",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 9,
      "text": "Also, without BERT pre-training, using directly the discrete tokens seems to consistently give worse performance for ASR.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 10,
      "text": "I think some more motivations or explorations (what kind of information did BERT learn) are needed to understand why that is the case.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 11,
      "text": "2. Besides the computational expensive-ness of the three-step approach (vector quantization, BERT, acoustic model training), the combined model complexity is large because these steps do not share neural network architecture.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 12,
      "text": "A more economical approach is to use BERT-trained model as initialization for acoustic model training, which is the classical way how RBMs pre-training were used in ASR.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_soundness-correctness",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 13,
      "text": "3. One concern I have with discrete representation is how robust they are wrt different dataset.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 14,
      "text": "The ASR datasets used in this work are relatively clean (but there does exists domain difference between them).",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 15,
      "text": "It remains to see how the method performs with more acoustically-challenging speech data, and how universally useful the learned features are (as is the case for BERT in NLP).",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 16,
      "text": "4. Another curious question is whether the features would still provide as much improvement when a stronger ASR system than AutoSeg (e.g., Lattice-free MMI) is used.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "rJxLRp2s9H",
      "sentence_index": 17,
      "text": "Overall, while I think the computational cost of the proposed method is high, rendering it less practical at this point, I believe the approach has potential and the result obtained so far is already significant.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 0,
      "text": "Thank you for your fruitful comments.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 1,
      "text": ">> 1.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 2,
      "text": "[...]",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 3,
      "text": "One observation from the submission is that the token set may need to be very large (from tens of thousands to millions) for the system to work well, making the BERT training computationally expensive [...] I think some more motivation or exploration (what kind of information did BERT learn) is needed to understand why that is the case.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 4,
      "text": "Our BERT vocabulary sizes (13.5k for the gumbel version and 23k for the k-means version) compare favorably to the setups commonly used in NLP where vocabularies are double or triple of our sizes.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 5,
      "text": "We agree that it would be interesting to perform an in-depth analysis on the embeddings learned by BERT and we will investigate this in future work.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 6,
      "text": "Here we focus on a new quantization method evaluated via downstream performance in phone and speech recognition settings by employing models that worked well (and were extensively tuned) in NLP contexts.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          6,
          7,
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 7,
      "text": ">> 2.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 8,
      "text": "A more economical approach is to use BERT-trained model as initialization for acoustic model training, which is the classical way how RBMs pre-training were used in ASR.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 9,
      "text": "Yes, this is an interesting avenue for future work!",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 10,
      "text": "We did not follow this direction due to two motivations: first, our aim is to contribute a new quantization scheme for audio data that is trained to predict the context in a self-supervised way.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 11,
      "text": "Second, we wanted to show that good performance can be achieved with discretized audio on actual speech tasks.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 12,
      "text": ">> 3. One concern I have with discrete representation is how robust they are wrt different dataset.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          13,
          14,
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 13,
      "text": "We agree that an ablation study on robustness of the embeddings across different datasets would be very interesting.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          13,
          14,
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 14,
      "text": "Here we are mostly focusing on relatively clean data (WSJ, TIMIT, Librispeech) following the original wav2vec paper but we would be interested in exploring robustness in the future.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          13,
          14,
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 15,
      "text": "However we note that representations transfer at least well across datasets from the \u201cclean speech\u201d domain: vq-wav2vec and BERT is only trained on Librispeech and never tuned on TIMIT/WSJ.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          13,
          14,
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 16,
      "text": ">> 4.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 17,
      "text": "Another curious question is whether the features would still provide as much improvement when a stronger ASR system than AutoSeg (e.g., Lattice-free MMI) is used.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 18,
      "text": "The original wav2vec paper (Schneider et al., 2019) reports better results than LF-MMI on the WSJ benchmark, however, the two setups are not strictly comparable.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 19,
      "text": "In some sense, the LF-MMI result has an edge because it is based on a phoneme-based ASR system which is typically stronger than the character-based ASR system used with wav2vec.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJxLRp2s9H",
      "rebuttal_id": "ByeQLeYhsH",
      "sentence_index": 20,
      "text": "We agree that evaluation on stronger baselines is an important future direction though.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          16
        ]
      ],
      "details": {}
    }
  ]
}