{
  "metadata": {
    "forum_id": "SkeHuCVFDr",
    "review_id": "HJgQNGqJ9r",
    "rebuttal_id": "HyxKnrzGsS",
    "title": "BERTScore: Evaluating Text Generation with BERT",
    "reviewer": "AnonReviewer3",
    "rating": 6,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=SkeHuCVFDr&noteId=HyxKnrzGsS",
    "annotator": "anno3"
  },
  "review_sentences": [
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 0,
      "text": "Paper Contributions",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 1,
      "text": "This paper introduces a new text generation scoring approach using BERT, called BERTScore.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 2,
      "text": "Using BERT embeddings and optionally idf scores, a greedy matching is performed between all reference and candidate words, with cosine similarity between vector representations as the scoring.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 3,
      "text": "From this, a precision, recall and F1 score can be derived.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 4,
      "text": "This notably outperforms BLEU, as well as other metrics, most but not all of the time.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 5,
      "text": "The paper offers a broad range of comparisons and analysis.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_meaningful-comparison",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 6,
      "text": "Decision",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 7,
      "text": "I'm leaning towards accepting the paper on the basis of the following.",
      "suffix": "\n\n",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 8,
      "text": "Strong points taken in consideration:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 9,
      "text": "- Simple, well-motivated metric that uses powerful BERT-style models, without being slow to compute either.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 10,
      "text": "- Good performance empirically on WMT. I'm less convinced on COCO since using the image is fair game there.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 11,
      "text": "- Code is provided, and it is simple and adaptable for future work.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_replicability",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 12,
      "text": "- Experimentation is detailed and reproducible.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 13,
      "text": "Weaker points taken in consideration:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 14,
      "text": "- Work conducted in parallel matches or exceeds the performance of BERTScore.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 15,
      "text": "This shouldn't necessarily be a reason to choose not to publish this work in my opinion, but it should be taken into consideration.",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 16,
      "text": "I like that the authors were open and clear regarding this in their discussion.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 17,
      "text": "- The authors haven't come up with a recommendation for a single configuration of their approach.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 18,
      "text": "In one place they recommend F-BERT without idf, in another they argue for picking and choosing based on context, with little help about how to choose.",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 19,
      "text": "I think practitioners are only going to be willing to switch away from BLEU, for example, if a single one-size-fits-all metric is proposed instead.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 20,
      "text": "I identify this ambiguity between BERTScore versions as an important weakness of the paper.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 21,
      "text": "- It's unclear throughout whether words or wordpieces are the main token being considered.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 22,
      "text": "Most discussion and definitions use \"words\", but in section 3, subsection Token Representation, it appears to be clearly stated that BERTScore uses a BERT model based on word pieces.",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 23,
      "text": "I recommend adjusting the language to be more consistent throughout.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 24,
      "text": "Also, scoring examples with word pieces would be more consistent with this as well, imo.",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 25,
      "text": "Notably, I'm actually unsure whether you compute IDF over words or word pieces, and how this is applied.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 26,
      "text": "- Finally, I found some weaknesses in the Importance Weighting section (though this isn't too important since IDF isn't part of the recommended BERTScore I believe).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 27,
      "text": "The IDF scores would be stronger if they were computed on a bigger in-domain corpus than the gold test set.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 28,
      "text": "This would add extra steps to using BERTScore though and make things more complicated in practice, but this should nevertheless probably be tried, or at least discussed in the paper.",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 29,
      "text": "Also, the plus-one smoothing handles unknown words (or word pieces?) and I'm not sure why. If we're using the test set to compute IDF, and the sentences we're looking at *are* in the test set, then there shouldn't be unknown words and no smoothing is required.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 30,
      "text": "So overall, I still think this deserves publication because it's valuable information for researchers, and the metric itself could be immediately useful to some as well.",
      "suffix": "",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HJgQNGqJ9r",
      "sentence_index": 31,
      "text": "However, the weaknesses mentioned make me hesitate to fully endorse the work.",
      "suffix": "",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 0,
      "text": "Thank you for your comments!",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 1,
      "text": "We agree with R3 that it would be ideal to have a one-size-fits-all metric.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          19
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 2,
      "text": "Unfortunately, the complex landscape of the problem doesn\u2019t permit a single recommendation.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 3,
      "text": "We did our best to conduct a detailed and honest study.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 4,
      "text": "We believe our experiments to be some of the most extensive in this area, and we hope they will contribute to researchers\u2019 understanding of the problem.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 5,
      "text": "It\u2019s important to note, though, that BERTScore is an improvement over the commonly used Bleu across the board.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 6,
      "text": "Our recommendation to use F1, while potentially not optimal in specific cases, generally performs very well and much better than Bleu.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 7,
      "text": "There are largely two sets of options: (1) which of P, R, and F to use; and (2) what model to use.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 8,
      "text": "For (1), as we specify, F-BERT is a reliable metric for MT.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 9,
      "text": "For (2), Roberta-Large performs consistently well for to-English language pairs.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 10,
      "text": "The results are less conclusive for from-English language pairs.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 11,
      "text": "BERTScore computed with Multilingual-BERT is better than most existing metrics except on a few low-resource languages.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          17,
          18,
          19,
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 12,
      "text": "We have updated the paper with these recommendations in Section 7.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          17,
          18,
          19,
          20
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 13,
      "text": "We are using word pieces in all experiments, and we compute IDF using word pieces.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          21,
          22,
          23,
          24,
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 14,
      "text": "We updated the paper to make this clear in Section 3, under Importance Weighting.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          21,
          22,
          23,
          24,
          25
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 15,
      "text": "Regarding unknown words handling, we computed the IDF on the reference sentences in the test set.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          21,
          22,
          23,
          24,
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 16,
      "text": "This ensures that the IDF is the same for all MT systems that are tested.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          21,
          22,
          23,
          24,
          25
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 17,
      "text": "The candidate sentences generated by MT systems may contain words that never appear in the test set.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          26,
          27,
          28,
          29
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 18,
      "text": "We apply plus-one smoothing to handle such words.",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          26,
          27,
          28,
          29
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 19,
      "text": "Following your suggestion, we further studied idf scoring.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          26,
          27,
          28,
          29
        ]
      ],
      "details": {
        "request_out_of_scope": false
      }
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 20,
      "text": "We computed idf scores on the monolingual English corpus released by WMT18 and experimented with BERTScore computed with the Roberta-large model.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          26,
          27,
          28,
          29
        ]
      ],
      "details": {
        "request_out_of_scope": false
      }
    },
    {
      "review_id": "HJgQNGqJ9r",
      "rebuttal_id": "HyxKnrzGsS",
      "sentence_index": 21,
      "text": "We have found that this leads to worse performance, likely because of the domain shift.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          26,
          27,
          28,
          29
        ]
      ],
      "details": {}
    }
  ]
}