{
  "metadata": {
    "forum_id": "Byg1v1HKDB",
    "review_id": "BJlqUqW3tH",
    "rebuttal_id": "H1gU-PuqiS",
    "title": "Abductive Commonsense Reasoning",
    "reviewer": "AnonReviewer3",
    "rating": 6,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=Byg1v1HKDB&noteId=H1gU-PuqiS",
    "annotator": "anno2"
  },
  "review_sentences": [
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 0,
      "text": "Summary: the paper proposes a dataset for abductive language inference and generation.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 1,
      "text": "The dataset is generated by humans, while the test set is adversarially selected using BERT.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 2,
      "text": "The paper evaluates popular deep learning models on the dataset and observes shortcomings of deep learning on this task.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 3,
      "text": "Comments: overall, the problem of abductive inference and abductive generation in language is very interesting and important.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 4,
      "text": "This dataset seems valuable. And the paper is simple and well-written.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 5,
      "text": "Concerns: I find the claim on deep networks kind of irresponsible.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 6,
      "text": "1. The dataset is adversarially filtered using BERT and GPT, which puts deep learning models at a huge disadvantage. After all, the paper says BERT scores 88% before the dataset is attacked.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 7,
      "text": "2. The human score of 91.4% is based on majority vote, which should be compared with an ensemble of deep learning predictions.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_meaningful-comparison",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 8,
      "text": "To compare, the authors should use the average human score.",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_meaningful-comparison",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 9,
      "text": "3. The ground truth is selected by humans.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 10,
      "text": "At a high level, the main difficulty of abduction is searching the exponentially large space of hypotheses.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 11,
      "text": "Formulating the abduction task as a (binary) classification problem is less interesting.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 12,
      "text": "The generative task is a better option.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJlqUqW3tH",
      "sentence_index": 13,
      "text": "Decision: despite the seemingly unfair comparison, this task is novel. I vote for weak accept.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 0,
      "text": "We thank AnonReviewer3 for the encouraging comments about the importance of the proposed abductive inference and generation tasks and about the value of our proposed dataset.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_accept-praise",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 1,
      "text": "We address the main concerns individually below:",
      "suffix": "\n\n\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 2,
      "text": "Adversarially filtering using BERT and GPT gives deep learning models a disadvantage:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 3,
      "text": "While BERT achieved high performance on the originally collected dataset, several recent studies [1][2][3][4] have found annotation artifacts in crowdsourced data that inadvertently leak information about the target label.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 4,
      "text": "This subsequently leads to overestimation of the performance of AI systems on end tasks.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 5,
      "text": "Our adversarial filtering (AF) algorithm aims to address the problem of overestimation of performance.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 6,
      "text": "Although AF targets GPT/BERT, human performance on the resulting AF-filtered dataset is still high.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 7,
      "text": "The significant gap between human and BERT performance leaves scope for inventing new methods for abductive reasoning.",
      "suffix": "\n\n\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 8,
      "text": "Ensemble of BERT models:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 9,
      "text": "An ensemble of three BERT models achieves an accuracy of 68.9%, very close to the 68.6% of a single model.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 10,
      "text": "Average score of human:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 11,
      "text": "The average score of human annotations is 89.4%.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 12,
      "text": "This is directly comparable with the BERT-Ft [Fully Connected] model\u2019s performance of 68.6% in Table 1.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 13,
      "text": "Re. Ground Truth:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 14,
      "text": "The ground truth is assigned based on whether a hypothesis was collected during the plausible (Appendix A1 Task1) or implausible (Appendix A1 Task2) phase of the data collection procedure.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 15,
      "text": "To measure human performance, we had three annotators select the correct hypothesis and measured human performance as the accuracy of their majority vote.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 16,
      "text": "Please let us know if this answers your question. If not, could you please clarify your question?",
      "suffix": "\n\n\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_followup",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 17,
      "text": "Generative task vs classification:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 18,
      "text": "We completely agree.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 19,
      "text": "While the generative task is more general and much more interesting, the challenge of evaluating generations is significant, particularly for this task.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 20,
      "text": "This is because there can be multiple distinct plausible explanations for a given pair of observations.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 21,
      "text": "Consider the following example:",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 22,
      "text": "O1: Kelly and her friend wanted to take a train to the city.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 23,
      "text": "O2: They had to wait for another one.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 24,
      "text": "Plausible explanations:",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 25,
      "text": "1. They read the timetable incorrectly and arrived at the station just after a train had left.",
      "suffix": "\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 26,
      "text": "2. The train was full.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 27,
      "text": "Both explanations are plausible and explain the observations, but automated evaluation metrics are not reliable enough to capture this because they rely on surface-level similarities.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 28,
      "text": "To make progress on the novel abductive reasoning task while keeping evaluation easy, we additionally introduce a discriminative version of the task.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 29,
      "text": "Nonetheless, we agree that in its most general form, there could be any number of observations and models should be required to generate explanatory hypotheses in natural language (alpha-NLG task).",
      "suffix": "\n\n\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 30,
      "text": "[1] Gururangan et al. Annotation artifacts in natural language inference data.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 31,
      "text": "[2] Poliak et al. Hypothesis only baselines in natural language inference.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 32,
      "text": "[3] Tsuchiya et al. Performance impact caused by hidden bias of training data for recognizing textual entailment.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJlqUqW3tH",
      "rebuttal_id": "H1gU-PuqiS",
      "sentence_index": 33,
      "text": "[4] Sakaguchi et al. WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    }
  ]
}