{
  "metadata": {
    "forum_id": "Byl8hhNYPS",
    "review_id": "rJe9E2ITYr",
    "rebuttal_id": "rJgabTX4sr",
    "title": "Neural Machine Translation with Universal Visual Representation",
    "reviewer": "AnonReviewer1",
    "rating": 6,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=Byl8hhNYPS&noteId=rJgabTX4sr",
    "annotator": "anno13"
  },
  "review_sentences": [
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 0,
      "text": "The authors propose to augment NMT with a grounded inventory of images.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 1,
      "text": "The intuition is clear and the premise is very tempting.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 2,
      "text": "The key architectural choice is to allow the transformer to use language embeddings to attend into a topic-image lookup table.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 3,
      "text": "The proportion is learned to balance how much signal comes from each source.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 4,
      "text": "Figure 4, attempts to investigate the importance of this sharing and its effects on performance.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 5,
      "text": "While reviewing this paper I went back and read the EN-DE evaluation data for the last few years trying to see how often I could reason that images would help and I came up severely lacking.",
      "suffix": "",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 6,
      "text": "For example, \"The old system of private arbitration courts is off the table\" from DE-EN 2016 Dev doesn't seem like it should benefit from this architecture.",
      "suffix": "",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 7,
      "text": "It's then hard for me to square that with the +VR gains seen throughout this work on non-grounded datasets.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 8,
      "text": "I trust that the authors did in fact achieve these results but I cannot figure out how or why.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 9,
      "text": "This is all further confused by the semantic topics used for clustering the images which ignores stop words and therefore spatial relations or any grammatical nuances.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 10,
      "text": "In contrast, it does make sense that Multi30K would benefit from this architecture.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 11,
      "text": "As a minor note, were different feature extractors compared?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 12,
      "text": "The recent flurry of papers on multimodal transformers indicate that deeper resnet stacks correspond to improved downstream performance.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "rJe9E2ITYr",
      "sentence_index": 13,
      "text": "Is that also true in this domain?",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 0,
      "text": "Thanks for your insightful comments.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 1,
      "text": "1. How or why is the benefit.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 2,
      "text": "This comment is insightful and we also considered about it.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 3,
      "text": "Intuitively, we would easily fall into the connections between each sentence and image.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 4,
      "text": "However, it is nearly impossible to pair sentence with images with completely the same meaning all the time.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 5,
      "text": "According to our investigation, we conclude that the major contribution would be more effective contextualized sentence encoding for better representation from the visual clue combination instead of single image enhancement for encoding each individual sentence or word.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 6,
      "text": "According to Distributional Hypothesis (Harris et al., 1954) which states that \u201cwords that occur in similar contexts tend to have similar meanings\u201d, we are inspired to extend the concept in multimodal world, \u201cthe sentences with similar meanings would be likely to pair with similar even the same images\u201d, where the consistent images (with similar topic) could play the role of topic or type clues for similar sentence modeling.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 7,
      "text": "For your example, the topic words are {private, courts, table}, which can be paired with relevant images and other sentences with the same (similar) topics will be paired with the same (similar) group of images.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 8,
      "text": "This is also very similar to the idea of word embedding by taking each image as a \u201cword\u201d.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 9,
      "text": "Because we use the average pooled output of ResNet, each image is represented as 2400d vector.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 10,
      "text": "For all the 29,000 images, we have an embedding layer with size (29000, 2400).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 11,
      "text": "The \u201ccontent\u201d of the image is just like the embedding initialization.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 12,
      "text": "It indeed makes effects, but the capacity of the neural network is not up to it.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 13,
      "text": "In contrast, the mapping from text word to the index in the word embedding is critical.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 14,
      "text": "Similarly, the mapping of sentence to image in image embedding would be essential, i.e., the similar sentences (with the same topic words) tend to map the similar images.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 15,
      "text": "To verify the hypothesis, we shuffle the image embeddings but keep the lookup table, to only exchange the features of each image but maintain the sentence-image mapping.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 16,
      "text": "Unsurprisingly, the BLEU score (EN-RO) is 33.53, which is very close to the reported one (33.78).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 17,
      "text": "In addition, we randomly initialize the image embedding instead of ResNet, the result is 33.28.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 18,
      "text": "In comparison, if we randomly retrieve unrelated images to break the lookup, the result is 32.14.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 19,
      "text": "These results verify the necessity of the lookup table.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 20,
      "text": "We have added a detailed discussion in the paper (please see Analysis 6.1).",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 21,
      "text": "We believe this finding would be suggestive for the future research since most previous work focused on the content of the image itself.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 22,
      "text": "As a different research line, we highlight the consistency among the mono-modality to bridge the gap of language and image modeling.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          5,
          6,
          7,
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 23,
      "text": "2. Why stop words are ignored.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 24,
      "text": "According to the explanation above, we think the spatial relations or grammatical nuances would not be so important in this task if we take the images as topic guidance.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 25,
      "text": "Ignoring the stopwords can help us get rid of the disturbance of unnecessary high-frequency words (such as function words) being the topic, as the standard practice for TF-IDF topic extraction.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 26,
      "text": "3. Comparison of different feature extractors.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 27,
      "text": "Yes. We compared with ResNet101 and ResNet152 on EN-RO.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 28,
      "text": "The BLEU scores are 33.63 and 33.87.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "rJe9E2ITYr",
      "rebuttal_id": "rJgabTX4sr",
      "sentence_index": 29,
      "text": "It seems deeper ResNet indeed gives better results but the difference is not very significant.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    }
  ]
}