{
  "metadata": {
    "forum_id": "rylT0AVtwH",
    "review_id": "H1ly2z2RFS",
    "rebuttal_id": "HJlxtt_Osr",
    "title": "Learning from Partially-Observed Multimodal Data with Variational Autoencoders",
    "reviewer": "AnonReviewer2",
    "rating": 3,
    "conference": "ICLR2020",
    "permalink": "https://openreview.net/forum?id=rylT0AVtwH&noteId=HJlxtt_Osr",
    "annotator": "anno2"
  },
  "review_sentences": [
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 0,
      "text": "Summary:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 1,
      "text": "This paper proposes to impute multimodal data when certain modalities are present.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 2,
      "text": "The authors present a variational selective autoencoder model that learns only from partially-observed data.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 3,
      "text": "VSAE is capable of learning the joint",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 4,
      "text": "distribution of observed and unobserved modalities as well as the imputation mask, resulting in a model that is suitable for various down-stream tasks including data generation and imputation",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 5,
      "text": ".",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 6,
      "text": "The authors evaluate on both synthetic high-dimensional and challenging low-dimensional multimodal datasets and show improvement over the state-of-the-art data imputation models.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 7,
      "text": "Strengths:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 8,
      "text": "- This is an interesting paper that is well written and motivated.",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 9,
      "text": "- The authors show good results on several multimodal datasets, improving upon several recent works in learning from missing multimodal data.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 10,
      "text": "Weaknesses:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 11,
      "text": "- How multimodal are the datasets provided by UCI? It seems like they consist of different tabular datasets with numerical or categorical variables, but it was not clear what the modalities are (each variable is a modality?) and how correlated the modalities are. If they are not correlated at all and share no joint information I'm not sure how these experiments can represent multimodal data.",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_soundness-correctness",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 12,
      "text": "- Some of the datasets the authors currently test on are quite toy, especially for the image-based MNIST and SVHN datasets.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 13,
      "text": "They should consider larger-scale datasets including image and text-based like VQA/VCR, or video-based like the datasets in (Tsai et al., ICLR 2019).",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 14,
      "text": "- In terms of prediction performance, the authors should also compare to [1] and [2] which either predict the other modalities completely during training or use tensor-based methods to learn from noisy or missing time-series data.",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_meaningful-comparison",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 15,
      "text": "- One drawback is that this method requires the mask during training. How can it be adapted for scenarios where the mask is not present? In other words, we only see multiple modalities as input, but we are not sure which are noisy and which are not?",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 16,
      "text": "[1] Pham et al. Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities, AAAI 2019",
      "suffix": "\n",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 17,
      "text": "[2] Liang et al. Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization, ACL 2019",
      "suffix": "\n\n",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 18,
      "text": "### Post rebuttal",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 19,
      "text": "#",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 20,
      "text": "##",
      "suffix": "\n",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "H1ly2z2RFS",
      "sentence_index": 21,
      "text": "Thank you for your detailed answers to my questions.",
      "suffix": "",
      "review_action": "arg_social",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 0,
      "text": "(1) Multimodal setting:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 1,
      "text": "We apologize for not describing experimental settings clearly.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 2,
      "text": "In general, we believe multi-modal data is more general than simply image-text or video-text pairs.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 3,
      "text": "By treating tabular data as multi-modal data as well (with each attribute as one modality), we show that VSAE provides a principled way to perform imputation and is capable of generalizing to more data families.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 4,
      "text": "We report additional multimodal dataset experiments in point (3) below.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          11
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 5,
      "text": "(2) Prediction and Representation learning:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 6,
      "text": "We considered conducting these experiments during the rebuttal, but the code for neither paper has been released by its authors.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-request",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {
        "request_out_of_scope": false
      }
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 7,
      "text": "We agree deep latent variable models explicitly model the data distribution and provide a natural way for representation learning, but in our paper we evaluate the model from the perspective of imputation and generation.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          14
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 8,
      "text": "(3) Additional experiments:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 9,
      "text": "We added imputation experiments on multimodal datasets (see Appendix C.5): CMU-MOSI/ICT-MMMO (Tsai et al. 2019), FashionMNIST/MNIST (Wu et al. 2018).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          12
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 10,
      "text": "Each dataset contains two or three modalities.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 11,
      "text": "VSAE outperforms the other baselines on multimodal datasets under the partially-observed setting.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          12
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 12,
      "text": "(4) Require mask during training:",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 13,
      "text": "In our experiments, the binary mask is always fully-observed as is the nature of partially-observed data.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 14,
      "text": "A mask simply indicates which modalities are observed and which are not.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 15,
      "text": "We agree that it is very interesting to design a model with a partially-observed or even unobserved mask.",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 16,
      "text": "However, it is beyond the scope of this work and we will consider it in future work.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          15
        ]
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 17,
      "text": "[1] Wu et al. Multimodal Generative Models for Scalable Weakly-Supervised Learning, NeurIPS 2018.",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "H1ly2z2RFS",
      "rebuttal_id": "HJlxtt_Osr",
      "sentence_index": 18,
      "text": "[2] Tsai et al. Learning Factorized Multimodal Representations, ICLR 2019.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    }
  ]
}