{
  "metadata": {
    "forum_id": "H1xwNhCcYm",
    "review_id": "BJe__C5d3Q",
    "rebuttal_id": "ByxeVRhY0m",
    "title": "Do Deep Generative Models Know What They Don't Know?",
    "reviewer": "AnonReviewer2",
    "rating": 6,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=H1xwNhCcYm&noteId=ByxeVRhY0m",
    "annotator": "anno3"
  },
  "review_sentences": [
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 0,
      "text": "This paper displays an occurrence of density models assigning higher likelihood to out-of-distribution inputs compared to the training distribution.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 1,
      "text": "Specifically, density models trained on CIFAR10 have higher likelihood on SVHN than CIFAR10.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 2,
      "text": "This is an interesting observation because the prevailing assumption is that density models can distinguish inliers from outliers.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_originality",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 3,
      "text": "However, this phenomenon is not encountered when comparing MNIST and NotMNIST.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_meaningful-comparison",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 4,
      "text": "The SVHN/CIFAR10 phenomenon has also been shown in concurrent work [1].",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 5,
      "text": "Given that you observed that SVHN has higher likelihood on all three model types (PixelCNN, VAE, Glow), why investigate a component specific to just flow-based models (the volume term)?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_result",
      "aspect": "asp_meaningful-comparison",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 6,
      "text": "It seems reasonable to suspect that the phenomenon may be due to a common cause in all three model types.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 7,
      "text": "For instance, the experiments seem to indicate that generalizing density estimation from CIFAR training set to CIFAR test set is likely challenging and thus the models underfit the true data distribution, resulting in the simpler dataset (SVHN) having higher likelihood.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 8,
      "text": "Given the title of the paper, it would have been nice if this paper explored more than just MNIST vs NotMNIST and SVHN vs CIFAR10, so that the readers can gain a better feel for when generative models will be able to detect outliers.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 9,
      "text": "For instance, a scenario where the data statistics (pixel means and variances) are nearly equivalent for both datasets would be interesting.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 10,
      "text": "The second order analysis is good but it seems to come down to just a measure of the empirical variances of the datasets.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 11,
      "text": "This paper is well written.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 12,
      "text": "I think the presentation of this density modelling shortcoming is a good contribution but leaves a bit to be desired.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 13,
      "text": "[1] Choi, H. and Jang, E. Generative Ensembles for Robust Anomaly Detection.",
      "suffix": "",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 14,
      "text": "https://arxiv.org/abs/1810.01392",
      "suffix": "\n\n\n",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 15,
      "text": "Pros:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 16,
      "text": "- Interesting observation of density modelling shortcoming",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 17,
      "text": "- Clear presentation",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_positive"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 18,
      "text": "Cons:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 19,
      "text": "- Lack of a strong explanation for the results or a solution to the problem",
      "suffix": "\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "BJe__C5d3Q",
      "sentence_index": 20,
      "text": "- Lack of an extensive exploration of datasets",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_replicability",
      "polarity": "pol_negative"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 0,
      "text": "Thanks again, Reviewer #2, for your insightful feedback.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 1,
      "text": "We respond to your other comments below.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 2,
      "text": "1.  \u201cWhy investigate a component specific to just flow-based models (the volume term)? It seems reasonable to suspect that the phenomenon may be due to a common cause in all three model types.\u201d",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          5,
          6
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 3,
      "text": "See general response #3.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 4,
      "text": "2.  \u201cFor instance, the experiments seem to indicate that generalizing density estimation from CIFAR training set to CIFAR test set is likely challenging and thus the models underfit the true data distribution, resulting in the simpler dataset (SVHN) having higher likelihood.\u201d",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 5,
      "text": "We do not believe our models are necessarily underfit.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 6,
      "text": "In fact, we found that Glow had a tendency to *overfit,* and that one must carefully set Glow\u2019s l2 penalty and choose its scale parametrization (exp vs sigmoid, see Appendix D) in order to prevent it from doing so.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 7,
      "text": "We thought this overfitting to the training data could be a reason for the phenomenon and therefore we tuned our implementations to have reasonable generalization.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 8,
      "text": "3.  \u201cIt would have been nice if this paper explored more than just MNIST vs NotMNIST and SVHN vs CIFAR10, so that the readers can gain a better feel for when generative models will be able to detect outliers.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          8
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 9,
      "text": "For instance, a scenario where the data statistics (pixel means and variances) are nearly equivalent for both datasets would be interesting.\u201d",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 10,
      "text": "See general response #1 in regards to data sets and additional results.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 11,
      "text": "Thank you for the suggestion of looking at data sets with similar statistics.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 12,
      "text": "We do this, in a way, with our second order analysis and the \u2018gray-ing\u2019 experiment in Figure 5 (b) (formerly Figure 6 (b) in the original draft).",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 13,
      "text": "Gray CIFAR-10 (blue dotted line) nearly overlaps with original SVHN (red solid line) in terms of their log p(x) evaluations.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 14,
      "text": "Figure 12 (formerly Figure 13) then shows the latent (empirical) distribution of the gray images, and we see that the gray CIFAR-10 latent variables nearly overlap with the SVHN latent variables.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 15,
      "text": "This is to be expected though, given the overlapping p(x) histograms, since the probability assigned by CV-Glow (in comparison to other inputs) is fully determined by the position in latent space.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          20
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 16,
      "text": "4.  \u201cThe second order analysis is good but it seems to come down to just a measure of the empirical variances of the datasets.\u201d",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "BJe__C5d3Q",
      "rebuttal_id": "ByxeVRhY0m",
      "sentence_index": 17,
      "text": "See general response #2.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    }
  ]
}