{
  "metadata": {
    "forum_id": "HyEl3o05Fm",
    "review_id": "r1gzmeTDnm",
    "rebuttal_id": "rJexhWRXCm",
    "title": "Stochastic Adversarial Video Prediction",
    "reviewer": "AnonReviewer1",
    "rating": 5,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=HyEl3o05Fm&noteId=rJexhWRXCm",
    "annotator": "anno3"
  },
  "review_sentences": [
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 0,
      "text": "Summary:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 1,
      "text": "The authors present a video prediction model called SAVP that combines a Variational Auto-Encoder (VAE) model with a Generative Adversarial Network (GAN) to produce more realistic and diverse future samples.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 2,
      "text": "Deterministic models and certain loss functions such as Mean Squared Error (MSE) will produce",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 3,
      "text": "blurry results when making uncertain predictions.",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 4,
      "text": "GAN predictions on the other hand usually are more visually appealing but often lack diversity, producing just a few modes.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 5,
      "text": "The authors propose to combine a VAE model with a GAN objective to combine their strengths: good quality samples (GAN) that cover multiple possible futures (VAE).",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 6,
      "text": "Strengths:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 7,
      "text": "[+] GANs are notoriously unstable to train, especially for video.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 8,
      "text": "The authors formulate a VAE-GAN model and successfully implement it.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_positive"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 9,
      "text": "Weaknesses:",
      "suffix": "\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 10,
      "text": "[-] The combination of VAEs and GANs, while new for videos, had already been proposed for image generation as indicated in the Related Work section and its formulation for video prediction is relatively straightforward given existing VAE (Denton & Fergus 2018) and GAN models (Tulyakov et al. 2018).",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "arg_other",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 11,
      "text": "[-] The results indicate that SAVP offers a trade-off between the properties of GANs and VAEs, but does not go beyond its individual parts.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 12,
      "text": "For example, the experiment of Figure 5 does not show SAVP being significantly more diverse than GANs for KTH (as compared to VAEs).",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 13,
      "text": "Furthermore, Figure 6 and Figure 7 in general show SAVP performing worse than SVG (Denton & Fergus 2018), a VAE model with a significantly less complex generator, including for the metric (VGG cosine similarity) that the authors introduce arguing that PSNR and SSIM do not necessarily indicate prediction quality.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 14,
      "text": "While the use of a GAN in general will make the results less blurry and visually appealing, it does not necessarily mean that the samples it generates are going to be plausible or better.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 15,
      "text": "Since a direct application of video prediction is model-based planning, it seems that plausibility might be as important as sample quality.",
      "suffix": "",
      "review_action": "none",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 16,
      "text": "This work proposes to combine VAEs and GANs in a single model to get the benefits of both models.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 17,
      "text": "However, the experiments conducted generally show that SAVP offers only a trade-off between the visual quality of GANs and the coverage of VAEs, and does not show a clear advantage over current VAE models (Denton & Fergus, 2018) that with simpler architectures obtain similar results.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_meaningful-comparison",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 18,
      "text": "While the presentation is clear and the evaluation of the model is thorough, I am unsure of the significance of the proposed method.",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_motivation-impact",
      "polarity": "pol_negative"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 19,
      "text": "In order to better assess this model and compare it to its individual parts and other VAE models, could the authors:",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 20,
      "text": "1) Compare SAVP to the SVG-LP/FP model on a controlled synthetic dataset such as Stochastic Moving MNIST (Denton & Fergus, 2018)?",
      "suffix": "\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "none"
    },
    {
      "review_id": "r1gzmeTDnm",
      "sentence_index": 21,
      "text": "2) Comment on the plausibility of the samples generated by SAVP? Do some samples show imagined objects \u2013 implausible interactions for the robotic arm dataset? If so, what would be the advantage over blurry but plausible generations of a VAE?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_substance",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 0,
      "text": "We thank reviewer 1 for the detailed feedback.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 1,
      "text": "In this response, we clarify the accuracy-realism trade-off, revise the accuracy metrics, indicate reruns and new experiments, and address the individual questions.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 2,
      "text": "We updated Section 4.4 to indicate that it is to be expected that, although our SAVP model improves on diversity and realism, it also performs worse in accuracy compared to pure VAE models (both our own ablation and SVG).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 3,
      "text": "A recent result [1] proves that there is a fundamental tradeoff between accuracy and realism, for all problems with inherent ambiguity.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 4,
      "text": "In fact, a recent challenge held at ECCV 2018 in such a problem [2] evaluates all algorithms on both of these axes, as neither adequately captures performance.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 5,
      "text": "Although the SVG generator is simpler than ours, ours is just a simple variation from Ebert et al. (2017).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 6,
      "text": "Since proposing a strong generator architecture is not the goal of this paper,",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 7,
      "text": "any video generator (including the one from Denton & Fergus (2018)) could be used with our losses.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 8,
      "text": "We added this clarification to Section 3.4.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 9,
      "text": "Instead, we provide a systematic analysis of the effect of the loss function on this task (which could be applied to any generator).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 10,
      "text": "It's also worth noting that with a simpler feed-forward posterior and a unit Gaussian prior, our VAE ablation and SVG achieve similar performance on various metrics.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 11,
      "text": "We added Section 3.5 to point out the differences between the VAE component of our model and prior work.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          16,
          17,
          18
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 12,
      "text": "We have included a revised plot in Figure 14 (note that this temporary plot will be incorporated into Figure 6), where we use the official implementation of SSIM and replace the VGG metric with the LPIPS metric (Zhang et al., 2018).",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          16,
          17,
          18
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 13,
      "text": "LPIPS linearly calibrates AlexNet feature space to better match human perceptual similarity judgements.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 14,
      "text": "Aside for the first two predicted frames, our VAE ablation and the SVG model both achieve similar SSIM and LPIPS.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 15,
      "text": "After examining the KTH results further, we realized that our results are likely weaker than they should have been, because we did not use the same preprocessing as prior work.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 16,
      "text": "The experiments from our original submission cropped the videos into a square before resizing, and thus discarded information from the sides of the video.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 17,
      "text": "We are currently rerunning the KTH experiments and we plan to update the results in the paper.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17,
          18
        ]
      ],
      "details": {
        "manuscript_change": true
      }
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 18,
      "text": "We also didn't choose particular hyperparameters to ensure diversity for our models, and we expect some improvement in diversity in the new sets of experiments.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          15,
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 19,
      "text": "Although the combination of VAEs and GANs have been explored recently for conditional image generation (Zhang et al. 2018), the video prediction task is substantially different, with unique challenges, due to spatiotemporal relationships and inherent compounding uncertainty of the future.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 20,
      "text": "Furthermore, while the individual components have indeed been known for video prediction, their combination is novel and not present in prior work, and we demonstrate that this produces state-of-the-art results in terms of diversity and realism.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 21,
      "text": "In addition, this work provides a detailed comparison of the effect of the losses on the various metrics.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 22,
      "text": "Furthermore, we are currently running experiments for various weightings of the KL loss and the adversarial loss, and we plan to include additional results that illustrate the trade-offs based on these hyperparameters.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 23,
      "text": "Although MoCoGAN performs well for videos with a single frame-centered actor, it struggles with multiple simultaneously moving entities.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 24,
      "text": "The authors of MoCoGAN also mentioned in personal correspondence that the conditional version (i.e. video prediction) was significantly harder to train.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 25,
      "text": "We noticed the same in earlier iterations of our model.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 26,
      "text": "In our case, we found that the model would degenerate to static videos or videos with a cyclic flickering artifact, which are issues that aren't a problem in conditional image generation.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 27,
      "text": "We added details to Section 3.4 describing the importance of a few components, such as spectral normalization and not conditioning the discriminator in the ground-truth context frames.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          10
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 28,
      "text": "The purpose of adding adversarial losses to a pure VAE is to improve on blurry predictions where the latent variables alone cannot capture the uncertainty of the data.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          10,
          14,
          15,
          16,
          17,
          18
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 29,
      "text": "However, that is typically not the case of synthetic datasets.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10,
          14,
          15,
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 30,
      "text": "In early experiments, we trained our pure VAE model on the stochastic shape movement dataset from Babaeizadeh et al. (2018), and our pure VAE was able to model the dataset without any blur and with perfect separation of the possible futures.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10,
          14,
          15,
          16,
          17,
          18
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 31,
      "text": "We agree that plausibility is indeed important, and that's what our human subject studies try to capture.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10,
          21
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 32,
      "text": "Since we provide predictions of the whole sequence to the human evaluator, we are not only evaluating for image realism but also for plausibility of the dynamics.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10,
          21
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 33,
      "text": "Unlike the VAE models that implausibly erase the small objects that are being pushed in the BAIR dataset, our SAVP model moves those objects in a more plausible way.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          10,
          21
        ]
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 34,
      "text": "[1] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Conference on Vision and Pattern Recognition (CVPR), 2018. https://arxiv.org/abs/1711.06077",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 35,
      "text": "[2] Yochai Blau, Roey Mechrez, Radu Timofte, Tomer Michaeli, and Lihi Zelnik-Manor. 2018 PIRM Challenge on Perceptual Image Super-resolution. In Perceptual Image Restoration and Manipulation (PIRM) workshop at ECCV 2018.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "r1gzmeTDnm",
      "rebuttal_id": "rJexhWRXCm",
      "sentence_index": 36,
      "text": "https://arxiv.org/abs/1809.07517",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    }
  ]
}