{
  "metadata": {
    "forum_id": "BkeU5j0ctQ",
    "review_id": "Syev33W527",
    "rebuttal_id": "rkxM5djb0Q",
    "title": "CEM-RL: Combining evolutionary and gradient-based methods for policy search",
    "reviewer": "AnonReviewer3",
    "rating": 7,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=BkeU5j0ctQ&noteId=rkxM5djb0Q",
    "annotator": "anno0"
  },
  "review_sentences": [
    {
      "review_id": "Syev33W527",
      "sentence_index": 0,
      "text": "Gradient-free evolutionary search methods for Reinforcement Learning are typically very stable, but scale poorly with the number of parameters when optimizing highly-parametrized policies (e.g. neural networks).",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 1,
      "text": "Meanwhile, gradient-based deep RL methods, such as DDPG are often sample efficient, particularly in the off-policy setting when, unlike evolutionary search methods, they can continue to use previous experience to estimate values.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 2,
      "text": "However, these approaches can also be unstable.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 3,
      "text": "This work combines the well-known CEM search with TD3 (an improved variant of DDPG).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 4,
      "text": "The key idea of of this work is in each generation of CEM, 1/2 the individuals are improved using TD3 (i.e. the RL gradient).",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 5,
      "text": "This method is made more practical by using a replay buffer so experience from previous generations is used for the TD3 updates and importance sampling is used to improve the efficiency of CEM.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 6,
      "text": "This work shows, on some simple control tasks, that this method appears to result in much stronger performance compared with CEM, and small improvements over TD3 alone.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 7,
      "text": "It also typically out-performs ERL.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 8,
      "text": "Intuitively, it seems like it may be possible to construct counter-examples where the gradient updates will prevent convergence.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 9,
      "text": "Issues of convergence seem like they deserve some discussion here and potentially could be examined empirically (is CEM-TD3 converging in the swimmer?).",
      "suffix": "\n\n",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 10,
      "text": "The justification that the method of Khadka & Tumer (2018) cannot be extended to use CEM, since the RL policies do not comply with the covariance matrix is unclear to me.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 11,
      "text": "Algorithm 1, step 20, the covariance matrix is updated after the RL step so regardless of how the RL policies are generated, the search distribution on the next distribution includes them.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 12,
      "text": "In both this work, and Khadka & Tumer, the RL updates lead to policies that differ from the search distribution (indeed that is the point), and there is no guarantee in this work that the TD3 updates result in policies close to the starting point.",
      "suffix": "",
      "review_action": "arg_evaluative",
      "fine_review_action": "none",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 13,
      "text": "It sees like the more important distinction is that, in this approach, the information flows both from ES to RL and vice-versa, rather than just from RL to ES.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 14,
      "text": "One view of this method would be that it is an ensemble method for learning the policy [e.g. similar to Osband et al., 2016 for DQN].",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 15,
      "text": "This could be discussed and a relevant control would be to keep a population (ensemble) of policies, but only update using RL while sharing experience across all actors.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_soundness-correctness",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 16,
      "text": "This would isolate the ensemble effect from the evolutionary search.",
      "suffix": "\n\n",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 17,
      "text": "Minor issues:",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 18,
      "text": "- The ReLU non-linearity in DDPG and TD3 prior work is replaced with tanh.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 19,
      "text": "This change is noted, but it would be useful to make a least a brief (i.e. one sentence) comment on the motivation for this change.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_clarity",
      "polarity": "pol_negative"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 20,
      "text": "- The paper is over the hard page limit for ICLR so needs to be edit to reduce the length.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "arg_other",
      "polarity": "pol_negative"
    },
    {
      "review_id": "Syev33W527",
      "sentence_index": 21,
      "text": "Osband I, Blundell C, Pritzel A, Van Roy B. Deep exploration via bootstrapped DQN. InAdvances in neural information processing systems 2016 (pp. 4026-4034).",
      "suffix": "",
      "review_action": "arg_other",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 0,
      "text": "We thank the reviewer for his/her positive evaluation of our paper and for raising many very useful points which helped us getting to a clearer picture of our contribution.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 1,
      "text": "A few of these points deserve discussion beyond the changes made in the paper.",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 2,
      "text": "Due to a mistake on page 2, we got the reviewer confused believing we are using importance sampling while we are using importance mixing instead.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_contradict-assertion",
      "alignment": [
        "context_sentences",
        [
          5
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 3,
      "text": "This has been fixed.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          5
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 4,
      "text": "The reviewer mentions it may be possible to construct counter-examples where the gradient updates will prevent convergence.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 5,
      "text": "This is a very important point.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_contradict-assertion",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 6,
      "text": "There are many RL problems (see e.g. Continuous Mountain Car, Colas et al. at ICML 2018) where at some point the gradient computed by the critic is deceptive, i.e. it drives the policy parameters into a wrong direction.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_contradict-assertion",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 7,
      "text": "In that case, applying that gradient to CEM actors as we do in CEM-RL is counter-productive.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 8,
      "text": "But the fact that we only apply this gradient to half the population makes it that CEM-RL should nevertheless overcome this issue:  the actors which did not receive a gradient step will be selected and the population will continue improving.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_contradict-assertion",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 9,
      "text": "However, admittedly, in this very specific context, CEM-RL is behaving as a CEM with only half a population, thus it is less efficient than the standard CEM.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 10,
      "text": "Besides, ERL even better resists than our approach to the same issue: if the actor generated by DDPG does not perform better than the evolutionary population due to a deceptive gradient issue, then this actor is just ignored, and the evolutionary part behaves as usual, without any loss in performance.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 11,
      "text": "This deceptive gradient issue certainly explains why CEM is the best approach on Swimmer.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 12,
      "text": "Finally, it may also happen that the RL part does not bring benefit just because the current critic is wrong and provides an inadequate gradient, in a non-deceptive gradient case.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 13,
      "text": "All the above points have now been made much clear in the new version of the paper, in particular we added an appendix dedicated to the swimmer benchmark.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          8,
          9
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 14,
      "text": "The reviewer also raises doubts about the fact that the method of Khadka & Tumer (2018) cannot be extended to use CEM.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 15,
      "text": "After second thoughts, this is absolutely right.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 16,
      "text": "As the reviewer says, in both this work and Khadka & Tumer, the RL updates lead to policies that may differ a lot from the search distribution and there is no guarantee in this work that the TD3 updates result in policies close to the starting point.",
      "suffix": "\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 17,
      "text": "But if the RL actor shows good enough performance, this does not prevent from computing a new covariance matrix which includes it.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 18,
      "text": "The corresponding ellipsoid in the search space may be very large, leading to a widespread next generation, but the process should tend to converge again towards a population of actors where evolutionary and RL actors are closer to each other.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_mitigate-criticism",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 19,
      "text": "A result of these second thoughts is that one could definitely build an ERL algorithm where the evolutionary part is replaced by CEM.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 20,
      "text": "We corrected the paper according to this new insight.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12,
          13
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 21,
      "text": "Unfortunately we did not find enough time to implement and test this algorithm during the rebuttal stage, but we now mention this possibility as an interesting avenue for future work.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_future",
      "alignment": [
        "context_sentences",
        [
          10,
          11,
          12,
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 22,
      "text": "Despite the very interesting points above, the reviewer is wrong when saying that the main distinction between our approach and the ERL approach is that only in ours the information flow is from ES to RL and vice-versa.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_contradict-assertion",
      "alignment": [
        "context_sentences",
        [
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 23,
      "text": "Actually, in ERL, if the RL actor added to the population performs well, it will steer the whole evolutionary population to the right direction just by generating offsprings, so RL and ES also benefit from each other.",
      "suffix": "\n\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_contradict-assertion",
      "alignment": [
        "context_sentences",
        [
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 24,
      "text": "A lot of our effort during the rebuttal stage has been focused on better highlighting the often subtle differences between ERL and our approach.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_summary",
      "alignment": [
        "context_sentences",
        [
          13
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 25,
      "text": "For doing so, we replaced Figure 1 with a figure directly contrasting CEM-RL to ERL.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          13
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 26,
      "text": "We also added Figure 6 which better highlights the properties of the algorithms and we performed several additional studies described either in the main text or in appendices.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          13
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 27,
      "text": "The next point of the reviewer is that a good deal of the strong performance of our method and RL may just be due to the fact that we are using multiple actors, thus benefiting from an \"ensemble method\" effect already mentioned in several papers such as Osband et al., 2016 for DQN.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14,
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 28,
      "text": "This point is absolutely valid.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_concede-criticism",
      "alignment": [
        "context_sentences",
        [
          14,
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 29,
      "text": "The reviewer thus suggests a relevant control which would be to keep a population (ensemble) of policies, but only update using RL while sharing experience across all actors.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14,
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 30,
      "text": "This would isolate the ensemble effect from the evolutionary search effect.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_sentences",
        [
          14,
          15,
          16
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 31,
      "text": "We performed the suggested control.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14,
          15,
          16
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 32,
      "text": "The resulting algorithm is a multiple-actor version of TD3.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14,
          15,
          16
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 33,
      "text": "Results show that CEM-TD3 actually outperforms this multiple-actor TD3, thus the CEM part actually brings performance improvement.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          14,
          15,
          16
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 34,
      "text": "About replacing the ReLU non-linearity in DDPG and TD3 prior work with tanh, we spotted that we could get much better results on several environments with the latter.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          18,
          19
        ]
      ],
      "details": {}
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 35,
      "text": "This explanation is now clearly mentioned in the paper, and motivates a future work direction which consists in using \"neural architecture search\" for RL problems, the performance of algorithms being a lot dependent on such architecture details.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          18,
          19
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    },
    {
      "review_id": "Syev33W527",
      "rebuttal_id": "rkxM5djb0Q",
      "sentence_index": 36,
      "text": "Finally, to keep our paper shorter than the hard page limit for ICLR while addressing all the reviewers points, we had to move several studies into appendices, starting with the importance mixing study.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_done",
      "alignment": [
        "context_sentences",
        [
          20
        ]
      ],
      "details": {
        "request_out_of_scope": true
      }
    }
  ]
}