\documentclass[12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{xcolor}
\usepackage{geometry}
\geometry{a4paper, margin=1in}

% Define environments for reviews and answers
\newenvironment{review}{\begingroup\bfseries}{\endgroup}
\newenvironment{answer}{\begingroup\color{blue}}{\endgroup}

\begin{document}

\begin{review}
Reviewer 1 (score 4)\\
\rule{\linewidth}{1pt}

Originality-Novelty: 1: Poor: The main ideas of the paper are either not novel or only incremental.
\end{review}

\begin{answer}
Our work introduces a novel variance reduction technique that leverages Hessian-vector products (HVPs) to bypass the traditional reliance on importance sampling weights. With this approach, we secure theoretical guarantees for convergence to an approximate second-order stationary point (SOSP) with an improved sample complexity of $\tilde{O}(\epsilon^{-3})$, which we believe represents a significant advancement in RL.
\end{answer}

\begin{review}

Q4 Main Weakness: Is finding SOSP in policy optimization not a strong result in policy optimization? There exists several results about convergence toward globally optimal solution such as Agarwal 2019 (Theory of policy gradient method) and many works after that.
\end{review}

\begin{answer}
While achieving global optimality is an attractive theoretical goal, in high-dimensional, nonconvex reinforcement learning problems, finding a global optimum is generally intractable. 
The work referenced by the reviewer provides global optimality guarantees for a specific class of tabular policies, where a gradient domination condition ensures that any first-order stationary point (FOSP) of the value function results in an approximately optimal policy.
Our focus on obtaining approximate SOSPs is motivated by the practical need to escape saddle points and avoid poor local optima. The guarantee of converging to an ($\epsilon, \sqrt{\rho}\epsilon$)-SOSP ensures that our algorithm does not get trapped in poor stationary points and produces more robust policies. Moreover, the improved sample complexity of our method further emphasizes its practical benefits. We will better motivate this in the introduction and discussion sections of the revised paper.
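
For concreteness, the two stationarity notions can be written as follows. This is a sketch in generic notation: the tolerances $\epsilon_g$ and $\epsilon_H$ are placeholder symbols introduced here, and the definition in the paper remains the authoritative one. For the maximization of $J$, a point $\theta$ is an $\epsilon_g$-FOSP if
\[
\|\nabla J(\theta)\| \le \epsilon_g ,
\]
and it is an $(\epsilon_g, \epsilon_H)$-SOSP if, in addition,
\[
\lambda_{\max}\left(\nabla^2 J(\theta)\right) \le \epsilon_H .
\]
The second condition excludes strict saddle points, where the gradient is small but the Hessian still has a direction of sufficiently positive curvature along which $J$ can be increased.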
\end{answer}

\begin{review}
Some representative methods on Mujoco such as Soft actor critic and other variants are not compared in the experiments.
\end{review}

\begin{answer}
Our experimental focus was on comparing our method against policy gradient techniques with explicit theoretical guarantees (e.g., PAGE-PG, HAPG, IS-MBPG, ACR-PN) to converge to either $\epsilon$-FOSP or ($\epsilon, \sqrt{\rho}\epsilon$)-SOSP. This selection was made to highlight the effectiveness of our variance reduction design over the previous ones proposed in the literature of stochastic policy gradient methods.


We acknowledge that off-policy methods such as SAC, and techniques like TD3 and TRPO, are widely regarded as effective for continuous control tasks. However, their performance can vary across implementations and settings. For instance, [Wang et al., 2022] reported that TRPO underperformed compared to IS-MBPG (which is itself outperformed by our method, VR-SCP). In contrast, in our experiments, we found TRPO to perform better, suggesting that its effectiveness may be highly sensitive to implementation details.
Given these discrepancies, we decided not to include algorithms without formal theoretical guarantees in our experiments, and we leave their investigation to future work.
\end{answer}

\begin{review}
Q5 Detailed Comments To The Authors:
Some claims seem not correct to me. For example, it was said in the related work that "the convergence analysis for these methods requires strong assumptions such as the boundedness of variance of IS weights", and also "Our proposal, VRSCP algorithm, does not require IS and ...". This seems to suggest that the proposed method doesn't have to use IS or can deal with unboundedness. However, I think this is mainly because of the difficulty for unbounded IS does go away because of the first inequality in Lemma 3.3. A bounded stochastic gradient surely doesn't need any other assumption like IS is bounded (it already has boundedness). The references where this Lemma is from should also be able to deal with this difficulty.
\end{review}

\begin{answer}
In our work, we require only Assumption 3.1 (Bounded reward) and Assumption 3.2 (Parameterization regularity) to establish our theoretical guarantees. The boundedness of the stochastic gradient is obtained from Assumptions 3.1 and 3.2(a), and no additional assumptions are necessary. In contrast, methods that rely on importance sampling (IS) weights must introduce an extra assumption to ensure the variance of the IS weights is bounded. Since our proposed method does not use IS weights, this assumption is not needed in our work. For a discussion on why assuming bounded variance of IS weights is not a reasonable assumption in practice, see the explanation following Assumption 3 in Huang et al. (2020).
\end{answer}


\newpage
\begin{review}
Reviewer 2 (score 7)\\
\rule{\linewidth}{1pt}


The theoretical section requires enhanced clarity. The meaning of the theorems is not explicitly explained. The explanation of the theory is suggested to be included, which will help the reader to better understand the paper.
\end{review}

\begin{answer}
    In order to prove Theorem 3.9, we need to provide bounds on $\| v_t - \nabla J(\theta_t) \|$ and $\| U_t - \nabla^2 J(\theta_t) \|^2$ as they appear in the recursive inequality in Eq. (77) for analyzing the convergence of VR-SCP. 
We began by showing that the term 
$\left\|\frac{1}{S_t}\sum_{s=1}^{S_t} \hat{\nabla}^2 J(\theta_{s,t}, \tau_s)(\theta_t - \theta_{t-1}) - \left(\nabla J(\theta_t) - \nabla J(\theta_{t-1})\right)\right\|$
is bounded by a quadratic function of $\|\theta_t - \theta_{t-1}\|$ with high probability (see Eq. (53) in the appendix). Based on this result, and by carefully choosing the batch sizes, we provided a bound on $\| v_t - \nabla J(\theta_t) \|_2^2$ at any time $t$ (with high probability) in Lemma 3.7. We also provided a bound on $\| U_t - \nabla^2 J(\theta_t) \|^2$ in Lemma 3.8. By substituting these bounds into the recursive inequality in Eq. (77), we obtained the desired convergence rate stated in Theorem 3.9. To the best of our knowledge, the high-probability analysis used to bound $\| v_t - \nabla J(\theta_t) \|$ is novel, and it enables us to obtain a sample complexity of $\tilde{O}(\epsilon^{-3})$.
We will add a discussion of the proof after the main theorem in the revised version.
\end{answer}

\begin{review}
    The current comparison results are not enough: The proposed method should be systematically compared with state-of-the-art policy gradient algorithms, including PGPE, REINFORCE++, and PPO.
\end{review}

\begin{answer}
    Our experimental focus was on comparing our method against policy gradient techniques with explicit theoretical guarantees (e.g., PAGE-PG, HAPG, IS-MBPG, ACR-PN) to converge to either $\epsilon$-FOSP or ($\epsilon, \sqrt{\rho}\epsilon$)-SOSP. This selection was made to highlight the effectiveness of our variance reduction design over the previous ones proposed in the literature of stochastic policy gradient methods.
    We acknowledge that off-policy methods such as SAC and TD3, as well as on-policy methods such as PPO and TRPO, are widely regarded as effective for continuous control tasks. However, their performance can vary across implementations and settings. For instance, [Wang et al., 2022] reported that TRPO underperformed compared to IS-MBPG (which is itself outperformed by our method, VR-SCP). In contrast, in our experiments, we found TRPO to perform better, suggesting that its effectiveness may be highly sensitive to implementation details.
Given these discrepancies, we decided not to include algorithms without formal theoretical guarantees in our experiments, and we leave their investigation to future work.
\end{answer}

\begin{review}
Besides PR, some other metrics should also be evaluated, such as computational efficiency, since the method involves second-order gradient calculation.
\end{review}




\begin{answer}
%Regarding computational complexity, our experiments indicate that although our method introduces some additional computation compared to other RL baselines, this is offset by a significant reduction in the number of system probes and overall training cost. 
We emphasize that the cost of computing Hessian-Vector Products (HVPs) using Pearlmutter’s algorithm [Pearlmutter, 1994] is of the same order as that of computing gradients, namely $O(Hd)$ where \(d\) is the dimension of the gradient vector and \(H\) is the horizon.
Below, we provide a comparison of the runtime needed to reach an average return of 250 in the Humanoid environment considered in the paper:
\begin{center}
\begin{tabular}{lr}
\hline
Algorithm & Time (min) \\
\hline
IS-MBPG & 54.64 \\
VR-SCP & 16.66 \\
VR-BGPO & 7.41 \\
PAGE-PG & 7.49 \\
\hline
\end{tabular}
\end{center}

Based on these runtimes, the runtime of our algorithm is of the same order as that of PAGE-PG (16.66 vs. 7.49 minutes) and roughly three times shorter than that of IS-MBPG (54.64 minutes). Both of these baselines rely on stochastic gradients in their update rules, which indicates that the HVP computations in our method do not significantly affect overall efficiency.
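
As a brief illustration of why an HVP costs about as much as a gradient, consider a generic twice-differentiable scalar function $f$ (a placeholder introduced here for illustration, not the estimator used in the paper). The identity underlying Pearlmutter's technique is
\[
\nabla^2 f(\theta)\, v = \nabla_\theta \langle \nabla f(\theta), v \rangle = \frac{d}{d\alpha} \nabla f(\theta + \alpha v) \Big|_{\alpha = 0},
\]
where the vector $v$ is held fixed. Both right-hand sides require only one additional differentiation pass, so automatic differentiation evaluates the product at a small constant multiple of the cost of one gradient, without ever forming the $d \times d$ Hessian.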

\end{answer}

\begin{review}
Q5 Detailed Comments To The Authors:
\begin{review}
    1. What is PR in your experiments for the evaluation metrics? It is required to be defined at its first appearance. 

\end{review}

\begin{answer}


The performance-robustness (PR) metric is defined in Eq. (23) and is used for hyper-parameter tuning. In particular, PR is defined as the average of the lower confidence intervals (LCIA) of the return over the system probes. We will make sure that PR is defined at its first appearance in the revised version.
%In practice, based on the task complexity, we consider a large enough number of evaluation points (e.g., 10 million system probs) and compute the average of ICI over all these evaluation points.
\end{answer}

\begin{review}
2. Current experiments focus on traditional policy gradient methods. It is expected to include comparisons with some SOTA policy models, such as diffusion models or Transformer-based methods to demonstrate VR-SCP's generality. 
\end{review}

\begin{answer}
    We appreciate the reviewer's suggestion. Computing HVPs (or, for that matter, applying any other optimization technique) for large models such as transformers is considerably more time-consuming and requires substantial computational resources (e.g., additional GPUs). It therefore makes sense to first establish that a new approach is indeed advantageous compared with state-of-the-art optimization approaches. This is the goal of our current paper, and the reviewer's suggestion is an interesting direction for future work.
     %We appreciate the reviewer’s suggestion. However, given our current computational and time constraints, we were unable to test these models.  This can be an interesting study for future work.
\end{answer}

\begin{review}
3. The mechanism of HVP in variance reduction is not explicitly explained. Add a schematic or formula derivation to illustrate how HVP reduces gradient estimation variance. 
\end{review}

\begin{answer}

    In the update $v_t=\frac{1}{S_t}\sum_{s=1}^{S_t} \hat{\nabla}^2 J(\theta_{s,t},\tau_s)(\theta_t-\theta_{t-1})+ v_{t-1}$ (line 2 of Algorithm 1), the term $\frac{1}{S_t}\sum_{s=1}^{S_t} \hat{\nabla}^2 J(\theta_{s,t},\tau_s)(\theta_t-\theta_{t-1})$ provides an estimate of $\nabla J(\theta_t) - \nabla J(\theta_{t-1})$ (see the proof of Lemma 3.7 in the appendix for more details). In stochastic optimization, the SARAH update likewise uses a batch of differences $\nabla f(\theta_t,z) - \nabla f(\theta_{t-1},z)$ to estimate $\nabla f(\theta_t) - \nabla f(\theta_{t-1})$. However, applying it directly in the RL setting requires IS weights, because the same randomness $z$ must be used at the two consecutive points. Our update bypasses this requirement by leveraging an HVP term.
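
    The reason an HVP average can track the gradient difference can be sketched as follows. The sketch assumes that $\theta_{s,t}$ is drawn uniformly on the segment between $\theta_{t-1}$ and $\theta_t$ and that $\hat{\nabla}^2 J(\theta, \tau)$ is an unbiased estimate of $\nabla^2 J(\theta)$ when $\tau$ is sampled under $\pi_\theta$; both are stated here as illustrative assumptions, and the precise construction is the one given in Algorithm 1. By the fundamental theorem of calculus,
    \[
    \nabla J(\theta_t) - \nabla J(\theta_{t-1}) = \int_0^1 \nabla^2 J\left(\theta_{t-1} + u(\theta_t - \theta_{t-1})\right)(\theta_t - \theta_{t-1})\, du .
    \]
    Writing $\theta_{s,t} = \theta_{t-1} + u_s(\theta_t - \theta_{t-1})$ with $u_s$ uniform on $[0,1]$ and $\tau_s$ sampled under $\pi_{\theta_{s,t}}$, each summand then satisfies
    \[
    \mathrm{E}\left[\hat{\nabla}^2 J(\theta_{s,t}, \tau_s)(\theta_t - \theta_{t-1})\right] = \nabla J(\theta_t) - \nabla J(\theta_{t-1}),
    \]
    so the gradient difference is estimated from trajectories sampled at the intermediate points themselves, and no IS weight is involved.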
\end{answer}

\begin{review}
4. Could the proposed method be applied to solve the discrete action space problems? For the formulation, it could deal with the discrete action space, but the experiments are limited to continuous control tasks. It is better to clarify applicability boundaries for discrete action spaces. 
\end{review}

\begin{answer}
    The results hold for both continuous and finite state and action spaces, as long as the regularity conditions on the parameterized policy in Assumption 3.2 hold. As an example for finite state and action spaces, it has been shown in [Masiha et al. 2022] (Section A.4.4) that, for the tabular softmax policy, there exist constants $G$, $L_1$, and $L_2$ satisfying Assumption 3.2.
    We will clarify this in the revised version.
    
[Masiha et al. 2022] Masiha, Saeed, Saber Salehkaleybar, Niao He, Negar Kiyavash, and Patrick Thiran. ``Stochastic second-order methods improve best-known sample complexity of SGD for gradient-dominated functions.'' Advances in Neural Information Processing Systems 35 (2022): 10862-10875.
\end{answer}


\begin{review}
5. The hyperparameter tuning section mentions grid search but does not specify search ranges, which is required to be clarified. 
\end{review}

\begin{answer}
    In the revised version, we will include a table with the search ranges of the hyperparameters.
\end{answer}

6. The related work on SCR-PG and STORM-PG are excluded due to unavailable code. I do not think this reason is acceptable for missing the comparison with important baselines.
\end{review}

\begin{answer}
%The primary reason for not including these baselines was the lack of reliable and official implementations available at the time of our experiments. Implementing their methods based solely on their descriptions could be sensitive to the implementation details, which made a fair comparison challenging.

Implementing these methods based solely on their descriptions could have put them at a disadvantage, since the choice of hyperparameters and, more generally, various implementation details could significantly hinder a fair comparison.

%Furthermore, SCR-PG is very similar to ACR-PN in terms of both sample complexity and algorithm. In our experiments, ACR-PN is clearly outperformed by our method, suggesting that incorporating SCR-PG would likely yield similar trends. On the other hand, 
Moreover, based on the experimental comparisons reported in the SCR-PG paper, the performance gap between IS-MBPG and SCR-PG is small, suggesting that including SCR-PG would likely yield trends similar to those of IS-MBPG. Nonetheless, if the reviewer wishes, we can directly compare with SCR-PG and STORM-PG to further strengthen our experiments. %In the revised version, we will add the results of SCR-PG and STORM-PG in case their codes become available.
%In our revised submission, we will implement SCR-PG and STORM-PG carefully and include detailed comparisons with these baselines.



\end{answer}
\newpage
\begin{review}

Reviewer 3 (score 7)\\
\rule{\linewidth}{1pt}

Q4 Main Weakness:
The proof techniques and ideas seem to be borrowed heavily from the literature, and the differences should be articulated.
\end{review}


\begin{review}
Q5 Detailed Comments To The Authors:
1. It seems that many ideas and techniques are borrowed from the existing literature. What are the technical barriers that are overcome in this work? 
\end{review}

\begin{answer}
In the context of stochastic optimization, Hessian-vector product (HVP) terms are primarily used to solve the sub-problem in cubic regularized Newton (CRN) methods. As mentioned in the related work, [Zhou and Gu 2020] introduced a variance-reduced version of stochastic CRN (SCRN) with a sample complexity of $O(\epsilon^{-3})$, where the HVP terms are used only in solving the sub-problem. However, using HVP terms for variance reduction is not common practice in the stochastic optimization literature, because the loss function is oblivious (i.e., its randomness is independent of the model parameters). Variance reduction in stochastic optimization instead often relies on the difference between gradients at two consecutive points, $\theta_{t-1}$ and $\theta_t$, evaluated with the same randomness.

    In contrast, in the RL setting, the randomness (i.e., the trajectory) depends on the model's parameters. Adapting variance reduction techniques from stochastic optimization to RL often requires the use of importance sampling (IS) weights. To bypass IS weights, we incorporate HVP terms into the variance reduction framework.
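
    The distinction can be made explicit in generic notation (the symbols $f$, $z$, $\mathcal{D}$, $R$, $p$, $g$, and $w$ below are placeholders introduced for illustration). In oblivious stochastic optimization, the objective is
    \[
    F(\theta) = \mathrm{E}_{z \sim \mathcal{D}}\left[f(\theta, z)\right],
    \]
    where $\mathcal{D}$ does not depend on $\theta$, so the same sample $z$ can be reused at $\theta_t$ and $\theta_{t-1}$. In RL, the objective is
    \[
    J(\theta) = \mathrm{E}_{\tau \sim p(\cdot \mid \theta)}\left[R(\tau)\right],
    \]
    and a trajectory $\tau$ sampled under $\theta_t$ is not distributed according to $p(\cdot \mid \theta_{t-1})$. A SARAH-type difference estimator built from such a trajectory therefore typically takes the form
    \[
    g(\tau; \theta_t) - w(\tau)\, g(\tau; \theta_{t-1}), \qquad w(\tau) = \frac{p(\tau \mid \theta_{t-1})}{p(\tau \mid \theta_t)},
    \]
    where $g(\tau; \theta)$ denotes a stochastic gradient estimate. The ratio $w(\tau)$ is a product of per-step policy ratios over the horizon, which is why its variance can be large and is usually controlled through an additional assumption.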
 
    
    The only other work we are aware of that used HVP for variance reduction in RL is HAPG. However, our method differs in three main aspects, as highlighted in the introduction of the submitted paper. To reiterate, the key differences are as follows:

    \begin{enumerate}
    \item HAPG requires the parameters to be updated using a fixed step size $\epsilon$ to bound the variance of gradient estimates. This constraint can slow down the training process in practice.
    \item In HAPG, the number of HVP computations per iteration is $O(1/\epsilon)$, whereas in our approach it depends on the norm of the parameter updates, leading to potentially lower computational overhead.
    \item HAPG leverages second-order information solely through HVPs for variance reduction and hence only achieves an $\epsilon$-FOSP. As a result, as we observed in our experiments, it misses the main performance advantages of converging to an $(\epsilon, \sqrt{\rho \epsilon})$-SOSP. In particular, in the Walker environment, other methods (including HAPG) have lower performance (which might be explained by getting stuck in bad local minima or saddle points), while VR-SCP has much higher performance, showing its ability to escape saddle points.
    \end{enumerate}
\end{answer}

    \begin{review}
        2. What does $\lfloor t/Q\rfloor.Q$ mean above Lemma 3.7? 
    3. In Algorithm 1, why are there two different batches? Are these two related? 

    \end{review}
    \begin{answer}
    In our algorithm, we use a checkpoint period $Q$. That is, every $Q$ iterations (i.e., when $t \bmod Q = 0$), the gradient estimate $v_t$ is reset to the average of the stochastic gradients computed over a batch $\mathcal{B}_{check}$. Thus, at any iteration $t$, the most recent checkpoint occurs at iteration $\lfloor t / Q \rfloor \cdot Q$, where $\lfloor \cdot \rfloor$ denotes the floor operation. In Lemma~3.7, we show that, for any iteration $t$, the squared norm of the gradient estimation error, $\| v_i-\nabla J(\theta_i) \|^2_{2}$, is bounded for all iterations $i$ satisfying $\lfloor t/Q \rfloor \cdot Q \le i \le t$, with probability at least $1-2\delta(t-\lfloor t/Q \rfloor \cdot Q)$.
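
    As a small worked example with hypothetical values, take $Q = 10$ and $t = 37$. The most recent checkpoint is then at iteration $\lfloor 37/10 \rfloor \cdot 10 = 30$, so the bound covers the iterations $i \in \{30, \dots, 37\}$, and the probability expression above evaluates to $1 - 2\delta(37 - 30) = 1 - 14\delta$.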


    As mentioned above, $\mathcal{B}_{check}$ is the batch of trajectories sampled from the policy $\pi_{\theta_t}$ at the checkpoints. This batch is used to provide an unbiased estimate of the gradient at the checkpoints. $\mathcal{B}_{h}$ is a separate batch of trajectories sampled from the policy $\pi_{\theta_t}$ and used to compute HVPs with any vector $v$ at iteration $t$, through the function $U_{t}[\cdot]$. Based on Lemmas 3.7 and 3.8, we established the batch sizes needed for the estimation of the gradient and of the HVPs, respectively (see Theorem 3.9).
    % This mechanism limits the accumulation of stochastic errors by “resetting” the estimator at regular intervals. Such a design is key to controlling the variance in our algorithm and hence to the convergence guarantees we establish.
\end{answer}

    \begin{review}
        4. Some explanations of Cubic-Subsolver and CubicFinalsolver should be given in the main text.
    \end{review}
    \begin{answer}

    Thank you for pointing this out. We will include a more detailed explanation of these algorithms in the main body in the revised version.

    
\end{answer}
\newpage
\begin{review}
Reviewer 4 (score 3)\\
\rule{\linewidth}{1pt}

Q4 Main Weakness:
Extent To Which Claims Are Supported By Evidence: Several claims may be debatable.
\begin{enumerate}
    \item "The proposed algorithm outperforms the state of the art and is more robust to changes in random seeds." Clearly, the paper didn't compare the proposed method with any state-of-the-art algorithms in Mujoco continuous controls, e.g., Soft Actor-Critic (SAC) and TD3.


\begin{answer}
Our experimental focus was on comparing our method against policy gradient techniques with explicit theoretical guarantees (e.g., PAGE-PG, HAPG, IS-MBPG, ACR-PN) to converge to either $\epsilon$-FOSP or ($\epsilon, \sqrt{\rho}\epsilon$)-SOSP. This selection was made to highlight the effectiveness of our variance reduction design over the previous ones proposed in the literature of stochastic policy gradient methods. 


We acknowledge that off-policy methods such as SAC, and techniques like TD3 and TRPO, are widely regarded as effective for continuous control tasks. However, their performance can vary across implementations and settings. For instance, [Wang et al., 2022] reported that TRPO underperformed compared to IS-MBPG (which is itself outperformed by our method, VR-SCP). In contrast, in our experiments, we found TRPO to perform better, suggesting that its effectiveness may be highly sensitive to implementation details.
Given these discrepancies, we decided not to include algorithms without formal theoretical guarantees in our experiments, and we leave their investigation to future work.
\end{answer}



    
    \item "No theoretical guarantees for the convergence rate of most of these algorithms are available." There are some theoretical analyses of TRPO and PPO, see Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy.
    
    \begin{answer}
        While it is true that some recent work (e.g., ``Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy'') has provided theoretical analyses of TRPO and PPO, these analyses come with restrictive assumptions (for instance, the work suggested by the reviewer assumes that the policy is an over-parameterized shallow neural network with two layers). Our statement was intended to emphasize that, in the context of \textbf{nonconvex} policy optimization, there is no theoretical guarantee that these methods converge to either an FOSP or an SOSP, whereas our method provides explicit guarantees of convergence to an ($\epsilon, \sqrt{\rho}\epsilon$)-SOSP with a sample complexity of $\tilde{O}(\epsilon^{-3})$.
    \end{answer}
    \item "A new metric that incorporates both performance (the average return) and robustness (sensitivity to random seeds) … crucial in terms of reproducibility of the results." This suggests that all previous published results reported via mean and standard deviation over multiple runs are not reproducible? Isn't the standard deviation the metric for robustness?

    \begin{answer}
     We kindly draw the reviewer's attention to the full sentence in our paper: ``we define a new metric that incorporates both
performance (the average return) and robustness (sensitivity to random seeds) of an RL algorithm, where the
latter is crucial in terms of reproducibility of the results.'' The term ``latter'' therefore refers to robustness. In the last paragraph of page 7 (right column), we also mention the standard deviation (STD) as a measure of robustness. However, our main goal is to define a metric that combines both performance and robustness and that can be used for hyperparameter selection. For further details, please refer to the explanation following Equation (23).
    \end{answer}
\end{enumerate}



  
\end{review}


\begin{review}
Weak baselines and configurations: In the empirical evaluation, the paper compared the proposed algorithm "with PG methods that provide theoretical guarantees" and REINFORCE. If REINFORCE was compared against, I don't see why we shouldn't compare the method with other PG variants, including A3C, TD3, PPO, which should have much stronger performance than the baselines considered in the paper.
\begin{answer}
    Please see our response above regarding the choice of baselines.
\end{answer}

Also, "We chose a maximum horizon of 500 for Walker, Hopper, and Humanoid and 50 for Reacher." The choice of horizon seems non-standard. Why not 1000 for all? (as adopted in many other RL papers?).

\begin{answer}
    The chosen horizon lengths  (500 for Walker, Hopper, and Humanoid, and 50 for Reacher) were based on standard settings reported in the literature. For instance, HAPG [Shen et al., 2019] employed a horizon of 500 for most environments, and IS-MBPG [Huang et al., 2020] used a horizon of 500 for Walker. 
    % We would like to note that using a horizon of 500 instead of 1000 generally makes the task more challenging, as there are fewer steps available for the agent to explore the environment.
\end{answer}
\end{review}


\begin{review}
Theoretical results not corroborated by the experiments: Moreover, there is no direct connection between the theoretical results and the empirical analysis. If the strong theoretical guarantees are promised in the theory under 
, then is there a way to corroborate such guarantees?
\end{review}

\begin{answer}
We agree that bridging theory and practice is important. While our theoretical results provide convergence guarantees and sample complexity bounds, our experiments demonstrate that these improvements translate into better empirical performance and robustness. 
In particular, in the Walker environment, other methods have lower performance (which might be explained by getting stuck in bad local minima or saddle points), while VR-SCP has much higher performance, showing its ability to escape saddle points.
For a more direct evaluation, we will include in the revised version plots showing the decay of the gradient norm over time. Additionally, we plan to include a plot of the metric $\mu$, provided that computing the minimum eigenvalue is not computationally prohibitive.
\end{answer}

\begin{review}
Clarity Of Writing: The paper is generally very hard to read and understand. First, for the main motivation "A natural question to ask is whether there exists a second-order PG method that converges … but without using IS weights?" The paper didn't explain why this question is important to study (many questions can be interesting but maybe insignificant?). Second, there is no general introduction of the algorithm: what's the underlying intuition? 

\begin{answer}
    Regarding the motivation for not using IS weights in the variance reduction component, the first paragraph of page 2 (left column) provides a detailed explanation of the drawbacks of IS weights in variance-reduced policy gradient methods, specifically in the context of reinforcement learning. In the revised version, we will explicitly reference this discussion before introducing the main research question addressed in our work. As for the algorithm, while it is described in Section 3.1, we agree with the reviewer that it would be helpful to present a high-level overview of its components at the beginning. We will incorporate this in the revised version.

    
\end{answer}
Third, several statements are confusing: "Figure 1 illustrates why we take the average." I don't see how figure 1 illustrates the choice. On the contrary, Figure 1 shows that LCIA ranking can be upended if system probes increase. "LCIA .. prevents us from choosing the hyper-parameters that aggressively improve the returns in the beginning but can degrade drastically by the end of the horizon." Shouldn't we search the hyper-parameter space to achieve the long-term returns?
\end{review}

\begin{answer}
The statement ``prevents us from choosing the hyper-parameters that
aggressively improve the returns in the beginning but can
degrade drastically by the end of the horizon.'' is meant to convey that our performance-robustness (PR) metric favors hyper-parameter configurations that yield consistent, stable performance throughout the entire training period rather than those that achieve early gains but fail to sustain them later. In other words, by averaging LCIA over the full training horizon, our PR metric guides the selection toward hyper-parameter configurations that perform robustly over time. Moreover, we compare methods under a fixed budget of system probes. If two configurations achieve the same average return at the end of training, LCIA selects the one with a better rising trend.
\end{answer}

\begin{review}
Q5 Detailed Comments To The Authors:
1. The variance reduction part in the second-order policy gradient is unclear to me. Can authors explain how the variance is reduced? 
\end{review}
\begin{answer}
    In the update $v_t=\frac{1}{S_t}\sum_{s=1}^{S_t} \hat{\nabla}^2 J(\theta_{s,t},\tau_s)(\theta_t-\theta_{t-1})+ v_{t-1}$ (line 2 of Algorithm 1), the term $\frac{1}{S_t}\sum_{s=1}^{S_t} \hat{\nabla}^2 J(\theta_{s,t},\tau_s)(\theta_t-\theta_{t-1})$ provides an estimate of $\nabla J(\theta_t) - \nabla J(\theta_{t-1})$ (see the proof of Lemma 3.7 in the appendix for more details). In stochastic optimization, the SARAH update likewise uses a batch of differences $\nabla f(\theta_t,z) - \nabla f(\theta_{t-1},z)$ to estimate $\nabla f(\theta_t) - \nabla f(\theta_{t-1})$. However, applying it directly in the RL setting requires IS weights, because the same randomness $z$ must be used at the two consecutive points. Our update bypasses this requirement by leveraging an HVP term.
\end{answer}
\begin{review}
2. "GAE … exploiting a temporal difference relation for the advantage function approximation." There is no function approximation for advantages in GAE. GAE is to estimate the advantages from value functions.
\end{review}

\begin{answer}
\textcolor{blue}{Thanks for the comment. We will revise the sentence to: ``(Schulman et al., 2015) presented GAE to control both bias and variance by exploiting a temporal difference relation for advantage estimation.''
}
\end{answer}
\newpage
\begin{review}
Reviewer 5 (score 7)\\
\rule{\linewidth}{1pt}

Q4 Main Weakness:
1. It would be beneficial if the authors could include a detailed runtime comparison of the proposed algorithm against the baselines in the experimental section.
\end{review}

\begin{answer}
%Regarding computational complexity, our experiments indicate that although our method introduces some additional computation compared to other RL baselines, this is offset by a significant reduction in the number of system probes and overall training cost.
We emphasize that the cost of computing Hessian-Vector Products (HVPs) using Pearlmutter’s algorithm [Pearlmutter, 1994] is of the same order as that of computing gradients, namely $O(Hd)$ where \(d\) is the dimension of the gradient vector and \(H\) is the horizon.
Below, we provide a comparison of the runtime needed to reach a return of 250 in the Humanoid environment considered in the paper:
\begin{center}
\begin{tabular}{lr}
\hline
Algorithm & Time (min) \\
\hline
IS-MBPG & 54.64 \\
VR-SCP & 16.66 \\
VR-BGPO & 7.41 \\
PAGE-PG & 7.49 \\
\hline
\end{tabular}
\end{center}

Based on these runtimes, the runtime of our algorithm is of the same order as that of PAGE-PG (16.66 vs. 7.49 minutes) and roughly three times shorter than that of IS-MBPG (54.64 minutes). Both of these baselines rely on stochastic gradients in their update rules, which indicates that the HVP computations in our method do not significantly affect overall efficiency.
\end{answer}

\begin{review}
   2. Could the authors provide some suggestions on how to choose (T) in the performance-robustness metrics? 
\end{review}


\begin{answer}



    PR is defined as the average of the lower confidence intervals (LCIA) of the average return over the evaluation checkpoints (system probes). In practice, based on the task complexity, we consider a training horizon (e.g., for control tasks, 10 million system probes) and compute the average of LCIA over all these evaluation points.
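
    One way to make the role of $T$ concrete is the following sketch. The symbols used here (the number of seeds $n$, the sample mean $\bar{R}_k$ and sample standard deviation $s_k$ of the return at evaluation point $k$, and a confidence multiplier $c$) are assumptions introduced for illustration; the exact definition is the one in Eq. (23) of the paper. With $T$ evaluation points,
    \[
    \mathrm{LCI}_k = \bar{R}_k - c\,\frac{s_k}{\sqrt{n}}, \qquad \mathrm{PR} = \frac{1}{T}\sum_{k=1}^{T} \mathrm{LCI}_k .
    \]
    Under this form, $T$ is the number of evaluation points that fall within the fixed budget of system probes, so it follows from the training horizon and the evaluation frequency rather than being tuned separately.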
\end{answer}



\end{document}