\section{Evaluation}
\label{evaluation}



To demonstrate the effectiveness of the proposed algorithm, NAC-DD, we compare its performance against several standard policy gradient methods, including Vanilla Policy Gradient (PG) \citep{DBLP:conf/nips/SuttonMSM99}, Natural Policy Gradient (NPG) \citep{kakade2001natural}, TRPO \citep{DBLP:journals/corr/SchulmanLMJA15}, and PPO \citep{DBLP:journals/corr/SchulmanWDRK17}. For the baseline implementations, we utilized the open-source repository available at https://github.com/reinforcement-learning-kr/pg\_travel. The evaluation is conducted on three benchmark MuJoCo locomotion tasks: HalfCheetah-v3, Hopper-v3, and Walker2d-v3, all of which are continuous control problems.
The code for reproducing all results in this paper is available at https://github.com/LucasCJYSDL/NAC-DD.



We begin by evaluating the impact of the key hyperparameter of NAC-DD, namely the drop number $M$ in Algorithm \ref{alg:nacdd}. We set $M$ to 1, 3, and 5 and record the training progress of these variants on the MuJoCo tasks (Figure \ref{fig:1}). Each experiment is repeated three times with different random seeds; the means and 95\% confidence intervals are shown as solid lines and shaded areas, respectively. The results indicate that NAC-DD consistently achieves better performance when the drop number exceeds one. We attribute this improvement to the fact that dropping training samples helps mitigate the statistical dependency between samples from different time steps, in line with the theoretical requirements. The performance of the algorithm with drop numbers of three and five is comparable. However, we anticipate that a larger drop number could be more beneficial on control tasks that are more challenging than the MuJoCo benchmarks.

In Figure \ref{fig:2}, we position our algorithm by comparing it against standard policy gradient methods on the three MuJoCo tasks. The results show that natural-policy-gradient-based methods consistently outperform the vanilla policy gradient approach.\footnote{For practical implementation, we estimate the natural policy gradient, as defined in Equation \eqref{npg}, and update the actor accordingly using the conjugate gradient method combined with backtracking line search.} Additionally, actor-critic methods (i.e., NAC-DD, PPO, and TRPO) generally outperform pure policy gradient methods (i.e., NPG and PG). Notably, our algorithm achieves the best performance on two out of three tasks and ranks second on the third task. Thus, although this is a theory-focused paper whose algorithm is built on solid theoretical foundations, its strong practical performance on challenging continuous control tasks further demonstrates its effectiveness and applicability.

Finally, for completeness, Table \ref{table:comp-time} reports the training time (in hours) for each algorithm on the Hopper, HalfCheetah, and Walker2d benchmarks, using a single NVIDIA GeForce RTX 2080 Ti GPU. NAC-DD-5 requires more training time because it discards 80\% of the collected samples and uses only the remaining 20\% for training. To keep the total number of training samples comparable to that of the other methods, NAC-DD must therefore collect more transitions. However, this additional sampling can be parallelized using a vectorized environment setup. In terms of computation time for the policy and critic updates, NAC-DD is comparable to TRPO and PPO, as shown by the results of NAC-DD-1 relative to the other methods.
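
As a rough illustration of this sampling overhead, assume that a drop number of $M$ retains one out of every $M$ collected transitions; this is our reading of the 80\%/20\% split reported for NAC-DD-5, not a statement taken verbatim from Algorithm \ref{alg:nacdd}. Writing $N_{\mathrm{train}}$ and $N_{\mathrm{collect}}$ for the numbers of transitions used for training and collected from the environment, respectively (symbols introduced only for this illustration), we obtain
\[
\frac{N_{\mathrm{train}}}{N_{\mathrm{collect}}} = \frac{1}{M},
\qquad
N_{\mathrm{collect}} = M \, N_{\mathrm{train}} .
\]
Under this assumption, $M = 5$ retains $1/5 = 20\%$ of the collected samples and requires five times as many environment steps as $M = 1$ to reach the same number of training samples, while the number of samples entering the policy and critic updates is unchanged.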

