Abstract: Recent work by~\cite{asadi2023td} gave an optimization view of TD learning with target networks and showed stability under a force-dominance condition, but their linear-rate analysis relies on global smoothness (a uniform curvature bound). This assumption can fail even when the inner problem is well posed, since curvature encountered during training can grow with the scale of TD residual–induced gradients. We retain the stabilized regime from prior theory—strong convexity in the inner variable—to isolate upper-curvature growth effects. Under generalized smoothness, where the Hessian norm may grow with gradient scale via a nondecreasing profile $\ell(\cdot)$, we analyze the inexact TD recursion with $K$ inner gradient steps per target refresh and propose a curvature-checked constant stepsize rule that ensures global stability without a global smoothness constant. Our main result proves global linear convergence under force dominance with a single trajectory-dependent admissibility requirement governed by the maximum gradient magnitude $M$ encountered along the run. This yields an explicit scaling law: the largest admissible constant stepsize decays as $1/\ell(cM)$ (for a universal constant $c$), and maintaining a fixed contraction requires $K$ to grow proportionally to $\ell(cM)$. In the uniformly smooth case we recover~\cite{asadi2023td}, while under curvature growth the worst trajectory gradient scale controls both stability and attainable convergence speed, aligning with step-control heuristics used in reinforcement learning (RL).
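To make the recursion in the abstract concrete, the following is a minimal sketch of $K$ inner gradient steps per target refresh with a curvature-checked stepsize $\eta = c_0/\ell(cM)$, where $M$ is the running maximum gradient magnitude along the run. The quadratic TD-style inner problem, the linear profile $\ell(x)=1+x$, and all constants here are illustrative assumptions, not the paper's actual setting or implementation.

```python
import numpy as np

def ell(x):
    # Assumed nondecreasing curvature profile (hypothetical choice for
    # illustration): Hessian scale allowed to grow linearly with gradient scale.
    return 1.0 + x

def curvature_checked_td(A, b, gamma=0.9, K=20, T=50, c=1.0, c0=0.5):
    """Inexact TD recursion with K inner gradient steps per target refresh.

    Toy inner problem (assumed): given target parameters theta_bar, minimize
    f(theta) = 0.5 * ||A theta - (b + gamma * A theta_bar)||^2,
    which is strongly convex in theta when A has full column rank.
    """
    n = A.shape[1]
    theta = np.zeros(n)
    theta_bar = theta.copy()
    M = 0.0  # maximum gradient magnitude encountered along the trajectory
    for _ in range(T):
        target = b + gamma * (A @ theta_bar)  # frozen target
        for _ in range(K):
            grad = A.T @ (A @ theta - target)
            M = max(M, np.linalg.norm(grad))
            eta = c0 / ell(c * M)  # curvature-checked stepsize: shrinks as 1/ell(cM)
            theta = theta - eta * grad
        theta_bar = theta.copy()  # target refresh
    return theta
```

For example, with $A = I$ the refresh fixed point solves $\theta = b + \gamma\theta$, i.e. $\theta^\star = b/(1-\gamma)$, and the iterate above approaches it while the stepsize stays clamped by the largest gradient seen so far.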
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewer for the careful reading and constructive feedback. In response, we have revised the manuscript substantially to clarify scope, strengthen the exposition, and add targeted empirical support. In particular, we made the following changes:
$\textbf{Reviewer Wh3n}$
$\textbf{1 - Toned down the deep-RL applicability claim.}$ We revised the manuscript wherever the connection to modern deep RL could be read too broadly. In particular, we no longer frame the paper as establishing direct applicability to generic deep RL. Instead, we now describe the contribution more precisely as \emph{a mechanism-level analysis of why curvature-aware step control can matter in stabilized TD-style optimization under non-uniformly bounded Hessians}.
$\textbf{2 - Added explicit examples showing that the assumption class is non-empty.}$ To address the concern that the generalized-smoothness setting might appear abstract, we added explicit TD-structured examples in Appendix~A.2. These examples show constructively that the class of problems covered by our theory is genuinely populated, and they make the role of the assumptions and the resulting stability law fully explicit.
$\textbf{3 - Added theorem-aligned experiments.}$ We added a new toy experimental section in Appendix~A.3 and a real-data experimental section in Appendix~A.4. These experiments are designed as targeted sanity checks of the stability mechanism predicted by the theory: they illustrate that aggressive fixed steps can become unstable, while the proposed curvature-checked rule remains stable by adapting to the local safe scale.
$\textbf{4 - Expanded the discussion of estimating the curvature profile } \ell(\cdot)\textbf{.}$ We clarified that practical estimation of $\ell(\cdot)$ is an important but separate problem from the convergence analysis carried out in this paper. At the same time, we do not leave the profile entirely abstract: in Appendix~A.2, we make the generalized-smoothness profile concrete in analytically tractable TD-style examples by deriving explicit nondecreasing curvature profiles and showing how the curvature-checked stepsize can be instantiated directly from them.
$\textbf{Reviewer PGFA}$
$\textbf{1 - Fixed Abstract:}$ Revised the abstract and made it self-contained.
$\textbf{2 - Notations:}$ Added a brief note acknowledging the more standard RL notation (using a single variable with a minus superscript for the target, $\theta^-$) to improve readability for a broader RL audience.
$\textbf{3 - Language Correction:}$ Corrected the language in the introduction regarding the relationship between RL and AI.
$\textbf{4 - Added Explicit Examples:}$ To address the concern that the generalized-smoothness setting might appear abstract, we added explicit TD-structured examples in Appendix~A.2. These examples show constructively that the class of problems covered by our theory is genuinely populated, and they make the role of the assumptions and the resulting stability law fully explicit.
Assigned Action Editor: Nicolas_Loizou1
Submission Number: 7780