Keywords: td learning, temporal difference, adversarial training, value function approximation, combinatorial optimization, counter-example learning
Abstract: Temporal Difference (TD) learning is a powerful technique for training value functions in sequential decision-making tasks, but learned value functions often lack formal guarantees. We present \emph{Adversarially-Guided TD (AG-TD)}, which augments standard TD learning with a counter-example sampling strategy to produce provably valid lower bounds. Our approach retains the familiar TD update while adversarially selecting challenging transitions. Specifically, a \emph{Challenger} module periodically solves an auxiliary optimization problem to identify state-action pairs that maximally violate a one-sided Bellman inequality. These “hard” transitions are injected into the experience replay buffer with high priority, so that the network focuses its updates on them. We train a value network $V_\theta$ (e.g., a Graph Neural Network) with the one-sided loss $\mathcal{L}(s,a) = \big[\max\big(0,\, V_\theta(s) - (-c(s,a) + V_\theta(s'))\big)\big]^2$, which enforces $V_\theta(s) \le -c(s,a) + V_\theta(s')$. Our main contribution is an empirically practical and theoretically motivated framework that improves the generalization of value bounds. In experiments on routing problems, our TD+CER algorithm achieves near-zero violation of optimal costs on both training instances and larger test instances, whereas standard TD quickly overestimates once instance sizes exceed those seen in training. AG-TD thus provides a practical way to train value functions whose bounds remain provably valid under distribution shift.
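For illustration, a minimal PyTorch-style sketch of the one-sided loss described in the abstract; the function name, tensor shapes, and the stop-gradient on the bootstrapped target (semi-gradient TD) are assumptions for this sketch, not details stated by the authors.

import torch

def one_sided_td_loss(v_s: torch.Tensor, cost: torch.Tensor, v_s_next: torch.Tensor) -> torch.Tensor:
    """Hinge-squared penalty for violations of V(s) <= -c(s,a) + V(s')."""
    # Bootstrapped one-step target; detached so only V(s) receives gradient (assumed semi-gradient update).
    target = (-cost + v_s_next).detach()
    # Nonzero only where the lower-bound inequality is violated.
    violation = torch.clamp(v_s - target, min=0.0)
    return (violation ** 2).mean()

# Hypothetical usage on a batch of prioritized transitions:
#   loss = one_sided_td_loss(V_theta(states), costs, V_theta(next_states))
#   loss.backward()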
Submission Number: 49