Keywords: td learning, temporal difference, adversarial training, value function approximation, combinatorial optimization, counter-example learning
Abstract: Temporal Difference (TD) learning is a powerful technique for training value functions in sequential decision-making tasks, but learned value functions often lack formal guarantees. We present \emph{Adversarially-Guided TD (AG-TD)}, which augments standard TD learning with a counter-example sampling strategy to produce provably valid lower bounds. Specifically, we propose a \emph{Challenger} module that periodically solves an auxiliary optimization problem to identify state-action pairs that maximally violate a one-sided Bellman inequality. These "hard" transitions are injected into the experience replay buffer with a priority-based scheme, focusing the network's updates on them. We train a value network $V_\theta$ with the one-sided loss $\mathcal{L}(s,a) = \big[\max\big(0,\; V_\theta(s) - (-c(s,a) + V_\theta(s'))\big)\big]^2$, enforcing $V_\theta(s) \le -c(s,a) + V_\theta(s')$.
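A minimal sketch of the one-sided loss from the abstract, assuming a PyTorch value network; the names `value_net`, `state`, `cost`, and `next_state` are illustrative, not from the paper:

```python
import torch

def one_sided_td_loss(value_net, state, cost, next_state):
    """Penalize violations of V(s) <= -c(s, a) + V(s')."""
    v_s = value_net(state)
    with torch.no_grad():
        # One-sided Bellman target; gradient flows only through V(s).
        target = -cost + value_net(next_state)
    # Only positive violations of the inequality contribute to the loss.
    violation = torch.clamp(v_s - target, min=0.0)
    return (violation ** 2).mean()
```

Because the loss is zero whenever the inequality holds, minimizing it pushes $V_\theta$ toward a valid lower bound rather than the exact value.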
Submission Number: 49