Adversarial Distributional Reinforcement Learning against Extrapolated Generalization (Extended Abstract)

Published: 20 Jul 2023, Last Modified: 29 Aug 2023, EWRL16
TL;DR: Prevent erroneous generalization in distributional RL using minimax adversarial training.
Abstract: Distributional reinforcement learning (DiRL) accounts for stochasticity in the environment by learning the full return distribution, and it has substantially improved performance by better differentiating between states and by improving policy evaluation during training. However, even though the environment is not assumed to be deterministic, the agent still only traverses a single possible path and therefore observes a single return backup as feedback during online learning. Effectively, DiRL learns the whole distribution from only one sample of it, relying substantially on inductive bias. This work uses adversarial training to alleviate catastrophic generalization from a similar-looking state whose behavioural consequence (under the current policy) is actually disparate, i.e., an attack. To do this, we first identify the set of attacks, states whose behavioural consequences are sufficiently dissimilar from those of the current state, and then pick the strongest one, which incurs the largest model distinguishability error: the smallest distance between predicted return distributions. Finally, we update the return distribution model by ascending the gradient of this minimal distance, effectively solving a minimax problem. To define attacks, we use a bisimulation metric to measure behavioural similarity. To measure the distance between predicted return distributions, which must be differentiable with respect to the return distribution model, we train a value discriminator that recognizes true Bellman backups from fake ones and use its contrastive score as a proxy. Experiments on MuJoCo environments suggest that the proposed method improves DiRL performance regardless of how the return distribution is modelled.
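Since the abstract gives only a high-level description, the following is a minimal sketch of one minimax update as described above, assuming a PyTorch return-distribution model, a value discriminator, a bisimulation-distance function, and a pool of candidate attack states. All names and signatures here are hypothetical placeholders, not the authors' released code.

```python
# Hypothetical sketch of one minimax adversarial update for DiRL.
import torch


def adversarial_update(return_model, discriminator, optimizer,
                       bisim_distance, state, candidate_states,
                       bellman_backup, dissimilarity_threshold):
    # 1) Attack set: candidate states whose behavioural consequence,
    #    measured by the bisimulation metric, is sufficiently far
    #    from that of the current state.
    attacks = [s for s in candidate_states
               if bisim_distance(state, s) > dissimilarity_threshold]
    if not attacks:
        return None

    # 2) Differentiable distance proxy between return distributions:
    #    the discriminator's contrastive score for telling the true
    #    Bellman backup apart from the prediction at the attack state.
    def dist(attack_state):
        z_attack = return_model(attack_state)        # predicted return distribution
        real = discriminator(state, bellman_backup)  # score of the true backup
        fake = discriminator(state, z_attack)        # score of the attack's prediction
        return real - fake

    # 3) Inner minimisation: the strongest attack is the one with the
    #    smallest distance (largest model-distinguishability error).
    min_dist = torch.stack([dist(s) for s in attacks]).min()

    # 4) Outer maximisation: ascend the gradient of this minimal
    #    distance with respect to the return-distribution model.
    optimizer.zero_grad()
    (-min_dist).backward()
    optimizer.step()
    return float(min_dist)
```

In this sketch the optimizer is assumed to hold only the return-distribution model's parameters; the discriminator would be trained separately to distinguish true Bellman backups from the model's predictions, so that its contrastive score can serve as the differentiable distance used here.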