Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Goal-Conditioning, Deep Reinforcement Learning, State Space Search
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We incorporate bidirectional RL (from start to goal, and from goal to start) with goal conditioning so that a single policy function can solve multiple tasks.
Abstract: State space search problems have a binary (found/not found) reward structure. In practice, these problems often have a vast number of states but only a limited number of goal states, which makes the reward signal for the search task very sparse. Goal-Conditioned Reinforcement Learning (GCRL), on the other hand, can train a single agent to solve multiple related tasks. In our work, we assume the ability to sample goal states and use them to define a forward task (τ*) and a reverse task (τ^inv) derived from the original state space search task, yielding more useful and learnable samples. We adopt the Universal Value Function Approximator (UVFA) setting with a GCRL agent to learn from these samples. We incorporate hindsight relabelling with goal conditioning in the forward task to reach goals sampled from τ*, and analogously define 'Foresight' for the backward task. We also exploit the agent's ability (via the policy function) to reach intermediate states, using these states as goals for new sub-tasks. Further, to handle the reverse transitions arising from backward trajectories, we spawn new instances of the agent from states along these trajectories to collect forward transitions, which are then used to train for the main task τ*. We consolidate these tasks and sample-generation strategies into a three-part system called Scrambler-Resolver-Explorer (SRE). We also propose the 'SRE-DQN' agent, which combines our exploration module with the popular DQN algorithm. Finally, we demonstrate the advantages of bi-directional goal conditioning and knowledge of the goal state by evaluating our framework on classical goal-reaching tasks and comparing with existing methods extended to our bi-directional setting.
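To make the goal-conditioning and hindsight-relabelling ideas in the abstract concrete, the following is a minimal, self-contained sketch. It is an illustrative assumption rather than the authors' SRE-DQN or their exact forward/backward procedure: tabular goal-conditioned Q-learning on a small chain environment, where one value table Q(s, g, a) is trained for all goals and failed episodes are relabelled with the achieved final state as the goal.

```python
# Minimal sketch (assumption, not the authors' SRE-DQN): tabular goal-conditioned
# Q-learning with hindsight relabelling on a 1-D chain, illustrating how a single
# value function Q(s, g, a) can serve many goals and how unsuccessful episodes are
# relabelled with achieved states so the sparse reward still provides learning signal.
import numpy as np

N_STATES = 10          # states 0..9 on a chain
ACTIONS = [-1, +1]     # move left / move right
GAMMA, ALPHA, EPS = 0.95, 0.5, 0.2

# Universal value function: one table indexed by (state, goal, action).
Q = np.zeros((N_STATES, N_STATES, len(ACTIONS)))

def step(s, a):
    # Deterministic chain dynamics, clipped at the boundaries.
    return int(np.clip(s + ACTIONS[a], 0, N_STATES - 1))

def q_update(s, g, a, r, s_next, done):
    # Standard one-step Q-learning target, conditioned on the goal g.
    target = r if done else r + GAMMA * Q[s_next, g].max()
    Q[s, g, a] += ALPHA * (target - Q[s, g, a])

rng = np.random.default_rng(0)
for episode in range(2000):
    s, g = rng.integers(N_STATES), rng.integers(N_STATES)   # sample a start and a goal
    trajectory = []
    for t in range(20):
        a = rng.integers(len(ACTIONS)) if rng.random() < EPS else int(Q[s, g].argmax())
        s_next = step(s, a)
        done = (s_next == g)
        trajectory.append((s, a, s_next))
        q_update(s, g, a, float(done), s_next, done)         # sparse reward: 1 only at the goal
        s = s_next
        if done:
            break
    # Hindsight relabelling: treat the final achieved state as if it had been the goal,
    # so even episodes that never reach g yield learnable transitions.
    g_hindsight = trajectory[-1][2]
    for (s_t, a_t, s_tp1) in trajectory:
        done_h = (s_tp1 == g_hindsight)
        q_update(s_t, g_hindsight, a_t, float(done_h), s_tp1, done_h)

# After training, the same Q-table reaches arbitrary goals from arbitrary starts.
print(int(Q[0, 9].argmax()))   # expected: 1 (move right toward goal 9)
```

In the UVFA/DQN setting described in the abstract, the table above would be replaced by a neural network taking the (state, goal) pair as input; the relabelling logic is unchanged.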
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6710