Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning
Keywords: Self-supervised RL, Goal-conditioned RL, Contrastive RL, Direction Conditioning, Online RL, Compositional RL
TL;DR: Direction-Conditioned Policies (DCP) is an online goal-conditioned RL method in which the actor consumes a learned direction in InfoNCE representation space; it shows consistent gains over Contrastive RL across nine simulated navigation and manipulation tasks.
Abstract: Hamilton-Jacobi-Bellman (HJB) theory suggests that optimal goal-conditioned actions depend on the goal only through the gradient of the goal-reaching distance (the value gradient) at the current state. Yet standard Goal-Conditioned Reinforcement Learning (GCRL) agents are typically conditioned on raw goals, signals that are geometrically uninformative when the goal is distant. We introduce Direction-Conditioned Policies (DCP), a complete online method that bridges this gap by decomposing goal-reaching into waypoint selection and direction-conditioned execution within a shared InfoNCE representation space. The two components are trained jointly but factor cleanly at deployment (the scoring stage is dropped and only direction conditioning remains), and each can be modified independently at the same interface. We theoretically establish (a) the sufficiency of directional signals under HJB, where the optimal action depends on the goal only through the value gradient; (b) a quantitative bound showing that, under mild conditions, the actor's input distributions during training (on-path waypoints) and deployment (final goals) differ by at most a term controlled by representation error and geodesic slack; and (c) a controllable-subspace characterisation of when directional conditioning fails. DCP consistently outperforms Contrastive RL across nine navigation and manipulation environments, with the largest gains on the hardest manipulation tasks. Qualitative analysis of the learned-distance landscape shows that the contrastive representation acts as an online quasimetric encoding environment topology, while the sole failure case (AntSoccer) localises to a learned gradient pathology, as predicted by our theory.
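The waypoint-selection and direction-conditioning split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the embeddings are hand-written 2-D vectors rather than learned InfoNCE representations, Euclidean distance stands in for the learned contrastive distance, and the additive scoring rule `d(s, w) + d(w, g)` as well as the helper names `select_waypoint` and `direction` are assumptions introduced purely for illustration.

```python
import numpy as np

def unit(v, eps=1e-8):
    """Scale a vector to (approximately) unit length; eps avoids division by zero."""
    return v / (np.linalg.norm(v) + eps)

def select_waypoint(z_state, z_goal, z_candidates):
    """Return the index of the candidate w minimising d(s, w) + d(w, g).

    ASSUMPTION: Euclidean distance between embeddings replaces the learned
    contrastive distance, and this additive score is a hypothetical stand-in
    for the paper's compositional subgoal scoring.
    """
    d_sw = np.linalg.norm(z_candidates - z_state, axis=1)
    d_wg = np.linalg.norm(z_candidates - z_goal, axis=1)
    return int(np.argmin(d_sw + d_wg))

def direction(z_state, z_target):
    """Unit direction from the state embedding toward a target embedding.

    ASSUMPTION: the actor is taken to consume this vector in place of the
    raw goal; the abstract does not give the exact form of the input.
    """
    return unit(z_target - z_state)

# Toy embeddings (hypothetical values, not learned representations).
z_state = np.array([0.0, 0.0])
z_goal = np.array([4.0, 0.0])
z_candidates = np.array([
    [2.0, 0.0],   # lies on the straight path from state to goal
    [2.0, 3.0],   # off-path detour
    [-1.0, 0.0],  # behind the state
])

best = select_waypoint(z_state, z_goal, z_candidates)
train_input = direction(z_state, z_candidates[best])  # training: on-path waypoint
deploy_input = direction(z_state, z_goal)             # deployment: final goal
```

In this toy geometry the selected waypoint lies on the path to the goal, so the direction toward the waypoint and the direction toward the final goal agree. That is the intuition behind claim (b): with an exact representation and no geodesic slack, the actor sees the same input at training and deployment time.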
Submission Number: 173