Quasipseudometric Value Functions with Dense Rewards

TMLR Paper 4035 Authors

22 Jan 2025 (modified: 11 Apr 2025) · Rejected by TMLR · CC BY 4.0
Abstract: As a generalization of reinforcement learning (RL) to parametrizable goals, goal-conditioned RL (GCRL) has a broad range of applications, particularly in challenging robotics tasks. Recent work has established that the optimal value function of GCRL, Q∗(s, a, g), has a quasipseudometric structure, leading to targeted neural architectures that respect this structure. However, the relevant analyses assume a sparse reward setting—a known aggravating factor for sample complexity. We show that the key property underpinning a quasipseudometric, viz., the triangle inequality, is preserved under a dense reward setting as well, and we identify the specific condition necessary for the triangle inequality to hold. Contrary to earlier findings where dense rewards were shown to be detrimental to GCRL, we conjecture that dense reward functions satisfying this condition can only improve, never worsen, sample complexity. We evaluate this proposal in 12 standard GCRL benchmark environments featuring challenging continuous control tasks. Our empirical results confirm that training a quasipseudometric value function in our dense reward setting indeed either improves upon, or preserves, the sample complexity of training with sparse rewards. This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity.
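For context on the abstract's central claim: a quasipseudometric is a distance-like function that keeps the triangle inequality but drops the symmetry of a metric. A minimal statement of the axioms, in illustrative notation of our own (the paper's own definitions appear in its Sec. 2.4):

$$d(x, x) = 0, \qquad d(x, z) \le d(x, y) + d(y, z) \quad \text{for all } x, y, z,$$

with no requirement that $d(x, y) = d(y, x)$, nor that $d(x, y) = 0$ imply $x = y$. The lack of symmetry is natural in GCRL, where reaching a goal $g$ from a state $s$ can be harder than the reverse.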
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: All requested changes are marked in red. Some are summarized here; a complete list of changes is included in the responses to individual reviewers.

We have added more context and motivation throughout the paper as advised, and we now introduce terms before using them, e.g., policy in Sec. 2.1, 'standard potential-based shaping rewards' around Eq. 6, potential, etc. We have added more explanation in Sec. 3.1 about Eq. 6 and the arc-cosine distance, and pseudocode in Algorithm 1 with references in Secs. 3.1.4 and 5; as it is fairly self-explanatory, we have not added a "Methods" section to explain it.

Hypothesis 1 was indeed malformed and has now been edited. For Hypothesis 2, we have added two additional potential functions: (1) $\phi$ with a different distance measure, viz., the Euclidean distance $d_E$ (introduced in Sec. 3.1); (2) a direct potential $\phi_{\text{direct}}(s,a,g)=\|M(s,a)-g\|$ that does not use a distance measure $d$. These are now discussed at the end of Hypothesis 2; the results have been added to Fig. 3, and a new table (Tab. 1) is included to further support Hypothesis 2.

We have added an intuitive explanation of the metric residual network in Sec. 2.3 and a new Sec. 2.4 explaining the quasipseudometric property. Sec. 1 now clarifies how promoting reward shaping / a dense reward setting could be beneficial specifically in the context of GCRL. The notation L* is now fully specified and $x_{<t}$ is explained; the expressions $F, \phi, d, Q^*_F$ are explained in more detail in Sec. 3.1. The MDPs/tasks studied in the experiments are specified in (Plappert et al., 2018), which we avoid repeating to save space. Our parameter choices and experimental design are specified below Hypothesis 2 in Sec. 5, where we have added further clarifications.
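For readers skimming this summary, 'standard potential-based shaping rewards' refers to the well-known construction of Ng et al. (1999); the paper's Eq. 6 and the mapping $M(s, a)$ are defined there, so the following is only a generic reminder shown alongside the direct potential quoted above, not the paper's exact formulation:

$$F(s, a, s') = \gamma\,\Phi(s') - \Phi(s), \qquad \phi_{\text{direct}}(s, a, g) = \|M(s, a) - g\|.$$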
Assigned Action Editor: ~Goran_Radanovic1
Submission Number: 4035