POMRL: No-Regret Learning-to-Plan with Increasing Horizons

Khimya Khetarpal; Claire Vernade; Brendan O'Donoghue; Satinder Singh; Tom Zahavy

Published: 18 Jul 2023, Last Modified: 17 Sept 2024. Accepted by TMLR.

Authors that are also TMLR Expert Reviewers: ~Khimya_Khetarpal1

Abstract: We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying relatedness across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within a task. We generalize this finding to meta-RL and study the dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically.
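
The single-task observation in the abstract, that the planning horizon should depend on the accuracy of the estimated model, can be illustrated with a small tabular sketch. This is not the POMRL algorithm or the paper's code: the random MDP, the sample counts, and all function names (`value_iteration`, `policy_value`, `empirical_model`, `planning_loss`) are hypothetical choices made for illustration. The sketch plans in an empirical model at a planning discount `gamma_plan` and measures the resulting loss in the true model at the evaluation discount `gamma_eval`.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Greedy policy for transitions P[s, a, s'] and rewards R[s, a]."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * (P @ V)          # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    Q = R + gamma * (P @ V)
    return Q.argmax(axis=1)


def policy_value(P, R, policy, gamma):
    """Exact value of a deterministic policy, via the linear Bellman system."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]                # (S, S) transitions under the policy
    R_pi = R[idx, policy]                # (S,) rewards under the policy
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)


def empirical_model(P_true, n_samples, rng):
    """Maximum-likelihood transition estimate from n_samples draws per (s, a)."""
    S, A, _ = P_true.shape
    P_hat = np.zeros_like(P_true)
    for s in range(S):
        for a in range(A):
            P_hat[s, a] = rng.multinomial(n_samples, P_true[s, a]) / n_samples
    return P_hat


def planning_loss(P_true, P_hat, R, gamma_plan, gamma_eval):
    """Mean value gap, in the true model at gamma_eval, between the optimal
    policy and the policy planned in P_hat with discount gamma_plan."""
    pi_star = value_iteration(P_true, R, gamma_eval)
    pi_hat = value_iteration(P_hat, R, gamma_plan)
    v_star = policy_value(P_true, R, pi_star, gamma_eval)
    v_hat = policy_value(P_true, R, pi_hat, gamma_eval)
    return float(np.mean(v_star - v_hat))


# Hypothetical toy setup: a random MDP with 8 states and 3 actions.
rng = np.random.default_rng(0)
S, A, gamma_eval = 8, 3, 0.95
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
R = rng.uniform(size=(S, A))

for n in (2, 20, 200):
    P_hat = empirical_model(P_true, n, rng)
    losses = {g: planning_loss(P_true, P_hat, R, g, gamma_eval)
              for g in (0.3, 0.6, 0.95)}
    print(n, losses)
```

Sweeping `gamma_plan` for several sample sizes `n` is one way to see, on a given instance, how the best planning discount varies with model accuracy; the outcome on any single random MDP is noisy and should not be read as a reproduction of the paper's results.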

Certifications: Expert Certification

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: **Revised draft includes the following changes.**
1. As recommended **[R-C8FM]**, we have formally listed the set of assumptions in Section 3.1 and refer back to them as and when needed. We hope this improves the flow of the reading and overall comprehension. See Sec 3.1.
2. As suggested **[R-C8FM, R-pog9]**, we have added a real-world motivation for the problem setting. In many real-world scenarios such as robotics, an agent is required to be responsive to changes in the environment and, at the same time, to be robust against perturbations inherent in the environment and in its decision making. We have added this motivation in both the introduction and Section 3.1.1, where we formally define the structural assumption across tasks. Having said that, ours is a theoretical paper that cannot (over) claim to solve real-world tasks, so we try to distinguish between the motivation and the actual contributions.
3. As suggested **[R-C8FM]**, we have simplified Equation 4, including an explanation of the more complex terms that connects to the text already around it.
4. As suggested **[R-C8FM]**, we have clarified the term "underlying structure" at its first usage in the introduction. Moreover, to further improve comprehension, we have used the term task-'relatedness' interchangeably with task-'similarity', as opposed to structure, where feasible. By underlying structure, we refer to how the tasks are related to each other; more specifically, how the transition dynamics across tasks are related.
5. As suggested **[R-C8FM]**, to clarify Figure 1, we have added to the caption that the blue dots indicate each task $P^{t}$. In the leftmost figure, the diameter of the red circle represents the variance parameter $\sigma$, also known as the measure of task-similarity, centered at the mean $P^o$. The arrow simply points to the mean of a Gaussian meta-learned model. Please see the revised caption of Figure 1. Note that the aleatoric uncertainty on the transitions induced by each $P^t$ (that we upper bound by $v^2$ later) is not represented in this illustrative figure, as it is a simple notation that does not imply any further assumption (in fact, $v^2\leq 0.25$, so it could be replaced by a constant everywhere).
6. As suggested **[R-C8FM]**, we have addressed the minor edits including a) Section 2: consequently to define -> consequently we define; b) Section 4: comes comes -> comes; Section 4: will gives a -> will give a; c) Section 5: several places: dynamics model -> the dynamics model; section 5.1: estimator -> an estimator.
7. As recommended **[R-7X2a]**, we have added a concrete discussion of how our work might be extended to function approximation.

Code: https://github.com/kkhetarpal/pomrl

Supplementary Material: zip

Assigned Action Editor: ~Pascal_Poupart2

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 723
