## POMRL: No-Regret Learning-to-Plan with Increasing Horizons

### Khimya Khetarpal, Claire Vernade, Brendan O'Donoghue, Satinder Singh, Tom Zahavy

Changes Since Last Submission: **Revised draft includes the following changes.**

1. As recommended **[R-C8FM]**, we now state our assumptions formally in Section 3.1 and refer back to them where needed. We hope this improves the flow of the paper and overall comprehension. See Section 3.1.
2. As suggested **[R-C8FM, R-pog9]**, we have added a real-world motivation for the problem setting. In many real-world scenarios, such as robotics, an agent must be responsive to changes in the environment and, at the same time, robust against perturbations inherent in the environment and in its own decision making. We have added this motivation to both the introduction and Section 3.1.1, where we formally define the structural assumption across tasks. That said, ours is a theoretical paper that cannot (over)claim to solve real-world tasks, so we try to distinguish between the motivation and the actual contributions.
3. As suggested **[R-C8FM]**, we have simplified Equation 4 and added an explanation of its more complex terms, connecting them to the surrounding text.
4. As suggested **[R-C8FM]**, we have clarified the term "underlying structure" at its first usage in the introduction. By underlying structure, we refer to how the tasks are related to each other and, more specifically, how the transition dynamics across tasks are related. To further improve comprehension, we use task "relatedness" interchangeably with task "similarity" in place of "structure" where feasible.
5. As suggested **[R-C8FM]**, to clarify Figure 1, we have added to the caption that the blue dots indicate each task $P^{t}$. In the leftmost figure, the diameter of the red circle represents the variance parameter $\sigma$, also known as the measure of task-similarity, centered at the mean $P^o$. The arrow simply points to the mean of a Gaussian meta-learned model. Please see the revised caption of Figure 1.

   Note that the aleatoric uncertainty on the transitions induced by each $P^t$ (which we later upper bound by $v^2$) is not represented in this illustrative figure. It is only a notation and does not imply any further assumption (in fact, $v^2 \leq 0.25$, so it could be replaced by a constant everywhere).
6. As suggested **[R-C8FM]**, we have addressed the minor edits, including:
   - a) Section 2: "consequently to define" -> "consequently we define";
   - b) Section 4: "comes comes" -> "comes"; Section 4: "will gives a" -> "will give a";
   - c) Section 5, several places: "dynamics model" -> "the dynamics model"; Section 5.1: "estimator" -> "an estimator".
7. As recommended **[R-7X2a]**, we have added a concrete discussion of how our work might be extended to function approximation.
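
The bound $v^2 \leq 0.25$ quoted in item 5 follows from a standard fact. The sketch below assumes (this is not stated in the list above) that $v^2$ upper-bounds the variance of a $[0,1]$-valued quantity, such as the indicator of a transition to a given next state. The symbols $X$ and $p$ are introduced here only for illustration:

$$
\mathrm{Var}(X) = p(1-p) = \tfrac{1}{4} - \left(p - \tfrac{1}{2}\right)^2 \leq \tfrac{1}{4},
\qquad X \sim \mathrm{Bernoulli}(p),\; p \in [0,1].
$$

More generally, any random variable supported on $[0,1]$ has variance at most $1/4$ (Popoviciu's inequality), which is why such a term can be replaced by a constant.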