Summary: The paper presents a simple hierarchical approach for non-Markovian imitation learning along with a rigorous theoretical analysis.
The method trains a high-level policy, given by a Denoising Diffusion Probabilistic Model (DDPM), to predict a sequence of linear control gains based on a state-action history.
The DDPM is trained in a supervised fashion (using the standard DDPM loss), where the labels (the controller gains) are computed from the given expert trajectories using an oracle (e.g., based on linearizations along the trajectory). To address covariate shift (compounding errors) in imitation learning, the method applies a form of data augmentation by adding noise to the state-action history. In contrast to prior work, these perturbations are also applied during inference, which is theoretically motivated and empirically shown to improve imitation performance.
The main contribution seems to be the thorough analysis of this method, which leads to Theorem 1, a bound on the imitation performance.
The applicability of the proposed method is demonstrated on simulated robot tasks, although no comparisons with prior work are provided.
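To make my reading of the perturbation scheme concrete, here is a minimal sketch of how I understand it, not the authors' implementation. It assumes isotropic Gaussian noise with a single scale `sigma` used both in training and at inference; the function names, the array shapes, and the stand-in `policy` callable are hypothetical and only illustrate the idea.

```python
import numpy as np

def perturb_history(history, sigma, rng):
    """Add isotropic Gaussian noise of scale sigma to a state-action history.

    Assumption: the noise is Gaussian and i.i.d. per entry; the paper may
    use a different noise distribution.
    """
    history = np.asarray(history, dtype=float)
    return history + sigma * rng.standard_normal(history.shape)

def make_training_pair(history, gains, sigma, rng):
    """Training: condition on a noised history, keep the oracle gains as the label."""
    return perturb_history(history, sigma, rng), np.asarray(gains, dtype=float)

def act(policy, history, sigma, rng):
    """Inference: noise the observed history with the same sigma, then query the policy.

    `policy` is a hypothetical stand-in for sampling from the trained DDPM.
    """
    return policy(perturb_history(history, sigma, rng))
```

The only point of the sketch is that `perturb_history` is called on both the training path and the inference path, which is the departure from prior work as I understand it.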
Soundness: 3 good
Presentation: 2 fair
Contribution: 3 good
Strengths: Novelty & Relevance
-------------------
Both the method and, in particular, the analysis seem to be novel, and non-Markovian imitation learning is an important field of research. I would not expect the proposed method to outperform other (theoretically less well understood) methods, but I do think that the general approach of using a diffusion model as a high-level policy to predict controller gains could be used by future work that is more focused on practical performance. Furthermore, the theoretical analysis could be valuable for theorists in the field of imitation learning. However, I am not very familiar with related theoretical work and cannot assess well which aspects or techniques of the analysis might be most relevant.
Soundness
---------
The algorithm seems sound. One could argue that it might be simpler and more stable if the higher-level policy predicted a trajectory segment instead of the gains, given that we assume access to an oracle that can provide us with stabilizing gains. However, querying the oracle could be too costly, so directly predicting the gains can be sensible.
I cannot fully confirm the accuracy of the theoretical results because verifying the complete derivations would take significant effort. However, the claims (including Theorem 1) seem reasonable, apart from non-critical issues (see Questions below).
Clarity
-------
The paper is well written. It is very strict in terms of notation, definitions, and statements, which increases clarity by reducing ambiguities.
Weaknesses: Clarity
-------
Although I mentioned the rigor as a strength, it also feels a bit pedantic. Furthermore, keeping track of the different constants and definitions is cumbersome, which makes the paper hard to read and makes it easy to get lost in details.
Evaluation
----------
Although the contributions are mainly in the theoretical analysis, quantitative comparisons with alternative methods that have been used in these environments would be useful. I do not know which prior work on these environments also released code, but I assume it should be possible to find suitable baselines that can be easily tested.
Questions: Line 216 states that "each term in (3.2) can be made arbitrarily small by decreasing the amount of noise $\sigma$ in the augmentation [...]". Isn't the first term inversely proportional to the noise?
Line 274: "all sequences $s'_1 = s_1$ and $s'_{h+1} = s_{h+1}$ [...]". I'm wondering if this should be "all sequences $s'_h = s_h$ and $s'_{h+1} = s_{h+1}$ [...]"?
Line 325: "We let the expert policy $\pi^*$ be the concatenation of policies $\pi_h^*$ [...]". Is this an additional assumption? Clearly, the expert will in general not apply this particular form of a hierarchical policy.
Limitations: I think that the limitations are sufficiently clear.
Flag For Ethics Review: No ethics review needed.
Rating: 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Code Of Conduct: Yes