Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: knowledge distillation, evolutionary strategies, reinforcement learning, dataset distillation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We train RL policies by evolving a small synthetic dataset of state-action pairs and doing behaviour cloning on the dataset.
Abstract: Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable.
Filling the gap, we formalize $\textit{behaviour distillation}$, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, $\textit{without access to expert data}$.
We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of $\textit{just four}$ state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks.
We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion.
Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Primary Area: reinforcement learning
Submission Number: 6255
Loading