Keywords: PAC-MDP, Information Theory, Unsupervised Skill Discovery
TL;DR: We present mathematical proofs and experimental results demonstrating that curiosity inherently drives optimization in reinforcement learning.
Abstract: PAC theory posits that larger hypothesis spaces require more independently and identically distributed (i.i.d.) data to maintain model accuracy. PAC-MDP theory defines curiosity by assigning higher rewards for visiting states that are far from the previously visited trajectory, which encourages the collection of more independent, i.i.d. data. Recently, this field has seen attempts to narrow the hypothesis space by adding mechanisms that train multiple skills and share information among them, thereby discovering commonalities. However, one might wonder: what if curiosity could not only improve the efficiency of data collection but also significantly reduce the hypothesis space, thereby driving optimal outcomes on its own, without the additional mechanisms used in PAC-MDP? Significant discussion has been devoted to reducing hypothesis spaces and to exploiting curiosity, and contrastive multi-skill reinforcement learning (RL) exhibits both traits. Previous research in contrastive multi-skill RL has used the technique primarily as a form of pretraining; however, there has been scant investigation into whether the technique itself can reduce the hypothesis space enough to yield optimal outcomes. We mathematically prove that curiosity provides bounds that guarantee optimality in contrastive multi-skill RL. We further leverage these findings to develop an algorithm applicable to real-world scenarios and show that it surpasses other prominent algorithms. Finally, our experiments show that individual skills indeed reduce the policy's hypothesis space by becoming hierarchically grouped.
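As a rough illustration of the curiosity notion described in the abstract (higher reward for states far from the previously visited trajectory), the following minimal Python sketch computes a nearest-neighbor distance bonus. The Euclidean metric, the scale coefficient, and the placeholder rollout loop are assumptions made for illustration only; this is not the paper's algorithm.

import numpy as np

def curiosity_bonus(state, visited_states, scale=1.0):
    # Distance-based curiosity: the bonus grows with the distance from the
    # new state to its nearest previously visited state.
    if len(visited_states) == 0:
        return scale  # convention: the first state receives the full bonus
    dists = np.linalg.norm(np.asarray(visited_states) - np.asarray(state), axis=1)
    return scale * float(dists.min())

# Illustrative rollout: the intrinsic bonus would be added to the task reward.
visited = []
state = np.zeros(4)                                # placeholder observation
for _ in range(10):
    next_state = state + 0.1 * np.random.randn(4)  # placeholder transition
    r_intrinsic = curiosity_bonus(next_state, visited)
    visited.append(next_state)
    state = next_state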
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2317