Making a game out of exploration-exploitation

TMLR Paper1070 Authors

18 Apr 2023 (modified: 25 Jul 2023). Rejected by TMLR
Abstract: What is the best way for an agent to balance exploration with exploitation? In this paper we suggest an answer that treats exploration and exploitation as independent players competing to maximize a joint objective. Through theory and simulations we show how a "game" played between two deterministic policies, one maximizing intrinsic curiosity and one maximizing extrinsic environmental rewards, yields a simple maximum value solution over both policies. The key assumption that allows for this is that curiosity and reward seeking are equally valuable in evolutionary terms. We start by developing an axiomatic approach to defining information value that generalizes past approaches while simplifying the estimation of such value in both artificial and biological memory systems. We then show that our deterministic solution performs at least as well as standard stochastic explore-exploit algorithms, with the added benefits of being far more resilient to deceptive rewards (i.e., local minima), more efficient in high-dimensional action contexts, and robust to hyperparameter choices. Thus, the solution to our version of the gamified exploration-exploitation problem can be summarized by a simple heuristic: when the expected value of information is more than the expected value of rewards, be curious; otherwise seek rewards.
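The closing heuristic of the abstract can be sketched in a few lines of Python. This is only an illustration of the stated decision rule, not the authors' implementation: the function name `choose_mode` and its two arguments are hypothetical, and how the expected values of information and reward are estimated is not described on this page, so they are simply passed in as numbers.

```python
def choose_mode(expected_info_value: float, expected_reward_value: float) -> str:
    """Pick which policy acts next under the abstract's heuristic.

    Both arguments are assumed to be estimates supplied by the caller;
    this sketch does not say how to compute them.
    """
    # Be curious only when information is expected to be worth strictly more.
    if expected_info_value > expected_reward_value:
        return "curiosity"
    # "Otherwise seek rewards": this branch also covers ties.
    return "reward"
```

For example, `choose_mode(0.5, 0.2)` selects the curiosity policy, while equal values fall through to reward seeking because the rule as stated only favors curiosity when information is worth more.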
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=a23CrwabJh&referrer=%5Bthe%20profile%20of%20Erik%20J%20Peterson%5D(%2Fprofile%3Fid%3D~Erik_J_Peterson1)
Changes Since Last Submission: Among other improvements, in response to the reviewers' comments and requests we have:
- Rewritten the introduction to better tie it together with the rest of the paper.
- Focused on making connections clearer throughout the draft.
- Addressed several technical criticisms and typos in the mathematical formalism.
- Tied the information collection and axioms closer to the rest of the paper, both in terms of their motivation in the introduction and in their own sections.
- Better described our motivations for using a WSLS rule and the full formalism we first developed.
- Discussed our conjectures, their motivation, and the possibility of proving them, and greatly improved our overall motivation for studying curiosity and reward as independent but competing motivations for artificial agents.
Assigned Action Editor: ~Josh_Merel1
Submission Number: 1070