Universal reinforcement learning

Vivek F. Farias, Ciamac Cyrus Moallemi, Benjamin Van Roy, Tsachy Weissman

Published: 2010, Last Modified: 15 May 2023IEEE Trans. Inf. Theory 2010Readers: Everyone

Abstract: We consider an agent interacting with an unmodeled environment. At each time, the agent makes an observation, takes an action, and incurs a cost. Its actions can influence future observations and costs. The goal is to minimize the long-term average cost. We propose a novel algorithm, known as the active LZ algorithm, for optimal control based on ideas from the Lempel-Ziv scheme for universal data compression and prediction. We establish that, under the active LZ algorithm, if there exists an integer <i xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">K</i> such that the future is conditionally independent of the past given a window of <i xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">K</i> consecutive actions and observations, then the average cost converges to the optimum. Experimental results involving the game of Rock-Paper-Scissors illustrate merits of the algorithm.

0 Replies