{
       "Semester": "Spring 2018",
       "Question Number": "6",
       "Part": "b",
       "Points": 2.0,
       "Topic": "Reinforcement Learning",
       "Type": "Text",
       "Question": "We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\\gamma=0.8$ and learning rate $\\alpha=1$. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \\begin{array}{r} \\left(s_{0}, a_{2}, 0, s_{2}\\right) \\\\ \\left(s_{2}, a_{1}, 0, s_{3}\\right) \\\\ \\left(s_{3}, a_{1}, 0, s_{1}\\right) \\\\ \\left(s_{1}, a_{1}, 10, s_{0}\\right) \\\\ \\left(s_{0}, a_{1}, 0, s_{5}\\right) \\\\ \\left(s_{5}, a_{1}, 0, s_{4}\\right) \\\\ \\left(s_{4}, a_{1}, 5, s_{0}\\right) \\end{array} $$. We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\\gamma=0.8$ and learning rate $\\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question $5 .$ (a) The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \\begin{array}{r} \\left(s_{0}, a_{2}, 0, s_{2}\\right) \\\\ \\left(s_{2}, a_{1}, 0, s_{3}\\right) \\\\ \\left(s_{3}, a_{1}, 0, s_{1}\\right) \\\\ \\left(s_{1}, a_{1}, 10, s_{0}\\right) \\\\ \\left(s_{0}, a_{1}, 0, s_{5}\\right) \\\\ \\left(s_{5}, a_{1}, 0, s_{4}\\right) \\\\ \\left(s_{4}, a_{1}, 5, s_{0}\\right) \\end{array} $$.  Iyaz suggests that, rather than getting new experience, it would be a good idea to replay this data over several times using the regular Q-learning update. What's the minimum number of times you would have to iterate through this data before $Q\\left(s_{0}, a_{2}\\right)>Q\\left(s_{0}, a_{1}\\right.$ ? Note: it should be possible to answer this question by thinking about the structure of the problem, rather than by grinding through more Q-learning update calculations.",
       "Solution": "4, including the first update whose values we recorded in the table)."
}