{
       "Semester": "Spring 2018",
       "Question Number": "6",
       "Part": "c.i",
       "Points": 1.0,
       "Topic": "Reinforcement Learning",
       "Type": "Text",
       "Question": "We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\\gamma=0.8$ and learning rate $\\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. (a) The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \\begin{array}{r} \\left(s_{0}, a_{2}, 0, s_{2}\\right) \\\\ \\left(s_{2}, a_{1}, 0, s_{3}\\right) \\\\ \\left(s_{3}, a_{1}, 0, s_{1}\\right) \\\\ \\left(s_{1}, a_{1}, 10, s_{0}\\right) \\\\ \\left(s_{0}, a_{1}, 0, s_{5}\\right) \\\\ \\left(s_{5}, a_{1}, 0, s_{4}\\right) \\\\ \\left(s_{4}, a_{1}, 5, s_{0}\\right) \\end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network nn has the following methods:\n- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.\n- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.\n- nn.init() randomly reassigns all the weights in the network.\nConsider the following code.\n# nn = dictionary of neural networks, one for each action\n# each nn[a] maps state s into Q(s, a)\ngamma = 0.8\nfor t in range(max_iterations):\n    for a in actions:\n        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]\n    for a in actions:\n        nn[a].train(data[a], <epochs>)\nWhat is a correct expression for <fill in> above?",
       "Solution": "r + 0.8 * max([nn[a_prime].predict(s_prime) for a_prime in actions])"
}