{
       "Semester": "Spring 2018",
       "Question Number": "6",
       "Part": "d",
       "Points": 2.0,
       "Topic": "Reinforcement Learning",
       "Type": "Text",
       "Question": "We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\\gamma=0.8$ and learning rate $\\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. (a) The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \\begin{array}{r} \\left(s_{0}, a_{2}, 0, s_{2}\\right) \\\\ \\left(s_{2}, a_{1}, 0, s_{3}\\right) \\\\ \\left(s_{3}, a_{1}, 0, s_{1}\\right) \\\\ \\left(s_{1}, a_{1}, 10, s_{0}\\right) \\\\ \\left(s_{0}, a_{1}, 0, s_{5}\\right) \\\\ \\left(s_{5}, a_{1}, 0, s_{4}\\right) \\\\ \\left(s_{4}, a_{1}, 5, s_{0}\\right) \\end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network nn has the following methods:\n- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.\n- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.\n- nn.init() randomly reassigns all the weights in the network.\nConsider the following code.\n# nn = dictionary of neural networks, one for each action\n# each nn[a] maps state s into Q(s, a)\ngamma = 0.8\nfor t in range(max_iterations):\n    for a in actions:\n        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]\n    for a in actions:\n        nn[a].train(data[a], <epochs>)\nIf we change the loop to have the form\nfor t in range(max_iterations):\n    for (s, a, r, s_prime) in memory:\n        data = [(s, <fill in>)]  # a single data point\n        nn[a].train(data, <epochs>)\nprovide a value for <epochs> above that will cause this algorithm to converge to a correct solution, or explain why no such value exists.",
       "Solution": "1. With 1 epoch, we look at every piece of experience in memory once per iteration."
}