{
       "Semester": "Fall 2019",
       "Question Number": "2",
       "Part": "b",
       "Points": 3.6,
       "Topic": "Reinforcement Learning",
       "Type": "Image",
       "Question": "You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.\n(image here)\nYour model for the state and action spaces is as follows:\n$$\n\\begin{aligned}\n&S \\in\\{t, 0,1,2,3\\} \\\\\n&a \\in\\{\\text { \"go\", \"stop\" }\\}\n\\end{aligned}\n$$\nwhere the states refer to the robot stone being in either a terminal state (denoted as $t$) or within one of the four regions below: (image here)\n\n\nYou design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)\nand all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\\gamma=1$.\nYour competitor runs their robot through a first game, exhibiting the following experience:\n\\begin{tabular}{c|c|c|c|c} \nstep # & $s$ & $a$ & $r$ & $s^{\\prime}$ \\\\\n\\hline 1 & 0 & \"go\" & 1 & 1 \\\\\n2 & 1 & \"stop\" & 0 & $\\mathrm{t}$\n\\end{tabular}\nYou perform Q-learning updates based on the experience above. 
After observing steps 1 and 2 (the first game), what is the learned $Q(0$, \"go\" $)$ ?\n\nSolution: We know $Q(s, a):=(1-\\alpha) Q(s, a)+\\alpha\\left(r+\\gamma \\max _{a^{\\prime}} Q\\left(s^{\\prime}, a^{\\prime}\\right)\\right)$. So step #1 causes the following update:\n$$\nQ(0, \\text { \"go\" })=0.5 \\cdot 0+0.5(1+1 \\cdot 0)=0.5\n$$\nWhat is the learned $Q(1$, \"stop\" $)$ ?",
       "Solution": "$$\nQ(1, \\text { \"stop\" })=0.5 \\cdot 0+0.5(0+1 \\cdot 0)=0\n$$"
}