{
       "Semester": "Fall 2019",
       "Question Number": "2",
       "Part": "c",
       "Points": 3.6,
       "Topic": "Reinforcement Learning",
       "Type": "Image",
       "Question": "You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.\n(image here)\nYour model for the state and action spaces is as follows:\n$$\n\\begin{aligned}\n&S \\in\\{t, 0,1,2,3\\} \\\\\n&a \\in\\{\\text { \"go\", \"stop\" }\\}\n\\end{aligned}\n$$\nwhere the states refer to the robot stone being in either a terminal state (denoted as $t$) or within one of the four regions below: (image here)\n\n\nYou design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)\nand all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\\gamma=1$.\nUnfortunately, your competitor has also designed a robot stone. You do not know your competitor's reward structure $R(s, a)$ or transition model $T\\left(s, a, s^{\\prime}\\right)$; however, you do know they use the same state and action spaces. Instead, you decide to use Q-learning to observe their robot stone and learn from it! 
For your Q-learning, use discount factor $\\gamma=1$ and learning rate $\\alpha=0.5$, with a $Q$ table initialized to zero for all $(s, a)$ pairs.\nYour competitor runs their robot through a second game, exhibiting the following additional experience:\n\\begin{tabular}{c|c|c|c|c} \nstep \\# & $s$ & $a$ & $r$ & $s^{\\prime}$ \\\\\n\\hline 3 & 0 & \"go\" & 1 & 1 \\\\\n4 & 1 & \"go\" & 1 & 2 \\\\\n5 & 2 & \"go\" & 1 & 3 \\\\\n6 & 3 & \"stop\" & 2 & $t$\n\\end{tabular}\nYou perform additional Q-learning updates based on this additional experience. After completion of both games (all six steps), what is the full set of $Q$ values you have learned for their robot? Fill in the following table.\n(image here)\n",
       "Solution": "Image filling"
}