{
       "Semester": "Fall 2019",
       "Question Number": "2",
       "Part": "d",
       "Points": 3.6,
       "Topic": "Reinforcement Learning",
       "Type": "Image",
       "Question": "You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.\n(image here)\nYour model for the state and action spaces is as follows:\n$$\n\\begin{aligned}\n&S \\in\\{t, 0,1,2,3\\} \\\\\n&a \\in\\{\\text { \"go\", \"stop\" }\\}\n\\end{aligned}\n$$\nwhere the states refer to the robot stone being in either a terminal state (denoted as $t$) or within one of the four regions below: (image here)\n\n\nYou design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)\nand all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\\gamma=1$.\nUnfortunately, your competitor has also designed a robot stone. You do not know your competitor's reward structure $R(s, a)$ or transition model $T\\left(s, a, s^{\\prime}\\right)$; however, you do know they use the same state and action spaces. Instead, you decide to use Q-learning to observe their robot stone and learn from it! 
For your Q-learning, use discount factor $\\gamma=1$ and learning rate $\\alpha=0.5$, with a Q-table initialized to zero for all $(s, a)$ pairs.\nWe can think of learning the Q-value function for a given action as a regression problem with each state $s$ mapped to a one-hot feature vector $x=\\phi_{A}(s)$, where $x=\\left[\\begin{array}{llll}1 & 0 & 0 & 0\\end{array}\\right]^{T}$ for state 0, $x=\\left[\\begin{array}{llll}0 & 1 & 0 & 0\\end{array}\\right]^{T}$ for state 1, etc., and $x=\\left[\\begin{array}{llll}0 & 0 & 0 & 0\\end{array}\\right]^{T}$ for state $t$.\n\nWe'll focus on the action \"go\". We would like to come up with parameters $\\theta, \\theta_{0}$ such that $Q(s, \\text{\"go\"})=\\theta \\cdot \\phi_{A}(s)+\\theta_{0}=\\theta \\cdot x+\\theta_{0}$. Is there in general, for arbitrary values of $Q(s, \\text{\"go\"})$, a setting of $\\theta, \\theta_{0}$ that enables representation of $Q(s, \\text{\"go\"})$ with perfect accuracy? If so, provide the corresponding $\\theta$ and $\\theta_{0}$. If not, explain why. (Note that we do not need to model $Q(t, a)$, since the game is over once state $t$ has been reached.)",
       "Solution": "Yes; $\\theta_{i}$ is simply the value of $Q(s=i, \\text{\"go\"})$ and $\\theta_{0}=0$. Because $x$ is one-hot, $\\theta \\cdot x$ picks out exactly the component of $\\theta$ for the current state.\nNote: $\\theta=\\left[\\begin{array}{llll}5 & 4 & 3 & 0\\end{array}\\right]^{T}$ and $\\theta_{0}=0$ would work for our optimal $Q^{*}(s, a)$, but we seek a more general $\\theta$ corresponding to arbitrary $Q(s, a)$."
}