{
       "Semester": "Fall 2019",
       "Question Number": "2",
       "Part": "e",
       "Points": 3.6,
       "Topic": "Reinforcement Learning",
       "Type": "Image",
       "Question": "You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.\n(image here)\nYour model for the state and action spaces is as follows:\n$$\n\\begin{aligned}\n&S \\in\\{t, 0,1,2,3\\} \\\\\n&a \\in\\{\\text { \"go\", \"stop\" }\\}\n\\end{aligned}\n$$\nwhere the states refer to the robot stone being in either a terminal state (denoted as $t$) or within one of the four regions below: (image here)\n\n\nYou design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)\nand all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\\gamma=1$.\nUnfortunately, your competitor has also designed a robot stone. You do not know your competitor's reward structure $R(s, a)$ or transition model $T\\left(s, a, s^{\\prime}\\right)$; however, you do know they use the same state and action spaces. Instead, you decide to use Q-learning to observe their robot stone and learn from it! 
For your Q-learning, use discount factor $\\gamma=1$ and learning rate $\\alpha=0.5$, with a Q-table initialized to zero for all $(s, a)$ pairs.\nUnfortunately, your robot's GPS system suddenly breaks, and it is no longer able to tell which of the four regions it is in. However, the robot has side cameras which can detect the opponent stones as it travels through the center of the ice, encoded as $[\\text{(number of stones to immediate left)} \\quad \\text{(number of stones to immediate right)}]^{T}$. You decide to use this information as state, giving the following feature transformation $\\phi_{B}$ on your original state:\n$$\n\\begin{aligned}\n\\phi_{B}(3) &=\\left[\\begin{array}{ll}\n1 & 1\n\\end{array}\\right]^{T} \\\\\n\\phi_{B}(2) &=\\left[\\begin{array}{ll}\n0 & 0\n\\end{array}\\right]^{T} \\\\\n\\phi_{B}(1) &=\\left[\\begin{array}{ll}\n1 & 0\n\\end{array}\\right]^{T} \\\\\n\\phi_{B}(0) &=\\left[\\begin{array}{ll}\n0 & 1\n\\end{array}\\right]^{T}\n\\end{aligned}\n$$\nWe would still like to come up with parameters $\\theta, \\theta_{0}$ such that $Q(s, \\text{\"go\"})=\\theta \\cdot \\phi_{B}(s)+\\theta_{0}$, for general values of $Q(s, \\text{\"go\"})$. Is there a setting of $\\theta, \\theta_{0}$ that enables representation of this encoding of $Q(s, \\text{\"go\"})$ with perfect accuracy? If so, provide the corresponding $\\theta$ and $\\theta_{0}$. If not, explain why this is not possible, and provide a feature transformation $\\phi_{C}(\\cdot)$ that does enable representation of $Q(s, \\text{\"go\"})=\\theta \\cdot \\phi_{C}\\left(\\phi_{B}(s)\\right)+\\theta_{0}$ with perfect accuracy.",
       "Solution": "No. Let $\\left[\\begin{array}{ll}x_{1} & x_{2}\\end{array}\\right]^{T}=\\phi_{B}(s)$, so $\\theta_{1} x_{1}+\\theta_{2} x_{2}+\\theta_{0}=Q(s, \\text{\"go\"})$. Then $\\phi_{B}(2)$ forces $\\theta_{0}=Q(2, \\text{\"go\"})$; $\\phi_{B}(1)$ forces $\\theta_{1}=Q(1, \\text{\"go\"})-\\theta_{0}$; and $\\phi_{B}(0)$ forces $\\theta_{2}=Q(0, \\text{\"go\"})-\\theta_{0}$. With all three parameters fixed, no freedom remains to fit $\\phi_{B}(3)$: it would require $\\theta_{1}+\\theta_{2}+\\theta_{0}=Q(3, \\text{\"go\"})$, that is, $Q(3, \\text{\"go\"})=Q(1, \\text{\"go\"})+Q(0, \\text{\"go\"})-Q(2, \\text{\"go\"})$, which does not hold for general values of $Q$.\n\nWe can create $\\phi_{C}$ as a one-hot encoding of state such that $\\phi_{C}\\left(\\phi_{B}(s)\\right)=\\phi_{A}(s)$ to uniquely identify our four states (with corresponding $\\theta$ and $\\theta_{0}$ as in the previous part) and regain perfect representational power.\n"
}