{
       "Semester": "Fall 2019",
       "Question Number": "5",
       "Part": "g",
       "Points": 3.0,
       "Topic": "MDPs",
       "Type": "Image",
       "Question": "Consider the following simple MDP: Positive Reward\nFirst consider the case where the MDP has positive reward. In this scenario, there is only one action (next); we name this decision policy $\\pi_{A}$, with $\\pi_{A}(s)=\\text{next}$ for all $s$. The reward is $R(s, \\text{next})=0$ for all states $s$, except for state $s_{k}$, where the reward is $R(s_{k}, \\text{next})=10$. We always start at state $s_{1}$, and each arrow indicates a deterministic transition with probability $p=1$. There is no transition out of the end state $\\text{END}$, and 0 reward for any action from the end state.\nConsider the MDP below with negative rewards for some $R(s, a)$ and positive rewards for others. Now there are two actions, next and stop. The solid arrows show the probabilities of state transitions under action next; the dashed arrows show the probabilities of state transitions under action stop. (If there is no dashed arrow from a state, that indicates a probability $p=0$ of transitioning out of that state under action stop.) The corresponding rewards $R(s_{i}, a)$ are also indicated on the figure below. Note that the rewards are $R(s_{i}, \\text{next})=-1$ for all $s_{i}$, except for state $s_{4}$, where the reward is $R(s_{4}, \\text{next})=10$. Finally, under action stop, we have reward $R(s_{1}, \\text{stop})=r$ (some unknown value $r$), and $R(s, \\text{stop})=0$ for all other states. As before, we always start in state $s_{1}$. There is no transition out of the end state $\\text{END}$, and zero reward for any action from the end state, i.e., $R(\\text{END}, \\text{next})=R(\\text{END}, \\text{stop})=0$. Assume discount factor $\\gamma$ and infinite horizon.\nWe consider two possible policies: $\\pi_{A}(s)=\\text{next}$ for all $s$, and $\\pi_{B}(s)=\\text{stop}$ for all $s$. Your goal is to maximize your reward. When you start at $s_{1}$, you have reward 0 before taking any actions. 
Determine what $r$ should be so that it is best to run this MDP under policy $\\pi_{B}$ rather than policy $\\pi_{A}$. Give your answer as an expression for $r$ involving $p$ and $\\gamma$.",
       "Solution": "Under policy $\\pi_{A}$:\n$$\n\\begin{aligned}\n&V_{\\pi_{A}}(s_{4})=10 \\\\\n&V_{\\pi_{A}}(s_{3})=-1+p \\gamma V_{\\pi_{A}}(s_{4})+(1-p) \\gamma V_{\\pi_{A}}(\\text{END})=-1+10 p \\gamma \\\\\n&V_{\\pi_{A}}(s_{2})=-1+p \\gamma V_{\\pi_{A}}(s_{3})=-1-p \\gamma+10(p \\gamma)^{2} \\\\\n&V_{\\pi_{A}}(s_{1})=-1+p \\gamma V_{\\pi_{A}}(s_{2})=-1-p \\gamma-(p \\gamma)^{2}+10(p \\gamma)^{3}\n\\end{aligned}\n$$\nUnder policy $\\pi_{B}$, we simply have $V_{\\pi_{B}}(s_{1})=r$. So we should choose policy $\\pi_{B}$ when\n$$\nr>-1-p \\gamma-(p \\gamma)^{2}+10(p \\gamma)^{3}\n$$\nAs an example, for $\\gamma=1$ and $p=0.9$, the threshold is $-1-0.9-0.81+7.29=4.58$, so $\\pi_{B}$ is preferred when $r>4.58$."
}