{
       "Semester": "Spring 2021",
       "Question Number": "8",
       "Part": "c",
       "Points": 1.0,
       "Topic": "Reinforcement Learning",
       "Type": "Text",
       "Question": "You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).\nThe following transitions happen with 100% probability:\ns(3) to s(5)\ns(9) to s(1)\ns(8) to s(1)\ns(6) to s(2)\nYou have two actions: climb and quit. If you climb from state s(i), then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb, there is a 0.5 chance you'll end up in square s(2) (because you move up one to s(6), which transitions to s(2)) and a 0.5 chance you'll end up in square s(7). If you climb from s(7), then you will go back to square s(1) with probability 1.0. If you quit, then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. Assume we know the rewards, but not the transition model. During learning, after executing the quit action and getting a reward, we go back to square s(1) and a new learning episode is initialized. We will use a discount factor of \u03b3 = 1. If we do purely greedy action selection during Q-learning (that is, $\\epsilon = 0$), starting from all 0\u2019s in our Q table and where ties are broken in favor of the quit action, what (roughly) will the Q function be after 1000 steps?",
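       "Solution": "It will be all 0 except Q(s(1), quit) = 1 (roughly). Every episode starts in s(1), where the initial tie is broken in favor of quit; quitting yields reward 1, so quit remains the greedy action in s(1) and no other state-action pair is ever visited or updated."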
}