{
       "Semester": "Spring 2022",
       "Question Number": "7",
       "Part": "d",
       "Points": 2.0,
       "Topic": "Reinforcement Learning",
       "Type": "Text",
       "Question": "Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor is $\\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\\left(s, a, r, s^{\\prime}\\right)$: $$ Q(s, a):=(1-\\alpha) Q(s, a)+\\alpha\\left(r+\\gamma \\max _{a^{\\prime}} Q\\left(s^{\\prime}, a^{\\prime}\\right)\\right) $$ Let $\\alpha=1$. Assume we see the following state-action-reward sequence:\nA, Move, 0\nB, Move, 0\nC, Move, 1\nA, Move, 0\nB, Move, 0\nA, Move, 0\nB, Move, 0\nWith Q-values all starting at 0, we run the Q-learning algorithm on that state-action-reward sequence. What problem with our algorithm is revealed by this example? Very briefly explain a small change to the method or parameters we are using that will solve this problem.",
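       "Solution": "The transitions are stochastic (Move from $B$ leads to $C$ the first time and to $A$ the second time), but with $\\alpha=1$ each update completely overwrites the previous Q-value with the most recent sample instead of averaging over experience. Use a smaller learning rate ($\\alpha<1$)."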
}