{
       "Question number": "3",
       "Sub-Question number": "d",
       "Question": "Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor is $\\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\\left(s, a, r, s^{\\prime}\\right)$: $$ Q(s, a):=(1-\\alpha) Q(s, a)+\\alpha\\left(r+\\gamma \\max _{a^{\\prime}} Q\\left(s^{\\prime}, a^{\\prime}\\right)\\right) $$ Let $\\alpha=1$. Assume we see the following state-action-reward sequence: \nA, Move, 0 \nB, Move, 0 \nC, Move, 1 \nA, Move, 0 \nB, Move, 0 \nA, Move, 0 \nB, Move, 0 \nWith Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. What problem with our algorithm is revealed by this example? Very briefly explain a small change to the method or parameters we are using that will solve this problem.",
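       "Solution": "With $\\alpha=1$, each update completely overwrites the old estimate of $Q(s, a)$ with the most recent sample. The sequence shows that Move is not deterministic (from $B$ it leads to $C$ the first time and to $A$ the second time), so the Q-values track only the latest outcome instead of averaging over outcomes. Use a smaller learning rate ($\\alpha<1$)."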
}