{
       "Question number": "3",
       "Sub-Question number": "b",
       "Question": "Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\\left(s, a, r, s^{\\prime}\\right)$ : $$ Q(s, a):=(1-\\alpha) Q(s, a)+\\alpha\\left(r+\\gamma \\max _{a^{\\prime}} Q\\left(s^{\\prime}, a^{\\prime}\\right)\\right) $$ Let $\\alpha=1$. Assume we see the following state-action-reward sequence: \nA, Move, 0 \nB, Move, 0 \nC, Move, 1 \nA, Move, 0 \nB, Move, 0. With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Characterize the weakness of Q-learning demonstrated by this example, which would be worse if there were a long sequence of states $B_{1}, \\ldots, B_{100}$ between A and C. Very briefly describe a strategy for overcoming this weakness. ",
       "Solution": "It doesn't propagate the value all the way back the chain. Do the updates backward along the trajectory; or save your experience and replay it."
}