{
       "Question number": "6",
       "Sub-Question number": "e",
       "Question": "The update rule for Q-learning is as follows, where $\\alpha$ is the learning rate and $\\gamma$ is the discount factor:\n$$\nQ(s, a) \\leftarrow Q(s, a)+\\alpha\\left(r+\\gamma \\cdot \\max _{a^{\\prime}} Q\\left(s^{\\prime}, a^{\\prime}\\right)-Q(s, a)\\right).\n$$\nSuppose the agent takes action left in state1 and transitions to state2. Which $Q$-value (or $Q$-values) is updated, and which state $s^{\\prime}$ and action $a^{\\prime}$ are used in the update?\n",
       "Solution": "The agent updates only Q(state1, left). The update uses s' = state2 and a' = right, i.e. the value Q(state2, right), because the max in the update rule selects the highest-valued action in state2."
}