{
       "Question number": "6",
       "Sub-Question number": "b",
       "Question": "The update rule for Q-learning is $Q(s,a) \\leftarrow Q(s,a) + \\alpha \\left[r + \\gamma \\max\\limits_{a^{'}}Q(s^{'},a^{'}) - Q(s,a) \\right]$,\nwhere $s\u2032$ is the state you actually enter after performing action $a$ in state $s$ and $r$ is the reward you actually receive. Consider two states $S = \\{s1, s2\\}$ and two actions $A = \\{a1, a2\\}$, with the current Q-values given in the table. \\begin{tabular}{ c c c }\n& $a1$ & $a2$ \\\\ \n$s1$ & 3 & 2 \\\\ \n$s2$ & 4 & 6 \n\\end{tabular}\n\\newline\nSuppose the agent exploits in $s1$ and lands in $s2$. Which $Q$-value would be updated, and what is the value for $\\max\\limits_{a^{'}} Q(s\u2032, a\u2032)$ used in the update?",
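       "Solution": "Because the agent exploits in $s1$, it takes the greedy action there, which is $a1$ since $Q(s1, a1) = 3 > Q(s1, a2) = 2$. Thus $Q(s1, a1)$ will be updated. The agent lands in $s2$, so the value for $\\max\\limits_{a^{'}} Q(s\u2032, a\u2032)$ used in the update is the larger of $Q(s2, a1) = 4$ and $Q(s2, a2) = 6$, namely $Q(s2, a2) = 6$."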
}