{
       "Question number": "6",
       "Sub-Question number": "a",
       "Question": "The update rule for Q-learning is $Q(s,a) \\leftarrow Q(s,a) + \\alpha \\left[r + \\gamma + \\max\\limits_{a^{'}}Q(s^{'},a^{'} - Q(s,a) \\right]$,\nwhere $s\u2032$ is the state you actually enter after performing action a in state s and r is the reward you actually receive. Consider two states $S = {s1, s2}$ and actions $A = {a1, a2}$, and current Q-values. \\begin{tabular}{ c c c }\n& $a1$ & $a2$ \\\\ \n$s1$ & 3 & 2 \\\\ \n$s2$ & 4 & 6 \n\\end{tabular}\n\\newline\nSuppose the agent is in state s1. Using $\\epsilon$-greedy, how would it decide to act?",
       "Solution": "The best action a1 is selected with probability 1 \u2212 $\\epsilon$. An action is selected at random (with uniform probability) with probability $\\epsilon$."
}