Reinforcement Learning Part 2: Deep Q Learning (DQN)

This notebook continues the policy gradient notebook with Q learning. We will use the CartPole-v0 task. To make the notebook self-contained, we repeat the background cells below.

In the CartPole problem, the cart is pushed to the right or left by a force of +1 or -1, and the goal is to prevent the attached pole from falling over.


At each step $t$, our agent has to decide on an action $a_t \in \{0, 1\}$ - moving the cart left or right - based on the current state $s_t$ of the environment. The state space is 4-dimensional, i.e. $s_t \in \mathbb{R}^{4}$, and its components describe the cart and the pole:

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0 | Cart Position | -2.4 | 2.4 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~ -41.8° | ~ 41.8° |
| 3 | Pole Velocity At Tip | -Inf | Inf |

Given the action $a_t$, the environment transitions to a new state $s_{t+1}$ and returns a reward $r_t \in \{0, +1\}$ that indicates the consequence of the action. That is, a reward of +1 is provided for every timestep that the pole remains upright, and 0 means the episode terminates, i.e. the pole tips too far or the cart moves too far away from the center. A good policy (of the agent), denoted $\pi_\theta$, balances the pole as long as it can. The policy tells the agent which action $a$ to take in state $s$ by outputting a conditional probability distribution over actions: $\pi(a|s; \theta) = p_{\theta}(A=a|S=s)$.

The goal of RL is to find $\pi^*$, the optimal policy, that maximizes total rewards. Let's first set up the environment:
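A minimal setup sketch, assuming OpenAI Gym is installed; the notebook's actual setup cell may differ (e.g. extra seeding or rendering options):

```python
import gym
import torch

env = gym.make('CartPole-v0')
state = env.reset()           # 4-d observation (older Gym API returns the state directly)
print(env.observation_space)  # Box(4,)
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```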

Some more utility functions (e.g. plotting) and setup.
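For example, a plotting helper along these lines tracks how long each episode lasts (the names `episode_durations` and `plot_durations` are illustrative, not necessarily the notebook's):

```python
import matplotlib.pyplot as plt

episode_durations = []

def plot_durations():
    """Plot how long the pole stayed up in each episode so far."""
    plt.figure(1)
    plt.clf()
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(episode_durations)
    plt.pause(0.001)  # brief pause so the figure refreshes inside the notebook
```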

Deep Q-Learning

Recall that in the policy gradient notebook we described an approach that directly optimizes a policy online, meaning we learn while we explore. Q-learning takes a different approach.

Instead of learning the policy directly, Q-learning's goal is to estimate an optimal action-value function $Q^*: \text{State} \times \text{Action} \rightarrow \mathbb{R}$, which tells us the maximum expected return obtainable by taking an action $a$ in a given state $s$. Given $Q^*$, constructing the optimal policy $\pi^*$ is easy, since all we need to do is pick the action that maximizes the value:

$$ \pi^*(s) = \arg \max_a Q^*(s,a)$$

Therefore, the task reduces to learning $Q^*$. This tutorial uses the algorithm from Human-level control through deep reinforcement learning, known as DQN, where $Q^*$ is approximated by a neural network with learnable parameters $\theta$ that takes $s$ as input and outputs a vector mapping each action (i.e. left or right) to its corresponding $Q$ value:

$$ Q^*(s, \text{left}), Q^*(s, \text{right}) = DQN(s; \theta) $$

In this example, a simple multilayer perceptron (MLP) is used as the $DQN$, i.e. as $Q(s, a; \theta)$. Writing it out in Kokoyi is easy; note that the parameters are encapsulated in the two-layer MLP passed in after ";". Also note that 1) $a$ does not appear as an explicit input since we compute a vector indexed by $a$, and 2) we don't unpack the MLP module but directly reference it for the computation (you can certainly do that, too):
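For intuition, a rough plain-PyTorch equivalent of such a two-layer MLP might look like the sketch below (the hidden size of 128 is an assumption, not necessarily what the notebook uses):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Q(s; theta): maps a 4-d state to a vector of Q values, one per action."""
    def __init__(self, state_dim=4, hidden_dim=128, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, s):
        return self.net(s)  # shape: (batch, n_actions)
```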

The optimal action-value function $Q^*$ obeys an important invariant known as the Bellman equation:

$$ Q^*(s,a) = r + \gamma \max_{a'}Q^*(s_{next}, a')$$

where $s$ and $s_{next}$ are the current state and the next state, and $r$ is the reward of the action $a$. In plain English, this says that the optimal action value of taking action $a$ is the immediate reward $r$, plus the best value among all actions in the next state the agent will be in, with the second term multiplied by a discount factor $\gamma$. $\gamma$ describes how greedy we want the policy to be: if $\gamma$ is 0, the agent is only interested in the immediate reward at this step and ignores the future, whereas a larger $\gamma$ forces the agent to pay attention to future returns.

Since $Q(s;\theta)$ is a function approximator for $Q^*$, we can train the network by minimizing the loss function $L(\theta)$,

$$ L(\theta) = [Q(s, a; \theta) - (r + \gamma\max_{a'}Q(s_{next}, a'; \theta^{-}))]^2 $$

In DQN, we have two networks: the main network $Q$, where we compute the loss and update, and a slowly updated target network $Q_t$ used to compute the target value (the second term in the loss function). These two networks are identical except for their parameters: $Q$ has $\theta$ whereas $Q_t$ has $\theta^{-}$. $\theta^{-}$ is frozen most of the time, but is set to a copy of $\theta$ every $C$ steps.

The Kokoyi code is straightforward; we use mean squared error to compute the loss.
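For reference, an equivalent plain-PyTorch computation of this loss could look like the sketch below (`policy_net` and `target_net` stand in for $Q$ and $Q_t$; terminal states, which need a masked target, are handled later in the optimization step):

```python
import torch
import torch.nn.functional as F

def dqn_loss(policy_net, target_net, s, a, r, s_next, gamma):
    """Mean squared error between Q(s, a; theta) and the frozen TD target."""
    q_sa = policy_net(s).gather(1, a).squeeze(1)                # Q(s, a; theta), a has shape (batch, 1)
    with torch.no_grad():                                       # theta^- is not updated here
        target = r + gamma * target_net(s_next).max(1).values  # r + gamma * max_a' Q_t(s_next, a')
    return F.mse_loss(q_sa, target)
```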

Now let's define how the agent picks an action. If our agent always selects the action that gives the maximum (estimated) return, it may fail to explore new options which might pay off in the long term. This is not fixed by a larger discount factor $\gamma$, since that only rescales the values of all actions in the next state uniformly and does not change which one looks best.

DQN employs another classical trick, the epsilon-greedy algorithm, which balances exploration (of unknown territory) and exploitation (of the current best option). The idea is simple: with a controlled probability, the agent picks an action uniformly at random. See the relevant section describing the epsilon-greedy policy in the reinforcement learning (RL) tutorial of PyTorch.

The probability of choosing a random action starts at $p_0$ and decays exponentially towards $p_{T}$; the rate of the decay is controlled by the decay constant $\tau$ (measured in steps).
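A common implementation, following the PyTorch RL tutorial, decays the exploration probability as $\epsilon_t = p_T + (p_0 - p_T)\, e^{-t/\tau}$. The sketch below assumes the `DQN` sketch above and illustrative values $p_0 = 0.9$, $p_T = 0.05$, $\tau = 200$:

```python
import math
import random
import torch

steps_done = 0

def select_action(state, policy_net, p_0=0.9, p_T=0.05, tau=200):
    """Epsilon-greedy: explore with probability eps_t, otherwise act greedily."""
    global steps_done
    eps_t = p_T + (p_0 - p_T) * math.exp(-steps_done / tau)  # decays from p_0 towards p_T
    steps_done += 1
    if random.random() < eps_t:
        return torch.tensor([[random.randrange(2)]], dtype=torch.long)  # random action
    with torch.no_grad():
        return policy_net(state).max(1).indices.view(1, 1)              # greedy action
```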

You can let Kokoyi set up the initialization for these modules (just copy and paste and then fill in what's needed):

Click here to see the default initialization code generated by Kokoyi for this model (you can use the button above to insert such a cell while at a Kokoyi cell):
```python
class Q(torch.nn.Module):
    def __init__(self):
        """ Add your code for parameter initialization here (not necessarily the same names)."""
        super().__init__()
        self.Linears = None

    def get_parameters(self):
        """ Change the following code to return the parameters as a tuple in the order of (Linears)."""
        return None

    forward = kokoyi.symbol["Q"]


class loss(torch.nn.Module):
    def __init__(self):
        """ Add your code for parameter initialization here (not necessarily the same names)."""
        super().__init__()
        self._gamma = None

    def get_parameters(self):
        """ Change the following code to return the parameters as a tuple in the order of (\gamma)."""
        return None

    forward = kokoyi.symbol["loss"]


class SelectAction(torch.nn.Module):
    def __init__(self):
        """ Add your code for parameter initialization here (not necessarily the same names)."""
        super().__init__()
        self.p_0 = None
        self.p_T = None
        self._tau = None

    def get_parameters(self):
        """ Change the following code to return the parameters as a tuple in the order of (p_0, p_T, \tau)."""
        return None

    forward = kokoyi.symbol["SelectAction"]
```

Here are the completed module definitions:
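The notebook's completed cell is not reproduced here; one plausible way to fill in the generated templates (the layer sizes and the values $\gamma = 0.99$, $p_0 = 0.9$, $p_T = 0.05$, $\tau = 200$ are illustrative assumptions) is:

```python
import torch
import kokoyi  # assumed available in the notebook environment


class Q(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Two-layer MLP: 4-d state -> hidden layer -> one Q value per action.
        self.Linears = torch.nn.ModuleList([
            torch.nn.Linear(4, 128),
            torch.nn.Linear(128, 2),
        ])

    def get_parameters(self):
        return (self.Linears,)

    forward = kokoyi.symbol["Q"]


class loss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._gamma = 0.99  # discount factor

    def get_parameters(self):
        return (self._gamma,)

    forward = kokoyi.symbol["loss"]


class SelectAction(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.p_0 = 0.9    # initial exploration probability
        self.p_T = 0.05   # final exploration probability
        self._tau = 200   # decay constant (in steps)

    def get_parameters(self):
        return (self.p_0, self.p_T, self._tau)

    forward = kokoyi.symbol["SelectAction"]
```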

One final important trick in DQN is that training does not optimize the network on transitions as they are being played out. Instead, it deposits the experiences in a replay memory, and we train by sampling from it. This way the transitions that build up a batch are decorrelated, and gradient descent works better. We use the Transition class and ReplayMemory class from the reinforcement learning (RL) tutorial of PyTorch.
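For completeness, those two classes look roughly like this (adapted from the PyTorch RL tutorial):

```python
import random
from collections import namedtuple, deque

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

class ReplayMemory:
    """Fixed-size buffer of transitions; random sampling breaks temporal correlation."""
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```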

And here is how the optimization step takes transitions out of the memory:
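The notebook's actual Kokoyi cell is not shown here; a plain-PyTorch sketch of such an optimization step (the names `policy_net`, `target_net`, `memory` and `Transition` refer to the sketches above and are assumptions) might be:

```python
import torch
import torch.nn.functional as F

def optimize_model(policy_net, target_net, memory, optimizer, batch_size=128, gamma=0.99):
    """Sample a decorrelated batch from the replay memory and take one gradient step."""
    if len(memory) < batch_size:
        return
    transitions = memory.sample(batch_size)
    batch = Transition(*zip(*transitions))  # list of Transitions -> Transition of lists

    # Terminal transitions have next_state == None; their target is just the reward.
    non_final_mask = torch.tensor([s is not None for s in batch.next_state], dtype=torch.bool)
    non_final_next = torch.cat([s for s in batch.next_state if s is not None])

    state_batch = torch.cat(batch.state)    # (batch, 4)
    action_batch = torch.cat(batch.action)  # (batch, 1)
    reward_batch = torch.cat(batch.reward)  # (batch,)

    # Q(s, a; theta) for the actions that were actually taken.
    q_sa = policy_net(state_batch).gather(1, action_batch).squeeze(1)

    # r + gamma * max_a' Q(s_next, a'; theta^-), using the frozen target network.
    next_values = torch.zeros(batch_size)
    with torch.no_grad():
        next_values[non_final_mask] = target_net(non_final_next).max(1).values
    target = reward_batch + gamma * next_values

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```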

The main loop -- we simply let the agent play along, and optimize from the experiences!
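Putting the PyTorch sketches above together, a training loop could look like the following (the episode count, learning rate, memory size and target-update period $C$ are illustrative assumptions):

```python
import torch

num_episodes = 300
C = 10  # copy theta into theta^- every C episodes

policy_net = DQN()
target_net = DQN()
target_net.load_state_dict(policy_net.state_dict())  # start with identical parameters
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
memory = ReplayMemory(10000)

for episode in range(num_episodes):
    state = torch.tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
    for t in range(1, 1000):
        action = select_action(state, policy_net)
        obs, reward, done, _ = env.step(action.item())  # older Gym API returns a 4-tuple
        next_state = None if done else torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
        memory.push(state, action, next_state, torch.tensor([reward], dtype=torch.float32))
        state = next_state
        optimize_model(policy_net, target_net, memory, optimizer)
        if done:
            episode_durations.append(t)
            plot_durations()
            break
    if episode % C == 0:
        target_net.load_state_dict(policy_net.state_dict())  # theta^- <- theta
```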

After training, we can reset the environment and do a test run to see how the agent performs.
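A sketch of such a test rollout, using only greedy actions (no exploration); `env.render()` may require a display:

```python
import torch

state = torch.tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
total_reward, done = 0.0, False
while not done:
    with torch.no_grad():
        action = policy_net(state).max(1).indices.item()  # always take the greedy action
    obs, reward, done, _ = env.step(action)
    total_reward += reward
    state = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
    env.render()
print('Test episode return:', total_reward)
env.close()
```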