This notebook continues the policy gradient notebook with Q-learning. We will use the CartPole-v0 task. To make the notebook self-contained, we repeat the background cells below.
In the CartPole problem, the cart is pushed to the right or left by a force of +1 or -1, and the goal is to prevent the attached pole from falling over.
At each step $t$, our agent has to decide on an action $a_t \in \{0, 1\}$ - moving the cart left or right - based on the current state $s_t$ of the environment. The state space is 4-dimensional, i.e. $s_t \in \mathbb{R}^{4}$; its components describe the cart and the pole:
Num | Observation | Min | Max |
---|---|---|---|
0 | Cart Position | -2.4 | 2.4 |
1 | Cart Velocity | -Inf | Inf |
2 | Pole Angle | ~ -41.8° | ~ 41.8° |
3 | Pole Velocity At Tip | -Inf | Inf |
Given the action $a_t$, the environment will transition to a new state $s_{t+1}$ and also return a reward $r_t \in \{0, +1\}$ that indicates the consequence of the action. That is, a reward of +1 is provided for every timestep that the pole remains upright, and 0 means the environment terminates, i.e. the pole tips too far or the cart moves too far away from the center. A good policy (of the agent), called $\pi_\theta$, balances the pole as long as it can. The policy tells the agent which action $a$ to take in state $s$ by outputting a conditional probability distribution over actions: $\pi(a|s; \theta) = p_{\theta}(A=a|S=s)$.
The goal of RL is to find $\pi^*$, the optimal policy, that maximizes total rewards. Let's first set up the environment:
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count
from PIL import Image
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import kokoyi
# Set up the CartPole environment
import gym
env = gym.make('CartPole-v0')
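Before moving on, it may help to poke at the environment once. The short sketch below is optional and assumes the classic gym API used throughout this notebook (`reset` returns just the observation, `step` returns `(observation, reward, done, info)`):
# Quick sanity check of the environment interface (classic gym API).
obs = env.reset()
print('initial state:', obs)            # 4-D: [cart position, cart velocity, pole angle, pole tip velocity]
obs, reward, done, _ = env.step(0)      # action 0 pushes the cart to the left
print('after one step:', obs, 'reward:', reward, 'done:', done)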
Some more utility functions (e.g. plotting) and setup.
plt.ion()
episode_durations = []
def plot_durations():
plt.figure(2)
plt.clf()
durations_t = torch.tensor(episode_durations, dtype=torch.float)
plt.title('Training...')
plt.xlabel('Episode')
plt.ylabel('Duration')
plt.ylim(0, 200+5)
plt.plot(durations_t.numpy())
plt.pause(0.001)
if torch.cuda.is_available():
device = torch.device('cuda')
else:
device = torch.device('cpu')
kokoyi.set_rt_device(device)
print('Using device: ', device)
Recall that in the policy gradient notebook we described an approach to directly optimize a policy online, meaning we learn while we explore. Q-learning takes a different approach.
Instead of learning the policy directly, Q-learning estimates an optimal action-value function $Q^*: \text{State} \times \text{Action} \rightarrow \mathbb{R}$ that tells us the maximum expected return from taking an action $a$ in a given state $s$. Given $Q^*$, constructing the optimal policy $\pi^*$ is easy, since all we need to do is pick the action that maximizes our rewards:
$$ \pi^*(s) = \arg \max_a Q^*(s,a)$$
Therefore, the task reduces to learning $Q^*$. This tutorial uses the algorithm from Human-level control through deep reinforcement learning, known as DQN, where $Q^*$ is approximated by a neural network with learnable parameters $\theta$ that takes the state $s$ as input and outputs a vector mapping each action (i.e. left or right) to its corresponding $Q$ value. In this example, a simple multilayer perceptron (MLP) is used as $Q(s, a; \theta)$. Writing it out in Kokoyi is easy; note that the parameters are encapsulated in the two-layer MLP passed in after the ";". Also note that 1) $a$ does not appear as an input since we compute a vector indexed by $a$, and 2) we don't unpack the MLP module but reference its layers directly in the computation (you can certainly do that, too):
%kokoyi
\Module{Q}{s; Linears}
\Return Linears[1](\ReLU(Linears[0](s))) \\
\EndModule
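If you prefer to read the same network in plain PyTorch, here is an equivalent sketch. It is for illustration only; the notebook itself trains the Kokoyi module above, and the class name `PlainQ` and the hidden size of 128 (matching the completed module later) are just assumptions:
# A plain-PyTorch sketch equivalent to the Kokoyi Q module above:
# a two-layer MLP that maps a state to one Q value per action.
class PlainQ(torch.nn.Module):
    def __init__(self, input_dim, num_actions, hidden_dim=128):
        super().__init__()
        self.fc1 = torch.nn.Linear(input_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, num_actions)

    def forward(self, s):
        return self.fc2(F.relu(self.fc1(s)))   # shape: (..., num_actions)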
The optimal action-value function $Q^*$ obeys an important invariant known as the Bellman equation:
$$ Q^*(s,a) = r + \gamma \max_{a'}Q^*(s_{next}, a')$$
where $s$ and $s_{next}$ are the current and next states, and $r$ is the reward for taking action $a$. In plain English, this says that the optimal action value of $a$ is the immediate reward $r$, plus the best value among all the actions available in the next state the agent will be in, with the second term multiplied by a discount factor $\gamma$. $\gamma$ describes how greedy we want the policy to be: if $\gamma$ is 0, the agent is only interested in the immediate reward at this step and ignores the future, whereas a larger $\gamma$ forces the agent to pay attention to future return.
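To make the discount concrete, here is a small hypothetical example with $\gamma = 0.9$: if taking action $a$ in state $s$ yields reward $r = 1$, and the next-state values are $Q^*(s_{next}, \text{left}) = 0.5$ and $Q^*(s_{next}, \text{right}) = 2.0$, the Bellman equation gives
$$ Q^*(s,a) = 1 + 0.9 \times \max(0.5,\ 2.0) = 2.8 $$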
Since $Q(s, a; \theta)$ is a function approximator of $Q^*$, we can train the network by minimizing the loss function $L(\theta)$,
$$ L(\theta) = \left[Q(s, a; \theta) - \left(r + \gamma\max_{a'}Q(s_{next}, a'; \theta^{-})\right)\right]^2 $$
In DQN we keep two networks: the main network $Q$, on which we compute the loss and update the parameters, and a slowly updated target network $Q_t$ used to compute the target value (the second term in the loss function). These two networks are identical other than their parameters: $Q$ has $\theta$ whereas $Q_t$ has $\theta^{-}$. $\theta^{-}$ is frozen most of the time, but takes a copy of $\theta$ every $C$ steps.
The Kokoyi code is straightforward; we use the mean squared error to compute the loss.
%kokoyi
\Module {DQNLoss} {Q, Q_t, s, s_{next}, a, r; \gamma}
\Return \MSELoss(Q(s)[a], r + \gamma * \Max(Q_t(s_{next}))) \\
\EndModule
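For readers who want to see the same computation without Kokoyi, here is a plain-PyTorch sketch of this loss for a batch of transitions. It is illustrative only and not used by the training code below; it assumes plain-PyTorch Q networks such as the PlainQ sketch above, and the names `policy_q`, `target_q`, and `dqn_loss_sketch` are hypothetical:
# A plain-PyTorch sketch of the DQN loss above, for a batch of transitions.
# s, s_next: (B, 4) float tensors; a: (B,) long tensor; r: (B,) float tensor.
def dqn_loss_sketch(policy_q, target_q, s, s_next, a, r, gamma):
    q_sa = policy_q(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
    with torch.no_grad():                                          # theta^- is frozen for the target
        target = r + gamma * target_q(s_next).max(dim=1).values    # r + gamma * max_a' Q_t(s_next, a')
    return F.mse_loss(q_sa, target)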
Now let's define how the agent picks an action. If our agent always selects the action that gives the maximum (estimated) return, it may fail to explore new options which might pay off in the long term. A larger discount factor $\gamma$ does not fix this, since all it does is scale the values of all actions in the next state uniformly.
DQN employs another classical trick called the epsilon-greedy algorithm, which balances exploration (of unknown territory) and exploitation (of the current best options). The idea is simple: with a controlled probability, the agent just picks an action at random. See the relevant section describing the epsilon-greedy policy in the reinforcement learning (RL) tutorial of PyTorch.
The probability of choosing a random action will start at $p_0$ and decay exponentially towards $p_{T}$; the rate of the decay is controlled by the decay constant $\tau$.
%kokoyi
\Module{SelectAction}{s, Q, t; p_0, p_T, \tau}
c \gets \Rand(1) \\
p_t \gets p_T + \exp(-\frac{t}{\tau}) * (p_0 - p_T) \\
v \gets Q(s) \\
a \gets \begin{cases}
\Argmax(v) & c > p_t \\
\RandInt(0, |v|) & otherwise \\
\end{cases}\\
\Return a \\
\EndModule
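To get a feel for the schedule, the short check below evaluates $p_t$ at a few step counts, using the same $p_0$, $p_T$, and $\tau$ values we will use for training later (plain NumPy, independent of the Kokoyi module):
# Exploration probability p_t = p_T + exp(-t / tau) * (p_0 - p_T) at a few steps.
p_0, p_T, tau = 1.0, 0.1, 200
for t in [0, 200, 1000]:
    p_t = p_T + np.exp(-t / tau) * (p_0 - p_T)
    print(f't = {t:4d}   p_t = {p_t:.3f}')   # decays from 1.0 towards 0.1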
You can let Kokoyi set up the initialization skeletons for the modules (just copy and paste, and then fill in what's needed):
class Q(torch.nn.Module):
    def __init__(self):
        """ Add your code for parameter initialization here (not necessarily the same names)."""
        super().__init__()
        self.Linears = None

    def get_parameters(self):
        """ Change the following code to return the parameters as a tuple in the order of (Linears)."""
        return None

    forward = kokoyi.symbol["Q"]

class DQNLoss(torch.nn.Module):
    def __init__(self):
        """ Add your code for parameter initialization here (not necessarily the same names)."""
        super().__init__()
        self._gamma = None

    def get_parameters(self):
        """ Change the following code to return the parameters as a tuple in the order of (\gamma)."""
        return None

    forward = kokoyi.symbol["DQNLoss"]

class SelectAction(torch.nn.Module):
    def __init__(self):
        """ Add your code for parameter initialization here (not necessarily the same names)."""
        super().__init__()
        self.p_0 = None
        self.p_T = None
        self._tau = None

    def get_parameters(self):
        """ Change the following code to return the parameters as a tuple in the order of (p_0, p_T, \tau)."""
        return None

    forward = kokoyi.symbol["SelectAction"]
Here are the completed module definitions:
from kokoyi.nn import Linear
class Q(torch.nn.Module):
def __init__(self, input_dim, num_actions):
super().__init__()
self.Linears = torch.nn.ModuleList([
Linear(input_dim, 128),
Linear(128, num_actions)
])
def get_parameters(self):
return self.Linears
forward = kokoyi.symbol["Q"]
class DQNLoss(torch.nn.Module):
def __init__(self, gamma):
super().__init__()
self.gamma = gamma
def get_parameters(self):
return self.gamma
forward = kokoyi.symbol['DQNLoss']
class SelectAction(torch.nn.Module):
def __init__(self, p_0, p_T, tau):
super().__init__()
self.p_0 = p_0
self.p_T = p_T
self._tau = tau
def get_parameters(self):
return self.p_0, self.p_T, self._tau
forward = kokoyi.symbol['SelectAction']
One final important trick in DQN is that training does not optimize the network on experiences in the order the agent plays them out. Instead, it deposits the experiences in a replay memory and trains by sampling from it. This way the transitions that build up a batch are decorrelated, and gradient descent works better. We use the Transition class and ReplayMemory class from the reinforcement learning (RL) tutorial of PyTorch.
Transition = namedtuple('Transition',
('state', 'action', 'next_state', 'reward'))
class ReplayMemory(object):
def __init__(self, capacity):
self.memory = deque([],maxlen=capacity)
def push(self, *args):
"""Save a transition"""
self.memory.append(Transition(*args))
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
def __len__(self):
return len(self.memory)
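As a tiny usage sketch (dummy tensors only, just to show the push/sample interface before we wire it into training; `demo_buffer` and `demo_batch` are throwaway names):
# Push a few dummy transitions and sample a small decorrelated batch.
demo_buffer = ReplayMemory(capacity=10)
for _ in range(5):
    demo_buffer.push(torch.zeros(1, 4),      # state
                     torch.tensor([0]),      # action
                     torch.zeros(1, 4),      # next_state
                     torch.tensor([1.0]))    # reward
demo_batch = Transition(*zip(*demo_buffer.sample(3)))
print(len(demo_buffer), len(demo_batch.state))   # 5 3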
And here is how the optimization step takes transitions out of the memory:
input_dim = env.observation_space.shape[0]
num_actions = env.action_space.n
C = 100
batch_size = 32
gamma = 0.9
p_0, p_T, tau = 1.0, 0.1, 200
policy_Q = Q(input_dim, num_actions).to(device)
target_Q = Q(input_dim, num_actions).to(device)
action_loss = DQNLoss(gamma).to(device)
action_selector = SelectAction(p_0, p_T, tau).to(device)
optimizer = optim.Adam(policy_Q.parameters(), lr = 0.01)
replay_buffer = ReplayMemory(capacity=2000)
learn_step = 0
def optimize_model():
if len(replay_buffer) < batch_size:
return
# Sample a batch of data from replay memory
transitions = replay_buffer.sample(batch_size)
batch = Transition(*zip(*transitions))
state_batch = torch.cat(batch.state)
action_batch = torch.cat(batch.action)
reward_batch = torch.cat(batch.reward)
next_state_batch = torch.cat(batch.next_state)
# Compute the loss
loss = action_loss(policy_Q, target_Q,
state_batch, next_state_batch, action_batch, reward_batch,
batch_level=[0,0,1,1,1,1])
loss = loss.mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
global learn_step
learn_step += 1
# Update the target network with the policy network's parameters every C steps.
if learn_step % C == 0:
target_Q.load_state_dict(policy_Q.state_dict())
The main loop -- we simply let the agent play along, and optimize from the experiences!
num_episodes = 50
step_num = 0
for i_episode in range(num_episodes):
state = env.reset()
state = torch.tensor([state], device=device, dtype=torch.float)
for t in count():
action = action_selector(state[0], policy_Q, float(step_num)).cpu() #tensor([1])
step_num += 1
next_state, actual_reward, done, _ = env.step(action.item())
# Apply a novel reward to speedup the training
x, v, Angle, Angle_v = next_state
r1 = (env.x_threshold - abs(x)) / env.x_threshold - 0.8 # Control the position of the cart
r2 = (env.theta_threshold_radians - abs(Angle)) / env.theta_threshold_radians - 0.5 # Control the Angle of the pole
reward = r1 + r2
reward = torch.tensor([reward], dtype=torch.float, device=device)
next_state = torch.tensor([next_state], dtype=torch.float, device=device)
action = torch.tensor([action], dtype=torch.long, device=device)
# Add the new Transition to replay memory
replay_buffer.push(state, action, next_state, reward)
state = next_state
optimize_model()
if done or t == 200:
episode_durations.append(t + 1)
plot_durations()
break
print('Train Complete')
plt.ioff()
plt.savefig("Train_duration.png")
After training, we can reset the environment and do a test run to see the result.
import time
state = env.reset()
state = torch.tensor([state], device=device, dtype=torch.float)
env.render()
for t in count():
env.render()
action = action_selector(state[0], policy_Q, step_num).cpu()
next_state, _, done, _ = env.step(action.item())
state = torch.tensor([next_state], dtype=torch.float, device=device)
time.sleep(0.0001)
if done:
print('Duration is %d' %(t+1))
break