We provide two Kokoyi notebooks for reinforcement learning: the first covers policy gradient and the second Q-learning. In both, we train an agent in Kokoyi to balance a pole in the CartPole-v0 task from the OpenAI Gym. In the CartPole problem, the cart is pushed to the right or left by a force of +1 or -1, and the goal is to prevent the attached pole from falling over.
At each step $t$, our agent decides on an action $a_t \in \{0, 1\}$ (moving the cart left or right) based on the current state $s_t$ of the environment. The state space is 4-dimensional, i.e. $s_t \in \mathbb{R}^{4}$, with components describing the cart and the pole:
Num | Observation | Min | Max |
---|---|---|---|
0 | Cart Position | -2.4 | 2.4 |
1 | Cart Velocity | -Inf | Inf |
2 | Pole Angle | ~ -41.8° | ~ 41.8° |
3 | Pole Velocity At Tip | -Inf | Inf |
Given the action $a_t$, the environment transitions to a new state $s_{t+1}$ and returns a reward $r_t \in \{0, +1\}$ that reflects the consequence of the action. A reward of +1 is given for every timestep the pole remains upright, and 0 means the episode terminates, i.e. the pole tips too far or the cart moves too far from the center. A good policy, denoted $\pi_\theta$, keeps the pole balanced for as long as possible. The policy tells the agent which action $a$ to take in state $s$ by outputting a conditional probability distribution over actions: $\pi(a|s; \theta) = p_{\theta}(A=a|S=s)$.
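To make the policy output concrete, here is a minimal, self-contained sketch (the probabilities are made up rather than produced by a trained policy) of how a distribution over the two actions is turned into an action sample with torch.distributions.Categorical, the same mechanism the training loop uses later:

import torch
from torch.distributions import Categorical

# A hypothetical policy output for one state: 70% "left", 30% "right"
action_probs = torch.tensor([0.7, 0.3])
dist = Categorical(action_probs)    # categorical distribution over {0, 1}
action = dist.sample()              # sample an action index
log_prob = dist.log_prob(action)    # log pi(a|s), needed later for the PG loss
print(action.item(), log_prob.item())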
The goal of RL is to find the optimal policy $\pi^*$ that maximizes the total reward. Let's first set up the environment:
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count
from PIL import Image
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import kokoyi
# Set up the CartPole environment
import gym
env = gym.make('CartPole-v0')
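Before training, it is worth a quick sanity check that the environment exposes the state, action, and reward interface described above (this cell is optional and not needed for training):

print(env.observation_space.low, env.observation_space.high)  # bounds of the 4-D state
print(env.action_space.n)  # 2 discrete actions: push left, push right
state = env.reset()
next_state, reward, done, info = env.step(env.action_space.sample())  # take one random action
print(next_state, reward, done)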
Some more utility functions (e.g. plotting) and setup.
plt.ion()
episode_durations = []
def plot_durations():
    plt.figure(2)
    plt.clf()
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.ylim(0, 200 + 5)
    plt.plot(durations_t.numpy())
    plt.pause(0.001)

if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
kokoyi.set_rt_device(device)
print('Using device: ', device)
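Optionally, you can also fix the random seeds so that runs are repeatable (the seed value 0 here is arbitrary):

seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
env.seed(seed)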
To find an optimal behavior strategy for the agent, PG (policy gradient) models and optimizes the policy $\pi(a|s)$ directly. Here we use a simple multilayer perceptron (MLP) as the policy $\pi(a|s)$: it takes the state $s$ as input and outputs a probability distribution over actions $a$:
%kokoyi
\Module{\pi}{s; Linears}
\Return \Softmax(Linears[1](\ReLU(Linears[0](s)))) \\
\EndModule
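If you are more familiar with plain PyTorch, the Kokoyi module above corresponds roughly to the sketch below (reference only: the class name MLPPolicy is ours, and the hidden width of 128 simply matches the module initialization we fill in later; the notebook itself uses the Kokoyi module):

class MLPPolicy(torch.nn.Module):
    """Plain-PyTorch sketch of the Kokoyi policy module (not used for training)."""
    def __init__(self, input_dim, num_actions, hidden_dim=128):
        super().__init__()
        self.linears = torch.nn.ModuleList([
            torch.nn.Linear(input_dim, hidden_dim),
            torch.nn.Linear(hidden_dim, num_actions),
        ])

    def forward(self, s):
        h = torch.relu(self.linears[0](s))
        # Softmax over the last dimension gives a distribution over actions
        return torch.softmax(self.linears[1](h), dim=-1)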
Policy Gradient is a classic RL method that optimizes a parameterized policy directly by following the gradient of the expected return. For policy parameters $\theta$, the objective is the expected discounted sum of all future rewards,

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[G_t\right], \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[G_t \, \nabla_\theta \log \pi_\theta(a_t|s_t)\right],$$

where $G_t$ is the discounted future reward from timestep $t$, i.e., $G_t=\sum_{i=0}^{\infty}\gamma^i r_{t+i}$. You can see the proof of this gradient identity (the policy gradient theorem) here.
(A side note: you might notice some resemblance to a classifier trained with binary cross-entropy loss, except that the reward now takes the place of the label.)
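As a quick numeric illustration of $G_t$ (the numbers are chosen arbitrarily): with $\gamma = 0.9$ and rewards $[1, 1, 1]$, the discounted returns are $G_0 = 1 + 0.9 + 0.81 = 2.71$, $G_1 = 1.9$ and $G_2 = 1$. The short sketch below computes the same thing:

gamma = 0.9
rewards = [1.0, 1.0, 1.0]
T = len(rewards)
# G_t = sum_i gamma^i * r_{t+i}, truncated at the end of the episode
G = [sum(gamma ** i * rewards[t + i] for i in range(T - t)) for t in range(T)]
print(G)  # approximately [2.71, 1.9, 1.0]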
Writing the PG loss in Kokoyi is easy. Let's take the log probability $\log \pi_\theta(a_t|s_t)$ as the input $logP$. One known issue with vanilla PG is that it suffers from high variance, so one usually subtracts a baseline, which here is simply the average discounted reward (the module below also normalizes by the standard deviation).
%kokoyi
\Module{PGLoss}{logP, r; \gamma}
\mu \gets \Mean \\
\sigma(x) \gets \sqrt{\frac{\sum_{i=0}^{|x|-1}{(x[i] -\bar{x})**2}}{|x|}} \where \bar{x} \gets \mu(x) \Comment{Compute standard deviation} \\
T \gets |r| \\
G_t \gets \{\sum_{i=0}^{T - 1 - t}{\gamma ** i * r[t+i]}\}_{t=0}^{T-1} \Comment{Discounted reward}\\
\bar{G_t} \gets \frac{G_t - \mu(G_t)}{\sigma(G_t)} \Comment{Normalize discounted reward}\\
\Return -\bar{G_t} * logP \\
\EndModule
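For readers less used to the Kokoyi notation, a plain-PyTorch sketch of the same computation is shown below (reference only: the helper name is ours, and the small eps added to the standard deviation is an extra safeguard against division by zero on very short episodes; the training loop uses the Kokoyi module):

def pg_loss_reference(log_probs, rewards, gamma=0.99, eps=1e-8):
    """Per-step REINFORCE loss with a normalized discounted-return baseline."""
    T = rewards.shape[0]
    # Discounted returns G_t = sum_i gamma^i * r_{t+i}, accumulated backwards
    returns = torch.zeros(T, device=rewards.device)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Subtract the mean baseline and divide by the population std,
    # matching the sigma defined in the Kokoyi module
    returns = (returns - returns.mean()) / (returns.std(unbiased=False) + eps)
    return -returns * log_probs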
Note that Policy Gradient aims to maximize $J(\theta)$ (i.e. gradient ascent), so the minus sign in the \Return statement is required so that the main training loop can use stochastic gradient descent.
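As a toy check of this sign convention (the numbers are made up and this cell is not part of the training code), minimizing $-G \cdot \log \pi(a|s)$ indeed pushes up the probability of an action that received a positive return:

logits = torch.zeros(2, requires_grad=True)
probs = torch.softmax(logits, dim=0)  # a uniform policy over the two actions
loss = -1.0 * torch.log(probs[1])     # pretend action 1 received return G = 1
loss.backward()
print(logits.grad)  # the gradient for action 1 is negative, so a descent step raises its probability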
You can let Kokoyi generate the initialization scaffolding for these modules (just copy and paste, then fill in what's needed):
class pi(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Change the codes below to initialize module members.
        self.Linears = None

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.Linears

    forward = kokoyi.symbol[r"\pi"]


class PGLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Change the codes below to initialize module members.
        self.gamma = None

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.gamma

    forward = kokoyi.symbol[r"PGLoss"]
Here are the completed module definitions:
class Pi(torch.nn.Module):
    def __init__(self, input_dim, num_actions):
        super().__init__()
        self.Linears = torch.nn.ModuleList([
            kokoyi.nn.Linear(input_dim, 128),
            kokoyi.nn.Linear(128, num_actions)
        ])

    def get_parameters(self):
        return self.Linears

    forward = kokoyi.symbol[r"\pi"]


class PGLoss(torch.nn.Module):
    def __init__(self, gamma):
        super().__init__()
        self.gamma = gamma

    def get_parameters(self):
        return self.gamma

    forward = kokoyi.symbol[r'PGLoss']
Our training loop follows the standard REINFORCE procedure: roll out an episode while recording the log probabilities of the chosen actions and the rewards, then, once the episode ends, update the policy by descending the PG loss.
input_dim = env.observation_space.shape[0]
num_actions = env.action_space.n
gamma = 0.99
pi = Pi(input_dim, num_actions).to(device=device)
pgloss = PGLoss(gamma).to(device=device)
optimizer = optim.Adam(pi.parameters(), lr=0.01)
epochs = 1000
for epoch in range(epochs):
    state = env.reset()
    done = False
    log_probs = []
    rewards = []
    for t in count():
        action_probs = pi(torch.tensor(state, dtype=torch.float, device=device), batch_level=[0])
        c = Categorical(action_probs)
        # choose action based on the probability distribution
        action = c.sample()
        log_prob = c.log_prob(action)
        state, reward, done, _ = env.step(action.item())
        log_probs.append(log_prob.reshape(1))
        rewards.append(reward)
        if done:
            log_probs = torch.cat(log_probs)  # (steps,)
            rewards = torch.tensor(rewards, device=device)  # (steps,)
            loss = pgloss(log_probs, rewards, batch_level=[0, 0])
            loss = torch.sum(loss)
            # Optimize the model
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            episode_durations.append(t + 1)
            plot_durations()
            break
plt.ioff()
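If you run this as a plain Python script rather than in a notebook, you may want to keep the final duration plot on screen:

plt.show()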
After training, we can reset the environment and run a test episode to see how the learned policy behaves.
import time

state = env.reset()
env.render()
for t in count():
    env.render()
    action_probs = pi(torch.tensor(state, dtype=torch.float, device=device), batch_level=[0])
    # act greedily at test time
    action = torch.argmax(action_probs)
    state, _, done, _ = env.step(action.item())
    time.sleep(0.0001)
    if done:
        print('Duration is %d' % (t + 1))
        break
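When you are done, close the render window and release the environment (standard Gym cleanup):

env.close()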