Reinforcement Learning Part 1: Policy Gradient

We provide two Kokoyi notebooks for reinforcement learning: the first covers policy gradient and the second covers Q-learning. In both cases, we will train an agent in Kokoyi to balance a pole in the CartPole-v0 task from OpenAI Gym. In the CartPole problem, the cart is pushed to the right or left by a force of +1 or -1, and the goal is to prevent the attached pole from falling over.


At each step $t$, our agent has to decide on an action $a_t \in \{0, 1\}$ - moving the cart left or right - based on the current state $s_t$ of the environment. The state is a 4-dimensional vector, i.e. $s_t \in \mathbb{R}^{4}$, describing the cart and pole:

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0 | Cart Position | -2.4 | 2.4 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~ -41.8° | ~ 41.8° |
| 3 | Pole Velocity At Tip | -Inf | Inf |

Given the action $a_t$, the environment transitions to a new state $s_{t+1}$ and returns a reward $r_t \in \{0, +1\}$ that indicates the consequence of the action. That is, a reward of +1 is provided for every timestep that the pole remains upright, and 0 means the episode terminates, i.e. the pole tips too far or the cart moves too far away from the center. A good policy (of the agent), called $\pi_\theta$, balances the pole as long as it can. The policy tells the agent which action $a$ to take in state $s$ by outputting a conditional probability distribution over actions: $\pi(a|s; \theta) = p_{\theta}(A=a|S=s)$.

The goal of RL is to find the optimal policy $\pi^*$ that maximizes the total reward. Let's first set up the environment:
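Concretely, the setup cell might look roughly like the sketch below. This is a minimal illustration assuming the classic `gym` API (where `reset()` returns the observation directly); the notebook's actual cell may differ.

```python
import gym

# Create the CartPole-v0 environment used throughout this notebook.
env = gym.make("CartPole-v0")

state = env.reset()            # initial 4-dimensional observation
print(env.observation_space)   # Box(4,): the observations listed in the table above
print(env.action_space)        # Discrete(2): push left (0) or push right (1)
```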

Some more utility functions (e.g. plotting) and setup.

Policy Gradient in Kokoyi

To find an optimal behavior strategy for the agent, PG (policy gradient) tries to model and optimize the policy $\pi(a|s)$ directly. Here we use a simple multilayer perceptron (MLP) as the policy $\pi(a|s)$: it takes the state $s$ as input and outputs a probability distribution over actions $a$:
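For intuition, a plain PyTorch version of such a policy network could look like the sketch below; the layer count and hidden width are illustrative assumptions, and the actual `pi` is defined in the Kokoyi cell.

```python
import torch
import torch.nn as nn

class PolicyMLP(nn.Module):
    """Illustrative MLP policy: 4-d state -> probabilities over 2 actions."""
    def __init__(self, state_dim=4, hidden_dim=64, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, s):
        # Softmax converts the logits into pi(a | s; theta).
        return torch.softmax(self.net(s), dim=-1)
```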

Policy Gradient is a classic RL method that optimizes a parameterized policy with respect to the expected return by gradient ascent. In general, the expected return for policy parameters $\theta$ is the expected discounted sum of all future rewards:

$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$
where $\gamma$ is the discount factor and $r_t$ is the reward at timestep $t$, so that the longer the pole remains upright, the greater the return. We can update the policy parameters $\theta$ according to the gradient update rule:
$\theta_{k+1} = \theta_{k} + \alpha\nabla_\theta J(\theta)|_{\theta=\theta_k}$
where $\alpha$ denotes the learning rate and $k$ the current training epoch. However, the policy gradient $\nabla_\theta J(\theta)|_{\theta=\theta_k}$ is hard to compute exactly. REINFORCE (Monte-Carlo policy gradient) provides an estimate using the likelihood ratio. Assuming we can generate one trajectory $\tau=\{s_1, a_1, s_2, a_2, \dots, s_T\}$ by following policy $\pi_\theta$ (i.e. a Monte Carlo simulation of one episode with the policy), the gradient computation simplifies a lot:
$\nabla_\theta J(\theta) \approx \sum_{t} G_t\nabla_\theta \ln \pi_\theta(a_t|s_t)$

where $G_t$ is the discounted future reward from timestep $t$, i.e. $G_t=\sum_{i=0}^{\infty}\gamma^i r_{t+i}$. You can see the proof here.
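For concreteness, the returns $G_t$ for one recorded episode can be computed with a single backward pass over the rewards. This is only an illustrative helper (the Kokoyi `PGLoss` cell handles the equivalent computation):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_i gamma^i * r_{t+i} for every timestep of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example with three steps of reward +1 and gamma = 0.9:
# G_2 = 1, G_1 = 1 + 0.9*1 = 1.9, G_0 = 1 + 0.9*1.9 = 2.71
print(discounted_returns([1, 1, 1], gamma=0.9))  # [2.71, 1.9, 1.0]
```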

(A side note: you might notice some resemblance to a classifier trained with binary cross-entropy loss, except that the reward now takes the place of the label.)

Writing the PG loss in Kokoyi is easy: let's take the log probability $\ln \pi_\theta(a_t|s_t)$ as the input $logP$. One known issue with vanilla PG is that it can suffer from high variance, so usually you subtract a baseline, which here is simply the average reward.

Note that Policy Gradient aims at maximizing $J(\theta)$ (i.e. gradient ascent), so the minus sign in the Return statement is required so that the main training loop can use stochastic gradient descent.
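In plain PyTorch, the same loss could be sketched as follows. This is a hedged illustration, not the Kokoyi `PGLoss` itself; `logP` holds $\ln\pi_\theta(a_t|s_t)$ for one episode and `returns` holds the corresponding $G_t$ values.

```python
import torch

def pg_loss(logP, returns):
    """REINFORCE loss with a mean-return baseline.

    logP:    tensor of log pi(a_t | s_t), one entry per timestep
    returns: tensor of discounted returns G_t, same length
    """
    baseline = returns.mean()          # average reward baseline to reduce variance
    advantage = returns - baseline
    # Negative sign: the optimizer minimizes, but we want to maximize J(theta).
    return -(advantage * logP).sum()
```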

You can let Kokoyi set up the initialization for the model (just copy and paste, then fill in what's needed):

Below is the default initialization code generated by Kokoyi for this model (you can use the button above to insert such a cell while at a Kokoyi cell):
```python
class pi(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Change the codes below to initialize module members.
        self.Linears = None

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.Linears

    forward = kokoyi.symbol[r"pi"]


class PGLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Change the codes below to initialize module members.
        self.gamma = None

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.gamma

    forward = kokoyi.symbol[r"PGLoss"]
```

Here are the completed module definitions:
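If the completed cell is not visible in this export, a plausible completion of the generated skeleton is sketched below; the hidden width of 64 and $\gamma = 0.99$ are assumptions, not the notebook's actual values.

```python
# Hedged sketch of how the skeleton might be filled in (values are assumptions).
class pi(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Two linear layers: 4-d state -> 64 hidden units -> 2 action logits.
        self.Linears = torch.nn.ModuleList([
            torch.nn.Linear(4, 64),
            torch.nn.Linear(64, 2),
        ])

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.Linears

    forward = kokoyi.symbol[r"pi"]


class PGLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Discount factor used when computing the returns G_t.
        self.gamma = 0.99

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.gamma

    forward = kokoyi.symbol[r"PGLoss"]
```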

Our training loop follows this pseudocode:
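A condensed REINFORCE-style loop in plain PyTorch might look like the sketch below, reusing the illustrative helpers `PolicyMLP`, `discounted_returns`, and `pg_loss` from the earlier sketches; the episode count and learning rate are assumptions, not the notebook's actual settings.

```python
policy = PolicyMLP()                       # or the Kokoyi-defined pi()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    # Roll out one episode with the current policy (Monte Carlo sampling).
    state = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # Compute returns G_t, form the PG loss, and take one gradient step.
    returns = torch.tensor(discounted_returns(rewards), dtype=torch.float32)
    loss = pg_loss(torch.stack(log_probs), returns)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```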

After training, we can reset the environment and do a test run to see the result.
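For instance, a greedy test rollout could be sketched as follows, assuming the trained `policy` and the `env` from the setup sketch above (`env.render()` requires a display):

```python
state = env.reset()
total_reward, done = 0, False
while not done:
    with torch.no_grad():
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
    action = int(probs.argmax())           # pick the most likely action
    state, reward, done, _ = env.step(action)
    total_reward += reward
    env.render()                           # visualize the balancing cart
print("Test episode reward:", total_reward)
env.close()
```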