# User Guide: Policy and algorithm development

After learning how to configure experiments and implement new environments, you may want to train your own policies for RL agents. The local version of **SRL** ships with several implemented policies and algorithms in [legacy/algorithm/](../../src/rlsrl/legacy/algorithm/). You can use them directly in your trainer workers and policy workers, adjust their parameters, and train your own agents. However, our intention is to provide a platform on which users can innovate with their own algorithms and policies, and the **Policy** and **Trainer** APIs exist for that purpose. This section is a quick start on using these APIs to implement your own ideas.


## Policy

### Natively supported actor-critic policies
- If you are using the PPO or PPG algorithm, you can use the actor-critic policies implemented in our system ([legacy/algorithm/ppo/actor_critic_policies/actor_critic_policy.py](../../src/rlsrl/legacy/algorithm/ppo/actor_critic_policies/actor_critic_policy.py)).

Suppose you have an observation space of shape (187,), a state space of shape (623,), and 16 discrete actions. The configuration
```python
Policy(type_="actor-critic-separate",
       args=dict(obs_dim={"obs": 187},
                 state_dim={"state": 623},
                 action_dim=16,
                 **other_kwargs))
```
will suffice in most cases.

- For more information and variants such as a shared backbone, an auxiliary value head, and continuous actions, see [legacy/algorithm/ppo/actor_critic_policies/readme.md](../../src/rlsrl/legacy/algorithm/ppo/actor_critic_policies/readme.md).

- Known issue: the actor-critic policies currently share a pre-processing layer on common inputs, which defeats the purpose of separate backbones. This is a bug and will be fixed.

### Developing Your Own Policy

Developing your own policy involves the following steps:
1. Implement your policy in a subdirectory of algorithm.
2. Consider subclassing `SingleModelPytorchPolicy` from algorithm/policy. It can save you the effort of implementing device assignment, DDP initialization, checkpoint saving, and version control.
3. Implement the methods `default_policy_state`, `rollout`, and `analyze` of class `Policy`.

#### `default_policy_state(self)`

Some policies have a memory unit, such as an RNN. In the system, this memory is called policy_state. When an agent spawns in an environment, it has no memory, so `default_policy_state` provides the initial value of the policy memory.

In the system, policy_state is implemented as a `NamedArray`. For example, for a recurrent policy with a GRU unit, it could be:

```python
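import numpy as np
# `namedarray` is the system's NamedArray decorator; its import is omitted here.

num_rnn_layers, rnn_hidden_dim = 1, 64  # Illustrative values, not system defaults.

@namedarray
class MyPolicyState:
    hx: np.ndarray

default_policy_state = MyPolicyState(
    hx=np.zeros((num_rnn_layers, 1, rnn_hidden_dim), dtype=np.float32)
)  # The second dimension is for batching. Here it is set to 1 as a system convention.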
```

If your policy does not have a memory unit, `default_policy_state` can be None.
    
#### `rollout(self, requests, **kwargs)`

The `rollout` method takes a `RolloutRequest` and returns a `RolloutResult`. Both are named arrays, and both are defined in [api/policy.py](../../src/rlsrl/api/policy.py).
```python
# algorithm/policy.py
class Policy:
    ...
    def rollout(self, requests: RolloutRequest, **kwargs) -> RolloutResult:
        """ Generate actions (and rnn hidden states) during evaluation.
        Args:
            requests: All requests received from actors, generated by env.step.
        Returns:
            RolloutResult: Rollout results to be distributed (namedarray).
        """
        raise NotImplementedError()
    ...
```

_The argument_ `requests` will provide sufficient information for you to run model inference, including:

- `obs`, as your environment implements.
- `policy_state`, as you just specified in this policy.
- `is_evaluation`, which specifies whether the action is sampled deterministically or stochastically.

Please be aware that these values are batched along the first dimension.
They come from different agents in different environments.

[comment]: <> (- `on_reset`, whether this is the first step of a trajectory. )

A **rollout** method normally consists of the following steps:

1. Move `obs` and `policy_state` to GPU tensors.
2. Run the model forward pass.
3. Sample actions from the output of step 2.
4. If necessary, rearrange the deterministic and stochastic actions.
5. Assemble the `RolloutResult` and return it.

In most cases, you have to specify `action` and `policy_state` in the RolloutResult.
This policy_state is the memory state after the current forward pass.
The agents, on the other end of the system, pass their policy_state back to the policy the next time they request an action.

Algorithm-specific values can also be added. For example, for PPO you should add `log_prob` and possibly the values as well.
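
The five rollout steps and the result fields described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: `ToyRolloutRequest`, `ToyRolloutResult`, and `ToyRecurrentPolicy` are hypothetical stand-ins, NumPy replaces PyTorch to keep the example dependency-free, and a nonzero `is_evaluation` is assumed to mean "act deterministically".

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyRolloutRequest:
    """Hypothetical stand-in for RolloutRequest (the real one is a named array)."""
    obs: np.ndarray            # [batch, obs_dim]
    policy_state: np.ndarray   # [batch, hidden_dim]
    is_evaluation: np.ndarray  # [batch, 1]; nonzero is assumed to mean deterministic


@dataclass
class ToyRolloutResult:
    """Hypothetical stand-in for RolloutResult."""
    action: np.ndarray         # [batch, 1]
    log_prob: np.ndarray       # [batch, 1]; PPO-specific extra value
    policy_state: np.ndarray   # [batch, hidden_dim]; memory after this step


class ToyRecurrentPolicy:
    """A tanh recurrent cell with a softmax action head, written in NumPy."""

    def __init__(self, obs_dim, hidden_dim, action_dim, seed=0):
        self.rng = np.random.default_rng(seed)
        self.w_obs = self.rng.normal(scale=0.1, size=(obs_dim, hidden_dim))
        self.w_hx = self.rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.w_out = self.rng.normal(scale=0.1, size=(hidden_dim, action_dim))

    def rollout(self, requests):
        # 1. Convert inputs (a PyTorch policy would move them to its device here).
        obs = np.asarray(requests.obs, dtype=np.float64)
        hx = np.asarray(requests.policy_state, dtype=np.float64)
        # 2. Forward pass: recurrent cell, then a numerically stable softmax.
        new_hx = np.tanh(obs @ self.w_obs + hx @ self.w_hx)
        logits = new_hx @ self.w_out
        exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs = exp / exp.sum(axis=-1, keepdims=True)
        # 3. Sample stochastic actions; also compute deterministic (argmax) ones.
        stochastic = np.array([self.rng.choice(len(p), p=p) for p in probs])
        deterministic = probs.argmax(axis=-1)
        # 4. Choose per request according to is_evaluation.
        is_eval = np.asarray(requests.is_evaluation).reshape(-1).astype(bool)
        action = np.where(is_eval, deterministic, stochastic)
        log_prob = np.log(probs[np.arange(len(action)), action])
        # 5. Assemble the result, including the updated memory.
        return ToyRolloutResult(action=action[:, None],
                                log_prob=log_prob[:, None],
                                policy_state=new_hx)


policy = ToyRecurrentPolicy(obs_dim=4, hidden_dim=8, action_dim=3)
requests = ToyRolloutRequest(obs=np.ones((5, 4)),
                             policy_state=np.zeros((5, 8)),
                             is_evaluation=np.array([[1], [1], [0], [0], [0]]))
result = policy.rollout(requests)
print(result.action.shape, result.policy_state.shape)
```

The returned `policy_state` is what an agent would send back with its next request, so the memory travels with the agent rather than living inside the policy object.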

NOTE: The data structures of `RolloutRequest` and `RolloutResult` are subject to change in the next few versions.
Algorithm-specific values will be moved into a sub-NamedArray.


#### `analyze(self, sample, target, **kwargs)`
The `analyze` method is closely tied to your training algorithm. Its input is a sample batch,
and its return value is determined by the training algorithm. For example, if your training algorithm is PPO,
the analyzed result should contain at least `new_logprob`, `old_logprob`, and `entropy`. Trainers formalize this
by specifying a dataclass named `AnalyzedResult`.


Some rules of thumb:
- If you are **only customizing a policy**, check the AnalyzedResult specified by the trainer, and make sure the
return value of your `analyze` method matches the trainer's requirements.
- If you are **customizing both trainer and policy**, implement your trainer first. Pretend that the policy returns
exactly the values you need, formalize an AnalyzedResult, and then implement the `analyze` method of your policy.
- If your **customized trainer works with an existing policy**, implement your trainer first, and then
try tweaking the `analyze` method of the policy. Note that the `target` argument allows your trainer to tell
the policy which analyze method to execute.

Unlike RolloutResult, AnalyzedResult is a plain Python dataclass. Its fields can be of any data type,
and their dimensions do not have to match, as long as your trainer can properly update the model parameters.
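
As an illustration of the points above, the sketch below shows an `analyze` method that returns a PPO-style dataclass and uses `target` to dispatch between two analyses. All names (`ToyPPOAnalyzedResult`, `ToyValueAnalyzedResult`, `ToyPolicy`, the `"ppo"` and `"value"` target strings, and the dictionary-shaped sample) are assumptions made for this example; the real `AnalyzedResult` is whatever your trainer specifies.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyPPOAnalyzedResult:
    """Hypothetical AnalyzedResult that a PPO-style trainer might specify."""
    old_logprob: np.ndarray  # [batch]; recorded at rollout time
    new_logprob: np.ndarray  # [batch]; recomputed with current parameters
    entropy: np.ndarray      # [batch]


@dataclass
class ToyValueAnalyzedResult:
    """Hypothetical result for a trainer that only needs value estimates."""
    value: np.ndarray        # [batch]


class ToyPolicy:
    def __init__(self, obs_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_pi = rng.normal(scale=0.1, size=(obs_dim, action_dim))
        self.w_v = rng.normal(scale=0.1, size=obs_dim)

    def analyze(self, sample, target="ppo", **kwargs):
        # `target` lets the trainer choose which analysis to run.
        if target == "ppo":
            return self._ppo_analyze(sample)
        if target == "value":
            return ToyValueAnalyzedResult(value=sample["obs"] @ self.w_v)
        raise ValueError(f"Unknown analyze target: {target}")

    def _ppo_analyze(self, sample):
        # Log-softmax over action logits, computed in a numerically stable way.
        logits = sample["obs"] @ self.w_pi
        logits = logits - logits.max(axis=-1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        rows = np.arange(len(sample["action"]))
        return ToyPPOAnalyzedResult(
            old_logprob=sample["log_prob"],
            new_logprob=log_probs[rows, sample["action"]],
            entropy=-(np.exp(log_probs) * log_probs).sum(axis=-1),
        )


sample = {
    "obs": np.ones((6, 4)),
    "action": np.array([0, 1, 2, 0, 1, 2]),
    "log_prob": np.full(6, np.log(1.0 / 3.0)),
}
policy = ToyPolicy(obs_dim=4, action_dim=3)
ppo_result = policy.analyze(sample, target="ppo")
value_result = policy.analyze(sample, target="value")
```

Because the two result classes are ordinary dataclasses, their fields need not share shapes or types; each trainer only reads the fields it asked for.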

#### Other utility methods

In some cases, e.g. if your model contains more than a simple PyTorch neural network,
you need to implement `get_checkpoint` and `load_checkpoint`. Note that the return value of `get_checkpoint`
must be something PyTorch can save (e.g. with `torch.save`).
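
A minimal sketch of such a pair of methods is shown below, assuming a policy that keeps a running observation mean outside its network. `ToyNormalizedPolicy`, its fields, and the checkpoint keys are hypothetical, and `pickle` stands in for `torch.save` and `torch.load` so that the example runs without PyTorch.

```python
import pickle

import numpy as np


class ToyNormalizedPolicy:
    """A linear model plus a running observation mean kept outside the network."""

    def __init__(self, obs_dim):
        self.weights = np.zeros(obs_dim)
        self.obs_mean = np.zeros(obs_dim)
        self.obs_count = 0
        self.version = 0  # Assumed counter of parameter updates.

    def update_normalizer(self, obs_batch):
        # Incrementally fold a new batch into the running mean.
        total = self.obs_count + len(obs_batch)
        self.obs_mean = (self.obs_mean * self.obs_count + obs_batch.sum(axis=0)) / total
        self.obs_count = total

    def get_checkpoint(self):
        # Return only plain, serializable objects: arrays and numbers.
        return {
            "weights": self.weights.copy(),
            "obs_mean": self.obs_mean.copy(),
            "obs_count": self.obs_count,
            "version": self.version,
        }

    def load_checkpoint(self, checkpoint):
        # Restore every piece of state that get_checkpoint saved.
        self.weights = checkpoint["weights"].copy()
        self.obs_mean = checkpoint["obs_mean"].copy()
        self.obs_count = checkpoint["obs_count"]
        self.version = checkpoint["version"]


source = ToyNormalizedPolicy(obs_dim=3)
source.weights += 1.5
source.update_normalizer(np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))
source.version = 7

# pickle stands in for torch.save / torch.load in this sketch.
blob = pickle.dumps(source.get_checkpoint())
restored = ToyNormalizedPolicy(obs_dim=3)
restored.load_checkpoint(pickle.loads(blob))
```

The point of the round trip is that everything the policy needs to resume, including state that is not a network parameter, goes through the checkpoint.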

## Trainer

When a trainer is initialized, a policy must be specified.
If this is confusing, think of trainers as optimizers and policies as neural networks.
If it is still confusing, read [Trainer and Policy](#trainer-and-policy-optional-reading).

### Step

Let the code speak for itself.
```python
# algorithm/trainer.py
class Trainer:

    @property
    def policy(self) -> api.policy.Policy:
        """Running policy of the trainer.
        """
        raise NotImplementedError()

    def step(self, samples: SampleBatch) -> TrainerStepResult:
        """Advances one training step given samples collected by actor workers.

        Example code:
          ...
          some_data = self.policy.analyze(samples)
          loss = loss_fn(some_data, samples)
          self.optimizer.zero_grad()
          loss.backward()
          ...
          self.optimizer.step()
          ...

        Args:
            samples (SampleBatch): A batch of data required for training.

        Returns:
            TrainerStepResult: Entry to be logged by trainer worker.
        """
        raise NotImplementedError()

```
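
To make the `step` contract concrete, here is a dependency-free toy that follows the same analyze, loss, update sequence as the docstring example. It is only a sketch: `ToyLinearPolicy`, `ToyTrainer`, and `ToyStepResult` are hypothetical, the fields of the real `TrainerStepResult` are not assumed, and the gradient is derived by hand instead of calling `loss.backward()` and an optimizer.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyStepResult:
    """Hypothetical stand-in for TrainerStepResult."""
    stats: dict
    step: int


class ToyLinearPolicy:
    def __init__(self, obs_dim):
        self.weights = np.zeros(obs_dim)

    def analyze(self, samples, **kwargs):
        # Return the predictions the trainer needs to compute its loss.
        return samples["obs"] @ self.weights


class ToyTrainer:
    def __init__(self, policy, lr=0.1):
        self._policy = policy
        self.lr = lr
        self.steps = 0

    @property
    def policy(self):
        return self._policy

    def step(self, samples):
        prediction = self.policy.analyze(samples)
        error = prediction - samples["target"]
        loss = float(np.mean(error ** 2))
        # Gradient of the mean squared error with respect to the weights.
        grad = 2.0 * samples["obs"].T @ error / len(error)
        self.policy.weights = self.policy.weights - self.lr * grad
        self.steps += 1
        return ToyStepResult(stats={"loss": loss}, step=self.steps)


rng = np.random.default_rng(0)
obs = rng.normal(size=(64, 3))
true_weights = np.array([1.0, -2.0, 0.5])
samples = {"obs": obs, "target": obs @ true_weights}

trainer = ToyTrainer(ToyLinearPolicy(obs_dim=3), lr=0.1)
results = [trainer.step(samples) for _ in range(200)]
print(results[0].stats["loss"], results[-1].stats["loss"])
```

Note that the trainer never computes predictions itself; it only consumes what `analyze` returns, which is the division of labor this guide describes.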

## Trainer and Policy (Optional Reading)

It is natural to think of the "neural network" and the "optimizer" as a whole, especially in local projects: we compute actions with a neural network (NN) and optimize that same NN directly. In a distributed system, however, the trainer and the policy must be decoupled, because the policy cannot stop serving agents while the trainer is updating the parameters.

To summarize the relation between trainer and policy in our system:

- Agents make decisions through policies.
- Trainers optimize policies.

The methods `rollout` and `analyze` then arise naturally for policies. As a metaphor, they are like humans making decisions (`rollout`) and, at the end of the day, reflecting on what they did right or wrong (`analyze`). Note that `analyze` is expected to tell us what was good or bad, but it is the trainer that decides what to do the next day. Trainers, seeing the rights and wrongs, behave more like a methodology or philosophy:
"Do more rights and fewer wrongs".

## Registering

Before you can use your own **Policy** and **Trainer** classes, you must register them. Register a policy with `rlsrl.api.policy.register(name, policy_class)` and a trainer with `rlsrl.api.trainer.register(name, trainer_class)`. You should also pass your trainer and policy files in the `--import_files` argument when running commands.
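
Registering under a name presumably lets an experiment configuration refer to a class by a string such as `type_`. The sketch below illustrates that pattern with a small stand-in registry. `_POLICY_CLASSES`, `make_policy`, and `MyPolicy` are hypothetical, the local `register` only mimics the `(name, policy_class)` signature given above, and the real `rlsrl.api.policy.register` may behave differently.

```python
# Hypothetical stand-in registry; not the library's implementation.
_POLICY_CLASSES = {}


def register(name, policy_class):
    """Map a string name to a policy class, refusing duplicate names."""
    if name in _POLICY_CLASSES:
        raise KeyError(f"Policy '{name}' is already registered.")
    _POLICY_CLASSES[name] = policy_class


def make_policy(type_, args=None):
    """Hypothetical helper: build a registered policy from its name and kwargs."""
    return _POLICY_CLASSES[type_](**(args or {}))


class MyPolicy:
    """Placeholder policy class used only to demonstrate registration."""

    def __init__(self, obs_dim, action_dim):
        self.obs_dim = obs_dim
        self.action_dim = action_dim


register("my-policy", MyPolicy)
policy = make_policy("my-policy", args=dict(obs_dim=187, action_dim=16))
```

In a real project you would call `rlsrl.api.policy.register("my-policy", MyPolicy)` in the file that defines the class, and list that file in `--import_files` so the registration runs.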


# Related References
- [System Components: Policy Worker](../03_policy_worker.md)
- [System Components: Trainer Worker](../04_trainer_worker.md)

# Related Files and Directories
- [api/policy.py](../../src/rlsrl/api/policy.py)
- [api/trainer.py](../../src/rlsrl/api/trainer.py)

# What's Next
- [System Components: Overview](../00_system_overview.md)