# Transformer PPO

An adapted implementation of PPO-TrXL for StochNASim, with hyperparameter tuning.

This implementation is based on the [CleanRL](https://docs.cleanrl.dev/rl-algorithms/ppo-trxl/) version of PPO-TrXL. PPO-TrXL was created and implemented by Marco Pleines et al. for their paper [Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents](https://arxiv.org/abs/2309.17207).


### Installation

Install the dependencies with pip:
```
pip install -r requirements-memory-gym.txt
```

### Example Usage

Normal training:
```
python ppo_trxl.py \
        --exp-name hpo_eval \
        --env-id GenPO-v0 \
        --num-envs 8 \
        --num-steps 768 \
        --total-timesteps 10050000 \
        --num-evals 20 \
        --eval-freq 500000 \
        --num-eval-envs 8 \
        --num-eval-episodes 100 \
        --anneal-steps 4020000 \
        --clip-coef 0.1 \
        --init-ent-coef 0.0001 \
        --update-epochs 4 \
        --gae-lambda 0.95 \
        --gamma 0.995 \
        --init-lr 0.0002 \
        --max-grad-norm 0.5 \
        --num-minibatches 4 \
        --trxl-memory-length 512 \
        --trxl-num-heads 1 \
        --trxl-positional-encoding "" \
        --trxl-dim 256 \
        --trxl-num-layers 4 \
        --vf-coef 0.3 \
        --cuda \
        --save-model \
        --seed 2
```
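
To get a feel for what these flags imply, the sketch below works out the rollout sizes for the command above. It assumes the script derives them the way CleanRL's PPO scripts usually do (batch size = environments × steps, split evenly into minibatches); the adapted `ppo_trxl.py` may compute them differently, so treat the numbers as illustrative.

```python
# Assumption: sizes follow the usual CleanRL convention; check ppo_trxl.py
# before relying on them.
num_envs = 8                  # --num-envs
num_steps = 768               # --num-steps
num_minibatches = 4           # --num-minibatches
total_timesteps = 10_050_000  # --total-timesteps

# Transitions collected per policy update.
batch_size = num_envs * num_steps
# Transitions per gradient step within one epoch.
minibatch_size = batch_size // num_minibatches
# Number of collect-then-update iterations that fit in the budget.
num_iterations = total_timesteps // batch_size

print(batch_size, minibatch_size, num_iterations)  # 6144 1536 1635
```

Under this assumption, changing `--num-envs` or `--num-steps` changes the batch size and therefore the number of updates for a fixed `--total-timesteps`.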

Hyperparameter tuning:
```
python hyperparams_search.py \
        --env-id StochPO-v0 \
        --num-envs 8 \
        --num-steps 768 \
        --total-timesteps 5000000 \
        --db-url <place URL to Optuna database here> \
        --trials 75 \
        --max-total-trials 250 \
        --study-name ppo_trxl_genpo \
        --pruner-warmup-steps 1900000 \
        --num-evals 5 \
        --num-eval-envs 8 \
        --num-eval-episodes 100 \
        --anneal-steps 4020000
```
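
Optuna's relational storage takes SQLAlchemy-style database URLs. As a minimal sketch, assuming `--db-url` is passed through to Optuna unchanged (not confirmed here), a local SQLite file could be addressed as shown below. The helper `sqlite_url` and the file names are hypothetical and exist only for illustration.

```python
from pathlib import Path


def sqlite_url(db_path: str) -> str:
    """Build a SQLAlchemy-style SQLite URL (hypothetical helper).

    Relative paths give three slashes, absolute POSIX paths give four.
    """
    return f"sqlite:///{Path(db_path).as_posix()}"


# Relative path: the database file is created in the working directory.
print(sqlite_url("optuna_studies.db"))  # sqlite:///optuna_studies.db
```

SQLite is convenient for a single machine; when several workers share one study, a server-backed database is usually the safer choice.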
