# ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL

## Overview

We propose **ELMUR** (External Layer Memory with Update/Rewrite for Long-Horizon RL), a novel memory-augmented transformer architecture designed for long-horizon reinforcement learning tasks. Our code is integrated into the RATE framework for offline RL, and the ELMUR model implementation can be found in `ELMUR/offline_rl_baselines/ELMUR/model.py`.


## Requirements

```bash
# Install main dependencies
pip install -e .

# Install additional dependencies for POPGym (download the datasets from the RATE paper)
pip install -r requirements/requirements_popgym.txt

# Install additional dependencies for MIKASA-Robo (download the datasets from the MIKASA-Robo paper)
pip install mikasa_robo_suite
```


## Training Commands

### T-Maze Environment

```bash
python src/train.py \
    --data.gamma=1 \
    --data.max-length=None \
    --data.path-to-dataset=None \
    --dtype=float32 \
    --end-seed=1 \
    --max-n-final=3 \
    --min-n-final=1 \
    --model-mode=ELMUR \
    --model.act-dim=4 \
    --model.d-ff=128 \
    --model.d-model=128 \
    --model.detach-memory=True \
    --model.dropatt=0.71 \
    --model.dropout=0.10 \
    --model.env-name=tmaze \
    --model.label-smoothing=0.16 \
    --model.load-balancing-loss-coef=0.1 \
    --model.lru-blend-alpha=0.05 \
    --model.max-seq-len=1024 \
    --model.memory-dropout=0.01 \
    --model.memory-init-std=0.001 \
    --model.memory-size=2 \
    --model.n-head=2 \
    --model.n-layer=2 \
    --model.n-shared-experts=2 \
    --model.num-experts=2 \
    --model.padding-idx=-10 \
    --model.pos-type=relative \
    --model.pre-lnorm=False \
    --model.routed-d-ff=32 \
    --model.sequence-format=s \
    --model.shared-d-ff=512 \
    --model.state-dim=4 \
    --model.top-k=3 \
    --model.use-causal-self-attn-mask=True \
    --model.use-lru=True \
    --model.use-moe=True \
    --model.use-shared-expert=True \
    --model.use-swiglu=False \
    --online-inference.best_checkpoint_metric=Success_rate_9600 \
    --start-seed=1 \
    --tensorboard-dir=runs/TMaze/myrepo/ELMUR/T_30 \
    --text=myrepo \
    --training.batch-size=128 \
    --training.beta-1=0.95 \
    --training.beta-2=0.999 \
    --training.ckpt-epoch=200 \
    --training.context-length=10 \
    --training.epochs=1000 \
    --training.final-tokens=10000000 \
    --training.grad-norm-clip=5 \
    --training.learning-rate=0.00021 \
    --training.log-last-segment-loss-only=True \
    --training.lr-end-factor=1 \
    --training.online-inference=True \
    --training.sections=3 \
    --training.use-cosine-decay=True \
    --training.warmup-steps=10000 \
    --training.weight-decay=0.0001 \
    --wandb.project-name=ELMUR-T-Maze \
    --wandb.wwandb=True &
```
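
The T-Maze command above combines `--training.context-length=10` with `--training.sections=3`, and its TensorBoard directory ends in `T_30`. The sketch below works through that arithmetic. It assumes that each training sequence is split into `sections` consecutive segments of `context-length` steps, which this README does not state explicitly. `effective_length` is a hypothetical helper for illustration and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): number of timesteps covered
# by one training sequence, ASSUMING it is split into `sections` consecutive
# segments of `context_length` steps each.
effective_length() {
    local context_length=$1
    local sections=$2
    echo $(( context_length * sections ))
}

effective_length 10 3   # T-Maze:      context-length=10, sections=3
effective_length 35 3   # POPGym:      context-length=35, sections=3
effective_length 20 3   # MIKASA-Robo: context-length=20, sections=3
```

Under this assumption the three results are 30, 105, and 60. They line up with `T_30` in the T-Maze TensorBoard path, `--data.max-length=105` in the POPGym command, and `--online-inference.episode-timeout=60` in the MIKASA-Robo command, so it may be worth keeping these values in sync when changing either flag.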

### POPGym Environment

```bash
python src/train.py \
    --data.gamma=1 \
    --data.max-length=105 \
    --data.path-to-dataset=data/POPGym/popgym-AutoencodeEasy-v0 \
    --dtype=float32 \
    --end-seed=1 \
    --model-mode=ELMUR \
    --model.act-dim=4 \
    --model.d-ff=128 \
    --model.d-model=64 \
    --model.detach-memory=True \
    --model.dropatt=0.25754897278876754 \
    --model.dropout=0.14 \
    --model.env-name=popgym-AutoencodeEasy \
    --model.label-smoothing=0.22 \
    --model.load-balancing-loss-coef=0.1 \
    --model.lru-blend-alpha=0.80 \
    --model.max-seq-len=1024 \
    --model.memory-dropout=0.17 \
    --model.memory-init-std=0 \
    --model.memory-size=8 \
    --model.n-head=4 \
    --model.n-layer=12 \
    --model.n-shared-experts=2 \
    --model.norm-type=rmsnorm \
    --model.num-experts=1 \
    --model.padding-idx=-10 \
    --model.pos-type=relative \
    --model.pre-lnorm=False \
    --model.routed-d-ff=128 \
    --model.sequence-format=s \
    --model.shared-d-ff=256 \
    --model.state-dim=-1 \
    --model.top-k=1 \
    --model.use-causal-self-attn-mask=True \
    --model.use-lru=True \
    --model.use-moe=True \
    --model.use-shared-expert=True \
    --model.use-swiglu=False \
    --online-inference.best_checkpoint_metric=ReturnsMean_1.0 \
    --online-inference.desired-return-1=1 \
    --online-inference.episode-timeout=1001 \
    --online-inference.use-argmax=False \
    --start-seed=1 \
    --tensorboard-dir=runs/POPGym/AutoencodeEasy-v0 \
    --text=iclr-2026 \
    --training.batch-size=128 \
    --training.beta-1=0.99 \
    --training.beta-2=0.99 \
    --training.ckpt-epoch=50 \
    --training.context-length=35 \
    --training.epochs=800 \
    --training.final-tokens=10000000 \
    --training.grad-norm-clip=5 \
    --training.learning-rate=0.00012 \
    --training.log-last-segment-loss-only=False \
    --training.lr-end-factor=0.01 \
    --training.online-inference=True \
    --training.sections=3 \
    --training.use-cosine-decay=False \
    --training.warmup-steps=50000 \
    --training.weight-decay=0.1 \
    --wandb.project-name=ELMUR-POPGym \
    --wandb.wwandb=True &
```

### MIKASA-Robo Environment

```bash
python3 src/train.py \
    --data.gamma=1 \
    --data.path-to-dataset=data/data_mikasa_robo/MIKASA-Robo/unbatched/RememberColor3-v0 \
    --dtype=float32 \
    --end-seed=1 \
    --model-mode=ELMUR \
    --model.act-dim=8 \
    --model.d-ff=128 \
    --model.d-model=128 \
    --model.detach-memory=True \
    --model.dropatt=0.30 \
    --model.dropout=0.1266135150715325 \
    --model.env-name=mikasa_robo_RememberColor3-v0 \
    --model.label-smoothing=0.21 \
    --model.load-balancing-loss-coef=0.1 \
    --model.lru-blend-alpha=0.41 \
    --model.max-seq-len=1024 \
    --model.memory-dropout=0.055 \
    --model.memory-init-std=0.1 \
    --model.memory-size=256 \
    --model.n-head=16 \
    --model.n-layer=4 \
    --model.n-shared-experts=1 \
    --model.num-experts=16 \
    --model.padding-idx=None \
    --model.pos-type=relative \
    --model.pre-lnorm=False \
    --model.routed-d-ff=128 \
    --model.sequence-format=s \
    --model.shared-d-ff=128 \
    --model.state-dim=6 \
    --model.top-k=2 \
    --model.use-causal-self-attn-mask=True \
    --model.use-lru=True \
    --model.use-moe=True \
    --model.use-shared-expert=True \
    --model.use-swiglu=False \
    --online-inference.best_checkpoint_metric=success_once \
    --online-inference.desired-return-1=60 \
    --online-inference.episode-timeout=60 \
    --online-inference.use-argmax=True \
    --start-seed=1 \
    --tensorboard-dir=runs/MIKASA_Robo/RememberColor3-v0 \
    --text=myrepo-v2 \
    --training.batch-size=64 \
    --training.beta-1=0.99 \
    --training.beta-2=0.99 \
    --training.ckpt-epoch=20 \
    --training.context-length=20 \
    --training.epochs=200 \
    --training.final-tokens=10000000 \
    --training.grad-norm-clip=5 \
    --training.learning-rate=0.00021 \
    --training.log-last-segment-loss-only=False \
    --training.lr-end-factor=0.1 \
    --training.online-inference=True \
    --training.sections=3 \
    --training.use-cosine-decay=True \
    --training.warmup-steps=30000 \
    --training.weight-decay=0.001 \
    --wandb.project-name=ELMUR-MIKASA-Robo \
    --wandb.wwandb=True
```