# AR-A2C (Average-Reward A2C, Anchored) — Pendulum-v1

A minimal implementation of average-reward A2C with RVI-style anchoring, which pins the differential value at a reference state: $h(s_\text{ref})=0$. It trains on Gymnasium's `Pendulum-v1`, logs metrics (optionally to W&B), exports value heatmaps, and records a test video.
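For orientation, the average-reward (differential) TD error replaces discounting with a running estimate of the average reward $\rho$. The snippet below is a tabular sketch of that idea, not the code in `agent.py`: the function name `differential_td_step`, the list-based value table, and the step sizes are all assumptions made for illustration.

```python
def differential_td_step(h, rho, s, r, s_next, value_lr=0.1, rho_lr=0.01):
    """One tabular average-reward TD(0) step (illustrative only).

    h    : list of differential value estimates, indexed by state
    rho  : current estimate of the average reward per step
    """
    # Differential TD error: no discount factor, rho is subtracted instead.
    delta = r - rho + h[s_next] - h[s]
    # Move the value of the visited state toward its bootstrapped target.
    h[s] += value_lr * delta
    # Track the average reward with the same TD error.
    rho += rho_lr * delta
    return delta, rho


# Two-state toy example (hypothetical numbers).
h = [0.0, 0.0]
delta, rho = differential_td_step(h, rho=0.0, s=0, r=1.0, s_next=1)
```

In the actual agent the table would be replaced by the critic network and the step sizes by `--critic_lr` and `--rho_lr`; how exactly those are wired up is not shown here.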

## Files

* `utils.py` – seed & init helpers
* `networks.py` – `Actor`, `Critic_r`, residual block
* `agent.py` – train/test loops, updates, heatmap
* `train.py` – CLI entry

## Install

```bash
pip install torch gymnasium "gymnasium[classic-control]" numpy matplotlib
# optional
pip install wandb
```

## Run

```bash
python train.py --num_frames 3000000 --seed 797 \
  --actor_lr 1e-5 --critic_lr 7e-5 --rho_lr 1e-4 \
  --heatmap_every 50000 --log_every 100 --anchor_every 1 \
  --wandb_project ""
```
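The flags above could be declared roughly as follows. This is a sketch only: the flag names come from the command above, but the types, the defaults (copied from that command), and the `build_parser` helper are assumptions and may not match `train.py`.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags in the Run command;
    # types and defaults are assumptions, not taken from train.py.
    p = argparse.ArgumentParser(description="AR-A2C on Pendulum-v1 (sketch)")
    p.add_argument("--num_frames", type=int, default=3_000_000)
    p.add_argument("--seed", type=int, default=797)
    p.add_argument("--actor_lr", type=float, default=1e-5)
    p.add_argument("--critic_lr", type=float, default=7e-5)
    p.add_argument("--rho_lr", type=float, default=1e-4)
    p.add_argument("--heatmap_every", type=int, default=50_000)
    p.add_argument("--log_every", type=int, default=100)
    p.add_argument("--anchor_every", type=int, default=1)
    p.add_argument("--wandb_project", type=str, default="")
    return p


# An empty project name is treated as "W&B disabled" in this sketch.
args = build_parser().parse_args(["--wandb_project", ""])
use_wandb = bool(args.wandb_project)
```

Gating on the truthiness of the project name keeps W&B an optional dependency: `wandb` only needs to be imported when `use_wandb` is true.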

## Outputs

* `heatmaps_ar/` – V(s) heatmaps
* `videos/a2c_avg_reward_test/` – recorded test episode

## Notes

* Disable W&B by setting `--wandb_project ""`.
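* In the average-reward setting the value function is determined only up to an additive constant, so the critic's output can drift without anchoring. Keep `--anchor_every >= 1` so the anchor $h(s_\text{ref})=0$ stays enforced.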
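To see why anchoring is harmless, note that shifting every value by the same constant leaves all value differences, and therefore all differential TD errors, unchanged. The sketch below shows the tabular version of the idea; the `anchor` helper and the numbers are hypothetical, and the repository may enforce the constraint differently on the critic network (for example by subtracting the critic's output at the reference state).

```python
def anchor(h, s_ref=0):
    """Shift a table of differential values so that h[s_ref] == 0."""
    offset = h[s_ref]
    return [v - offset for v in h]


# Hypothetical values that have drifted away from zero.
h = [2.5, 3.0, 1.0]
anchored = anchor(h, s_ref=0)

# The reference state is pinned to zero, and pairwise differences
# (which are all the TD error depends on) are preserved.
same_gap = (anchored[1] - anchored[2]) == (h[1] - h[2])
```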