Hugging Face Video Encoder Integration
--------------------------------------

This repo can optionally use a pretrained Hugging Face image/video encoder as the feature extractor in `sac_procgen.py`.

Install
- Ensure `transformers` is installed. It is listed in `requirements.txt`, but you can also install it directly:
  - `pip install transformers`

Usage
- Enable the encoder by passing a Hugging Face model ID on the command line:
  - Example (image encoder):
    - `python sac_procgen.py --env-id coinrun --hf-video-encoder-id google/vit-base-patch16-224 --cuda`
  - Example (video encoder):
    - `python sac_procgen.py --env-id coinrun --hf-video-encoder-id facebook/timesformer-base-finetuned-k400 --cuda`
- Optional flags:
  - `--hf-freeze True` (default): keep the HF weights frozen (no fine-tuning)
  - `--hf-use-cls True` (default): use the CLS token when available; otherwise mean-pool over tokens
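
As a rough illustration of how flags of this shape can be parsed, here is a minimal sketch using `argparse`. It is an assumption for illustration only: `sac_procgen.py` may use a different argument parser, and the helper names `str2bool` and `build_parser` are hypothetical.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse explicit boolean values such as `--hf-freeze True`."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror this README; defaults follow the "(default)" notes above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--env-id", type=str, default="coinrun")
    parser.add_argument("--hf-video-encoder-id", type=str, default=None)
    parser.add_argument("--hf-freeze", type=str2bool, default=True)
    parser.add_argument("--hf-use-cls", type=str2bool, default=True)
    parser.add_argument("--cuda", action="store_true")
    return parser


# Parse an example command line: enable a ViT encoder and turn freezing off.
args = build_parser().parse_args(
    ["--hf-video-encoder-id", "google/vit-base-patch16-224", "--hf-freeze", "False"]
)
```

With this sketch, leaving out `--hf-video-encoder-id` yields `None`, which a script could treat as "use the default feature extractor".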

Notes
- The encoder processes a single frame per step. For video backbones, the frame is passed as a single-frame clip.
- A large backbone may slow training. Freezing it avoids the extra compute of backpropagating through the encoder and helps keep SAC stable, because the actor and critics do not share gradients through the encoder.
- If your chosen model requires additional dependencies (e.g., `decord` or `av`), prefer an image backbone such as ViT.
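
The freezing and pooling behavior described in this document can be sketched roughly as follows. This is not the code in `sac_procgen.py`: the helpers `freeze` and `pool_tokens` are hypothetical names, and the example builds a tiny, randomly initialized ViT from a config (an assumption made so that nothing needs to be downloaded) instead of loading a pretrained checkpoint.

```python
import torch
from transformers import ViTConfig, ViTModel


def freeze(module: torch.nn.Module) -> torch.nn.Module:
    """Disable gradients and switch to eval mode so the weights stay fixed."""
    for param in module.parameters():
        param.requires_grad_(False)
    return module.eval()


def pool_tokens(hidden: torch.Tensor, use_cls: bool = True) -> torch.Tensor:
    """Reduce (batch, tokens, dim) to (batch, dim).

    For ViT-style models the CLS token is at index 0. The mean-pool fallback
    here averages over all tokens, including CLS when one exists.
    """
    return hidden[:, 0] if use_cls else hidden.mean(dim=1)


# Tiny random-weight ViT, used only to keep the sketch self-contained.
config = ViTConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    image_size=32,
    patch_size=16,
)
encoder = freeze(ViTModel(config))

# A batch of 4 single RGB frames, shape (batch, channels, height, width).
frames = torch.rand(4, 3, 32, 32)
with torch.no_grad():
    hidden = encoder(pixel_values=frames).last_hidden_state
features = pool_tokens(hidden, use_cls=True)

# A video backbone would instead receive a single-frame clip with an added
# time axis: (batch, num_frames=1, channels, height, width).
clip = frames.unsqueeze(1)
```

Because the encoder is frozen and called under `torch.no_grad()`, `features` carries no gradient history, so the actor and critic losses cannot update the backbone.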

