# Pre-trained Visual Representations

**Robomimic** supports multiple pre-trained visual representations and offers integration for adapting observation encoders to a desired pre-trained visual representation encoder.

## Terminology

First, let's clarify the terminology used when working with different pre-trained visual representations:

- **Backbone Classes** refer to the various pre-trained visual encoders. For instance, `R3MConv` and `MVPConv` are the backbone classes for using [R3M](https://arxiv.org/abs/2203.12601) and [MVP](https://arxiv.org/abs/2203.06173) pre-trained representations, respectively.
- **Model Classes** refer to the different sizes of the pre-trained model within each selected backbone class. For example, `R3MConv` has three model classes - `resnet18`, `resnet34`, and `resnet50` - while `MVPConv` features five model classes - `vits-mae-hoi`, `vits-mae-in`, `vits-sup-in`, `vitb-mae-egosoup`, and `vitl-256-mae-egosoup`.
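
For reference, the backbone classes correspond to encoder modules in `robomimic.models.base_nets`, and the model class is passed as a keyword argument when the backbone is constructed. Below is a minimal sketch, assuming this module layout and constructor signature (check your installed version of robomimic if it differs; constructing the module also requires the `r3m` package):

```python
# A rough sketch of how a backbone class and a model class pair up.
# Assumes the R3MConv constructor signature in robomimic.models.base_nets.
from robomimic.models.base_nets import R3MConv

# backbone class = R3MConv, model class = resnet18
encoder = R3MConv(
    input_channel=3,             # RGB input
    r3m_model_class="resnet18",  # one of: resnet18, resnet34, resnet50
    freeze=True,                 # keep the pre-trained weights fixed
)
```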

## Examples

Using pre-trained visual representations is simple. Each pre-trained encoder is specified by its `backbone_class`, its `model_class`, and whether to `freeze` the representation or finetune it. Note that you may need to refer to the original library of each pre-trained representation for installation instructions.
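
As a quick way to confirm those libraries are available, you can try importing them (the `r3m` and `mvp` package names here are assumptions based on the original repositories):

```python
# Sanity-check that the pre-trained representation libraries are importable.
# Package names are assumed from the original R3M and MVP repositories.
for pkg in ("r3m", "mvp"):
    try:
        __import__(pkg)
        print(f"{pkg}: OK")
    except ImportError:
        print(f"{pkg}: not installed - see the original repository for setup instructions")
```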

If you are specifying your config with code (as in `examples/train_bc_rnn.py`), the following are example code blocks for using pre-trained representations:

```python
# R3M (use either this block or the MVP block below - they set the same config keys)
config.observation.encoder.rgb.core_kwargs.backbone_class = 'R3MConv'                         # R3M backbone for image observations (unused if no image observations)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.r3m_model_class = 'resnet18'       # R3M model class (resnet18, resnet34, resnet50)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.freeze = True                      # whether to freeze network during training or allow finetuning
config.observation.encoder.rgb.core_kwargs.pool_class = None                                  # no pooling class when using a pre-trained backbone

# MVP
config.observation.encoder.rgb.core_kwargs.backbone_class = 'MVPConv'                                   # MVP backbone for image observations (unused if no image observations)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.mvp_model_class = 'vitb-mae-egosoup'         # MVP model class (vits-mae-hoi, vits-mae-in, vits-sup-in, vitb-mae-egosoup, vitl-256-mae-egosoup)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.freeze = True                      # whether to freeze network during training or allow finetuning
config.observation.encoder.rgb.core_kwargs.pool_class = None                      # no pooling class when using a pre-trained backbone

# Set data loader attributes for image observations
config.train.num_data_workers = 2                           # 2 data workers for image datasets
config.train.hdf5_cache_mode = "low_dim"                    # only cache non-image data

# Ensure that you are using image observation modalities; names may depend on your dataset's naming convention
config.observation.modalities.obs.rgb = [
    "agentview_image",
    "robot0_eye_in_hand_image",
]
```

Alternatively, if you are using a config json, you can set the corresponding keys in your json; the json structure mirrors the attribute paths used above.
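
As a rough sketch, the R3M settings above would look like the following fragment (merge these keys into your full config json rather than using it on its own):

```json
{
    "observation": {
        "encoder": {
            "rgb": {
                "core_kwargs": {
                    "backbone_class": "R3MConv",
                    "backbone_kwargs": {
                        "r3m_model_class": "resnet18",
                        "freeze": true
                    },
                    "pool_class": null
                }
            }
        }
    }
}
```

You can then launch training as usual, e.g. with `python robomimic/scripts/train.py --config /path/to/your_config.json`.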
