# Correcting Data Distribution Mismatch in Offline Meta-Reinforcement Learning with Few-Shot Online Adaptation

> Offline meta-reinforcement learning (offline meta-RL) extracts knowledge from a given dataset of multiple tasks and achieves fast adaptation to new tasks. Recent offline meta-RL methods typically use task-dependent behavior policies (e.g., training RL agents on each individual task) to collect a multi-task dataset and learn an offline meta-policy. However, these methods always require extra information for fast adaptation, such as offline context for testing tasks or oracle reward functions. Offline meta-RL with few-shot online adaptation remains an open problem. In this paper, we first formally characterize a unique challenge under this setting: data distribution mismatch between offline training and online adaptation. This distribution mismatch may lead to unreliable offline policy evaluation and the regular adaptation methods of online meta-RL will suffer. To address this challenge, we introduce a novel mechanism of data distribution correction, which ensures the consistency between offline and online evaluation by filtering out out-of-distribution episodes in online adaptation. As few-shot out-of-distribution episodes usually has lower return, we propose a \textit{\textbf{G}reedy \textbf{C}ontext-based data distribution \textbf{C}orrection} approach, called GCC, which greedily infers how to solve new tasks. GCC diversely samples “task hypotheses” from the current posterior belief and selects a greedy hypothesis with the highest return to update the task belief. Our method is the first to provide an effective online adaptation without additional information, and can be combined with off-the-shelf context-based offline meta-training algorithms. Empirical experiments show that GCC achieves state-of-the-art performance on the Meta-World ML1 benchmark compared to baselines with/without offline adaptation. 

## Installation
To install locally, you will need to first install [MuJoCo](https://www.roboti.us/index.html). For task distributions in which the reward function varies (Cheetah, Ant), install MuJoCo150 or plus. Set `LD_LIBRARY_PATH` to point to both the MuJoCo binaries (`/$HOME/.mujoco/mujoco200/bin`) as well as the gpu drivers (something like `/usr/lib/nvidia-390`, you can find your version by running `nvidia-smi`).

For the remaining dependencies, create conda environment by
```
conda env create -f environment.yaml
```

<!-- For task distributions where the transition function (dynamics)  varies  -->

**For Walker environments**, MuJoCo131 is required.
Simply install it the same way as MuJoCo200. To swtch between different MuJoCo versions:

```
export MUJOCO_PY_MJPRO_PATH=~/.mujoco/mjpro${VERSION_NUM}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.mujoco/mjpro${VERSION_NUM}/bin
``` 

The environments make use of the module `rand_param_envs` which is submoduled in this repository. Add the module to your python path, `export PYTHONPATH=./rand_param_envs:$PYTHONPATH` (Check out [direnv](https://direnv.net/) for handy directory-dependent path managenement.)


This installation has been tested only on 64-bit CentOS 7.2. The whole pipeline consists of two stages: **data generation** and **Offline RL experiments**:

## Data Generation



Example of training policies and generating trajectories on multiple tasks:

```
# Point-Robot
CUDA_VISIBLE_DEVICES=3 python policy_train.py ./configs/sparse-point-robot.json
```


Generated data will be saved in `./data/`

## Offline RL Experiments
Experiments are configured via `json` configuration files located in `./configs`. Basic settings are defined and described in `./configs/default.py`. To reproduce an experiment, run: 
```
# Point-Robot
bash run.sh
# 
```
By default the code will use the GPU - to use CPU instead, set `use_gpu=False` in the corresponding config file.
