# Overview

This package provides the code for (i) training the VLM-H, (ii) extracting SNE similarities from the latent space of the VLM-H, (iii) reconstructing immediate human rewards (IHRs) over latent representations (RILR), and (iv) downstream OPE, which takes as input the trajectories with reconstructed IHRs and estimates human returns. We also provide the offline trajectories used for the visual Q&A environment, as well as the trajectories with reconstructed IHRs. Instructions for reproducing the results using PDIS as the downstream estimator, with our pre-trained behavioral clone, are also provided; no further training is needed in this case.

**************************************************************************************************************
ATTENTION
Given that OpenReview only allows uploading supplementary files no larger than hundreds of MBs, we could not provide the offline trajectories and target policy checkpoints in this copy. Please download the full package from https://www.dropbox.com/s/7rugicq2f18y8dd/code.tar.gz?dl=0 (19GB) in order to access the training data and checkpoints.
************************************************************************************************************

Steps (i)-(iii) share the same Python environment, while (iv) requires a separate environment as it was based on the implementation provided in https://github.com/google-research/google-research/tree/master/policy_eval. 


# Environmental Setup

## Environmental setup for steps (i)-(iii)

```
Python 3.8.12
tensorflow 1.15
tensorflow-probability 0.8.0
scikit-learn 1.0.2
numpy 1.22.2
scipy 1.5.3
tqdm 4.62.3
pandas 1.4.1
```
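
Before running steps (i)-(iii), it can help to confirm that the active environment matches the pins above. The following is a minimal sketch, not part of the package: the helper names `installed_version` and `find_mismatches` are hypothetical, and it relies on `importlib.metadata`, which is in the standard library from Python 3.8 onward.

```python
from importlib import metadata  # standard library since Python 3.8

# Pins copied from the list above for steps (i)-(iii).
EXPECTED = {
    "tensorflow": "1.15",
    "tensorflow-probability": "0.8.0",
    "scikit-learn": "1.0.2",
    "numpy": "1.22.2",
    "scipy": "1.5.3",
    "tqdm": "4.62.3",
    "pandas": "1.4.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package that does not match its pin to (pin, found).

    A pin such as "1.15" accepts "1.15" itself and any "1.15.x" release.
    """
    mismatches = {}
    for name, pin in expected.items():
        found = lookup(name)
        if found is None or not (found == pin or found.startswith(pin + ".")):
            mismatches[name] = (pin, found)
    return mismatches


if __name__ == "__main__":
    for name, (pin, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {pin}, found {found}")
```

An empty output means every pinned package was found at the expected version. The `lookup` argument exists only so the comparison logic can be exercised without touching the real environment.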

## Environmental setup for step (iv)

```
Python 3.7.11
tensorflow 2.6.0
tensorflow-probability 0.14.1
numpy 1.21.5
scipy 1.7.3
tqdm 4.62.3
pandas 1.3.4
```

# Step (i)

To train VLM-H using the provided offline trajectories, one can start with configuring the hyper-parameters in the region highlighted in the script `train_vlm_h.py`. Then simply execute `python train_vlm_h.py`. The model checkpoint will be saved under `saved_model`, which will be used in the next step.

# Step (ii)

Now we feed all the offline trajectories into the trained VLM-H and identify the K-neighbors in the latent space, to be used in the RILR step. The code for this step is provided as a Jupyter notebook, `gen_tsne_from_hf.ipynb`, because we kept in it the t-SNE visualization over the behavioral trajectories, which illustrates the clustering behavior of the latent variables/encodings; it can be viewed by simply opening the notebook, without running anything. Figure 1 (middle) in the paper is generated by feeding the trajectories collected from the target policies into the encoder of the VLM-H.
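
To make the neighbor-identification idea concrete, here is a small brute-force sketch in plain Python. It is not taken from the notebook: the function name `k_nearest_neighbors`, the list-of-lists layout of the encodings, and the choice of Euclidean distance are all assumptions made for illustration, and the notebook may use a different similarity measure.

```python
import math


def k_nearest_neighbors(encodings, k):
    """For each encoding, return the indices of its k closest other encodings.

    `encodings` is a list of equal-length float sequences (hypothetical layout).
    Brute force with O(n^2) distance evaluations; fine for a small illustration.
    """
    if not 0 < k < len(encodings):
        raise ValueError("k must satisfy 0 < k < number of encodings")
    neighbors = []
    for i, query in enumerate(encodings):
        # Euclidean distance to every other point, excluding the point itself.
        distances = [
            (math.dist(query, other), j)
            for j, other in enumerate(encodings)
            if j != i
        ]
        distances.sort()  # ties are broken by the smaller index
        neighbors.append([j for _, j in distances[:k]])
    return neighbors


# Toy 2-D "latent" points forming two well-separated groups.
latents = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.2]]
print(k_nearest_neighbors(latents, k=2))
```

For the real latent encodings, a library routine such as scikit-learn's `NearestNeighbors` (scikit-learn is already in the environment for steps (i)-(iii)) would likely be a better fit than this quadratic loop.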

# Step (iii)

To run RILR, one can start by configuring the hyper-parameters in `run_rilr.py`, followed by executing `python run_rilr.py`. Once finished, one can use `save_traj_with_IHR_by_RILR.py` to replace all the environmental rewards in the offline trajectories with the reconstructed IHRs, preparing them for use in the next step.
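
Conceptually, the relabeling step swaps each stored environmental reward for its reconstructed IHR while leaving states and actions untouched. The sketch below illustrates that idea only; the dictionary layout with a `"rewards"` key and the function name `replace_rewards` are hypothetical and do not describe the actual file format handled by `save_traj_with_IHR_by_RILR.py`.

```python
import copy


def replace_rewards(trajectories, reconstructed_ihrs):
    """Return a copy of the trajectories with rewards swapped for reconstructed IHRs.

    Hypothetical layout: each trajectory is a dict with a "rewards" list, and
    reconstructed_ihrs[i][t] is the IHR for step t of trajectory i.
    """
    if len(trajectories) != len(reconstructed_ihrs):
        raise ValueError("need one IHR sequence per trajectory")
    relabeled = []
    for traj, ihrs in zip(trajectories, reconstructed_ihrs):
        if len(traj["rewards"]) != len(ihrs):
            raise ValueError("IHR sequence length must match trajectory length")
        new_traj = copy.deepcopy(traj)  # leave the original data untouched
        new_traj["rewards"] = list(ihrs)
        relabeled.append(new_traj)
    return relabeled


# Toy example with two short trajectories.
offline = [
    {"actions": [0, 1], "rewards": [1.0, 0.0]},
    {"actions": [1], "rewards": [0.5]},
]
ihrs = [[0.2, 0.7], [0.9]]
print(replace_rewards(offline, ihrs))
```

Returning a deep copy rather than editing in place keeps the original environmental rewards available for comparison.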

# Step (iv)

To use the offline trajectories (with reconstructed IHRs) above to evaluate any downstream OPE estimator, one can run

```
python -m policy_eval.train_eval --logtostderr --d4rl --env_name=low --d4rl_policy_filename=<path_to_target_policy> --target_policy_std=0.0 --num_mc_episodes=1 --bootstrap=False --algo=<iw/fqe/dr/dual_dice> --noise_scale=0.0 --num_updates=<training_steps> --seed=<seed> --normalize_states=False --normalize_rewards=True
```
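
When comparing several estimators and seeds, the command above can be generated programmatically instead of edited by hand. The following sketch only assembles the flags already shown; the helper name `build_command`, the seed list, and the value `num_updates=1000` are placeholders chosen for illustration, not recommended settings.

```python
import itertools
import shlex


def build_command(algo, seed, policy_path, num_updates):
    """Assemble the argv list for one run, using only the flags shown above."""
    return [
        "python", "-m", "policy_eval.train_eval",
        "--logtostderr",
        "--d4rl",
        "--env_name=low",
        f"--d4rl_policy_filename={policy_path}",
        "--target_policy_std=0.0",
        "--num_mc_episodes=1",
        "--bootstrap=False",
        f"--algo={algo}",
        "--noise_scale=0.0",
        f"--num_updates={num_updates}",
        f"--seed={seed}",
        "--normalize_states=False",
        "--normalize_rewards=True",
    ]


ALGOS = ["iw", "fqe", "dr", "dual_dice"]
SEEDS = [0, 1, 2]  # hypothetical seeds

for algo, seed in itertools.product(ALGOS, SEEDS):
    cmd = build_command(algo, seed, "<path_to_target_policy>", num_updates=1000)
    # Print a shell-safe command line; pass `cmd` to subprocess.run(cmd, check=True)
    # instead to actually launch the run.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list avoids shell-quoting problems when a policy path contains spaces.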

All the target policies to be evaluated are saved in the folder `./target_policies/`.
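
For readers less familiar with PDIS (per-decision importance sampling), the estimator weights each reward by the product of the target-to-behavior probability ratios up to that step. The sketch below is a generic textbook version, not the implementation in `policy_eval`; the `(rho, reward)` trajectory layout and the function name `pdis_estimate` are assumptions. In this package the behavior probabilities would presumably come from the pre-trained behavioral clone.

```python
def pdis_estimate(trajectories, gamma=1.0):
    """Per-decision importance sampling estimate of the target policy's return.

    Hypothetical layout: each trajectory is a list of (rho, reward) pairs, where
    rho = pi_target(a_t | s_t) / pi_behavior(a_t | s_t) for that step.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    total = 0.0
    for traj in trajectories:
        cumulative_ratio = 1.0  # product of rho_0 .. rho_t
        discount = 1.0          # gamma ** t
        for rho, reward in traj:
            cumulative_ratio *= rho
            total += discount * cumulative_ratio * reward
            discount *= gamma
    return total / len(trajectories)


# Two toy trajectories of length two.
toy = [
    [(2.0, 1.0), (0.5, 1.0)],
    [(1.0, 0.0), (1.0, 2.0)],
]
print(pdis_estimate(toy, gamma=0.5))
```

Because each reward is weighted only by the ratios up to its own time step, later low-probability actions do not inflate the weight of earlier rewards, which typically lowers variance relative to trajectory-wise importance sampling.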

To reproduce the OPEHF results using PDIS as the downstream estimator, simply execute
```
sh iw.sh
cd policy_eval_results
python result_analysis.py
```

The terminal should then display

```
MAE 0.5536538612270927 
Rank 0.7425282823862205 
Regret@1 0.0 
```

# Remark

This package is designed to illustrate our implementation concisely for reviewing purposes. If the paper is accepted, we will refactor our code into a library that can be distributed through GitHub/PyPI.