# Video-Language Critic (VLC): Transferable Reward Functions for Language-Conditioned Robotics

## Setup

```
conda env create -f vlc.yml
conda activate vlc

pip install -e .
```

## Usage

Use the expert scripts in Meta-World to collect demonstrations and save the corresponding videos as mp4 files in DATA_DIRECTORY.
The exact dataset used in the paper will be made available at the camera-ready stage.
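
Before training, it can help to confirm that the videos actually landed in DATA_DIRECTORY. The sketch below is a hypothetical helper, not part of this repository: `count_mp4` is an illustrative name, and the recursive search is an assumption about the directory layout.

```shell
# Hypothetical helper (not part of the VLC repository):
# count the mp4 files under a directory, searching subdirectories too.
count_mp4() {
    find "$1" -type f -name '*.mp4' | wc -l | tr -d ' '
}

# Example (replace DATA_DIRECTORY with your own path):
# count_mp4 DATA_DIRECTORY
```

A count of zero usually means the videos were written somewhere else or saved with a different extension.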

To train VLC on Meta-World videos contained in DATA_DIRECTORY:
```
torchrun --master_port MASTER_PORT --nproc_per_node 1 main_task_retrieval.py \
    --num_thread_reader 6 --epochs 20 --batch_size 64 --n_display 20 \
    --data_path DATA_DIRECTORY --features_path DATA_DIRECTORY \
    --output_dir EXPERIMENT_OUTPUT_DIRECTORY \
    --seed 1 --efficient_subsample --video_max_len -1 --lr 1e-4 \
    --max_words 32 --max_frames 12 --batch_size_val 64 \
    --datatype mw --loss_type sequence_ranking_loss --ranking_loss_weight 33 \
    --feature_framerate 5 --coef_lr 1e-3 --freeze_layer_num 0 \
    --use_failures_as_negatives_only --slice_framepos 3 --test_slice_framepos 2 \
    --augment_images --linear_patch 2d --sim_header tightTransf \
    --pretrained_clip_name ViT-B/32 --main_eval_metric loss \
    --other_eval_metrics=strict_auc,tv_MeanR,vt_MedianR,vt_R1,tv_R1,tv_R10,tv_R5,labeled_auc,vt_loss \
    --do_train --n_ckpts_to_keep -1
```
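
The training command uses placeholders (DATA_DIRECTORY, MASTER_PORT) that must be replaced with real values. As a minimal sketch, a pre-flight check along these lines could catch an unreplaced placeholder before `torchrun` starts. The function name `check_train_inputs` is hypothetical and not part of this repository; the 1-65535 bound is the general TCP port range, not a VLC-specific rule.

```shell
# Hypothetical pre-flight check (not part of the VLC repository).
# Usage: check_train_inputs DATA_DIRECTORY MASTER_PORT
check_train_inputs() {
    data_dir=$1
    port=$2
    # The data directory must already exist.
    if [ ! -d "$data_dir" ]; then
        echo "data directory not found: $data_dir" >&2
        return 1
    fi
    # The port must consist of digits only.
    case $port in
        ''|*[!0-9]*)
            echo "master port must be a positive integer: $port" >&2
            return 1
            ;;
    esac
    # The port must fall in the valid TCP range.
    if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
        echo "master port out of range (1-65535): $port" >&2
        return 1
    fi
}
```

For example, `check_train_inputs /path/to/videos 29500 && torchrun ...` would launch training only if both values pass the check; the path and port here are illustrative.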


To train VLC on Open X-Embodiment videos, first download the dataset by following the instructions at https://github.com/google-deepmind/open_x_embodiment.

We used the dataset metadata to download only the splits that include language annotations and placed them in OPENX_DATA_DIRECTORY.

Optionally, to use VLMBench as validation data, collect demonstration videos of the pick task and place them in `./vlm_test_labeled_processed_picks`. The exact dataset used in the paper will be made available at the camera-ready stage.

To train VLC on Open X videos contained in OPENX_DATA_DIRECTORY:
```
torchrun --master_port MASTER_PORT --nproc_per_node 1 main_task_retrieval.py \
    --num_thread_reader 6 --epochs 15 --batch_size 64 --n_display 200 \
    --data_path OPENX_DATA_DIRECTORY --output_dir OPENX_EXPERIMENT_OUTPUT_DIRECTORY \
    --test_data_path vlm_test_labeled_processed_picks \
    --test_features_path vlm_test_labeled_processed_picks \
    --test_set_name vlm_test_labeled_processed_picks --test_datatype vlm \
    --seed 1 --efficient_subsample --video_max_len -1 --lr 1e-4 \
    --max_words 32 --max_frames 12 --batch_size_val 64 \
    --datatype openx --loss_type cross_entropy \
    --feature_framerate 5 --coef_lr 1e-3 --freeze_layer_num 0 \
    --use_failures_as_negatives_only --slice_framepos 3 --test_slice_framepos 2 \
    --augment_images --linear_patch 2d --sim_header tightTransf \
    --pretrained_clip_name ViT-B/32 --main_eval_metric loss \
    --other_eval_metrics=strict_auc,tv_MeanR,vt_MedianR,vt_R1,tv_R1,tv_R10,tv_R5,labeled_auc,vt_loss \
    --do_train --n_ckpts_to_keep -1
```

## Acknowledgments
Our code is based on [CLIP4Clip](https://github.com/ArrowLuo/CLIP4Clip).
