# Subtask-Aware Visual Reward Learning from Segmented Demonstrations

## How to run the code

### Install dependencies

```bash
conda create -y -n reds python=3.8
conda activate reds

pip install --upgrade pip
```

## Training Reward Model

### FurnitureBench
```bash
# REDS (initial training, only with expert demonstrations)
CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false python -m bpref_v2.reward_learning.train_reds --comment={experiment_name} --furniturebench.data_dir={input_data_path} --logging.output_dir={output_path} --batch_size=32 --model_type=REDS --early_stop=False --log_period=100 --eval_period=1000 --save_period=1000 --train_steps=5000 --eval_steps=10 --furniturebench.task_name=one_leg --furniturebench.num_demos={number of demos, to use all demos in the folder, use -1} --furniturebench.env_type=furniturebench --env=furniturebench-one_leg --furniturebench.window_size=4 --furniturebench.skip_frame=1 --reds.lambda_supcon=1.0 --reds.lambda_epic 1.0 --reds.transfer_type=clip_vit_b16 --reds.embd_dim=512 --reds.output_embd_dim=512 --augmentations="crop|jitter" --furniturebench.output_type raw --logging.online=True --reds.epic_on_neg_batch=False --reds.supcon_on_neg_batch=False --furniturebench.pearson_size=8

# REDS (iterative training with expert + suboptimal demonstrations)
CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false python -m bpref_v2.reward_learning.train_reds --comment={experiment_name} --furniturebench.data_dir={input_data_path} --logging.output_dir={output_path} --batch_size=32 --model_type=REDS --early_stop=False --log_period=100 --eval_period=1000 --save_period=1000 --train_steps=5000 --eval_steps=10 --furniturebench.task_name=one_leg --furniturebench.num_demos={number of demos, to use all demos in the folder, use -1} --furniturebench.env_type=furniturebench --env=furniturebench-one_leg --furniturebench.window_size=4 --furniturebench.skip_frame=1 --reds.lambda_supcon=1.0 --reds.lambda_epic 1.0 --reds.transfer_type=clip_vit_b16 --reds.embd_dim=512 --reds.output_embd_dim=512 --augmentations="crop|jitter" --furniturebench.output_type raw --logging.online=True --use_failure=True --reds.epic_on_neg_batch=True --reds.supcon_on_neg_batch=True --furniturebench.pearson_size=8

# DrS
CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false python -m bpref_v2.reward_learning.train_reds --comment={experiment_name} --furniturebench.data_dir={input_data_path} --logging.output_dir={output_path} --batch_size=32 --model_type=DrS --early_stop=False --log_period=100 --eval_period=1000 --save_period=1000 --train_steps=5000 --eval_steps=10 --furniturebench.task_name=one_leg --furniturebench.num_demos={number of demos, to use all demos in the folder, use -1} --furniturebench.env_type=furniturebench --env=furniturebench-one_leg --furniturebench.window_size=4 --furniturebench.skip_frame=1 --augmentations="crop|jitter" --furniturebench.output_type raw --logging.online=True --use_failure=True
```
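
The two REDS commands above differ only in three flags: the iterative variant adds `--use_failure=True` and switches `--reds.epic_on_neg_batch` and `--reds.supcon_on_neg_batch` to `True`. The sketch below is a hypothetical convenience helper, not part of the repository, that assembles the same argument list in Python so the difference is easier to see. The helper name `build_reds_command` and its parameters are assumptions made for illustration. It also writes every flag in `--name=value` form, which assumes the training script accepts that form for the two flags the README passes with a space (`--reds.lambda_epic` and `--furniturebench.output_type`).

```python
import shlex

# Flag values copied from the FurnitureBench REDS commands in this README.
COMMON_FLAGS = {
    "batch_size": 32,
    "model_type": "REDS",
    "early_stop": False,
    "log_period": 100,
    "eval_period": 1000,
    "save_period": 1000,
    "train_steps": 5000,
    "eval_steps": 10,
    "furniturebench.task_name": "one_leg",
    "furniturebench.env_type": "furniturebench",
    "env": "furniturebench-one_leg",
    "furniturebench.window_size": 4,
    "furniturebench.skip_frame": 1,
    "reds.lambda_supcon": 1.0,
    "reds.lambda_epic": 1.0,
    "reds.transfer_type": "clip_vit_b16",
    "reds.embd_dim": 512,
    "reds.output_embd_dim": 512,
    "augmentations": "crop|jitter",
    "furniturebench.output_type": "raw",
    "logging.online": True,
    "furniturebench.pearson_size": 8,
}


def build_reds_command(comment, data_dir, output_dir, num_demos=-1, iterative=False):
    """Return the argv list for a REDS training run (hypothetical helper).

    num_demos=-1 uses all demos in the folder, as noted in the README.
    iterative=True corresponds to training with expert + suboptimal demos.
    """
    flags = {
        "comment": comment,
        "furniturebench.data_dir": data_dir,
        "logging.output_dir": output_dir,
        "furniturebench.num_demos": num_demos,
        **COMMON_FLAGS,
        # These two flags are False for initial training, True for iterative.
        "reds.epic_on_neg_batch": iterative,
        "reds.supcon_on_neg_batch": iterative,
    }
    if iterative:
        # Only the iterative command passes --use_failure.
        flags["use_failure"] = True
    argv = ["python", "-m", "bpref_v2.reward_learning.train_reds"]
    argv += [f"--{name}={value}" for name, value in flags.items()]
    return argv


if __name__ == "__main__":
    # Placeholder values; replace with your own experiment name and paths.
    cmd = build_reds_command("my_experiment", "/path/to/data", "/path/to/output")
    print(shlex.join(cmd))
```

The printed string can be pasted into a shell after the same environment variables used above (`CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false`).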

### Metaworld, RLBench
```bash
# REDS (initial training, only with expert demonstrations)
CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false python -m bpref_v2.reward_learning.train_reds --comment={experiment_name} --robot.data_dir={input_data_path} --logging.output_dir={output_path} --batch_size=32 --model_type=REDS --early_stop=False --log_period=100 --eval_period=1000 --save_period=1000 --train_steps=5000 --eval_steps=10 --robot.task_name={task_name} --robot.num_demos={number of demos, to use all demos in the folder, use -1} --robot.env_type={metaworld|rlbench} --env={metaworld|rlbench_{task_name}} --robot.window_size=4 --robot.skip_frame=1 --reds.lambda_supcon=1.0 --reds.lambda_epic 1.0 --reds.transfer_type=clip_vit_b16 --reds.embd_dim=512 --reds.output_embd_dim=512 --augmentations="crop|jitter" --robot.output_type raw --logging.online=True --reds.epic_on_neg_batch=False --reds.supcon_on_neg_batch=False --robot.pearson_size=8

# REDS (iterative training with expert + suboptimal demonstrations)
CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false python -m bpref_v2.reward_learning.train_reds --comment={experiment_name} --robot.data_dir={input_data_path} --logging.output_dir={output_path} --batch_size=32 --model_type=REDS --early_stop=False --log_period=100 --eval_period=1000 --save_period=1000 --train_steps=5000 --eval_steps=10 --robot.task_name={task_name} --robot.num_demos={number of demos, to use all demos in the folder, use -1} --robot.env_type={metaworld|rlbench} --env={metaworld|rlbench_{task_name}} --robot.window_size=4 --robot.skip_frame=1 --reds.lambda_supcon=1.0 --reds.lambda_epic 1.0 --reds.transfer_type=clip_vit_b16 --reds.embd_dim=512 --reds.output_embd_dim=512 --augmentations="crop|jitter" --robot.output_type raw --logging.online=True --use_failure=True --reds.epic_on_neg_batch=True --reds.supcon_on_neg_batch=True --robot.pearson_size=8

# DrS
CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false python -m bpref_v2.reward_learning.train_reds --comment={experiment_name} --robot.data_dir={input_data_path} --logging.output_dir={output_path} --batch_size=32 --model_type=DrS --early_stop=False --log_period=100 --eval_period=1000 --save_period=1000 --train_steps=5000 --eval_steps=10 --robot.task_name={task_name}  --robot.window_size=4 --robot.skip_frame=1 --robot.num_demos={number of demos, to use all demos in the folder, use -1} --robot.env_type={metaworld|rlbench} --env={metaworld|rlbench_{task_name}} --augmentations="crop|jitter" --robot.output_type raw --logging.online=True --use_failure=True

# ORIL
CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_PREALLOCATE=false python -m bpref_v2.reward_learning.train_reds --comment={experiment_name} --robot.data_dir={input_data_path} --logging.output_dir={output_path} --batch_size=32 --model_type=ORIL --early_stop=False --log_period=100 --eval_period=1000 --save_period=1000 --train_steps=5000 --eval_steps=10 --robot.task_name={task_name}  --robot.window_size=4 --robot.skip_frame=1 --robot.num_demos={number of demos, to use all demos in the folder, use -1} --robot.env_type={metaworld|rlbench} --env={metaworld|rlbench_{task_name}} --augmentations="crop|jitter" --robot.output_type raw --logging.online=True --use_failure=True
```


## Acknowledgments

Our code is based on the implementation of [PreferenceTransformer](https://github.com/csmile-1006/PreferenceTransformer).
