# ICLR 2026 Code Submission (Causally Robust Reward Learning from Reason-Augmented Preference Feedback)

Python environment dependencies:
- python 3.9
- requirements specified under "./recouple/environment_gpu.yaml"
- Metaworld (requires gym==0.23.0)
- ManiSkill3
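
A minimal setup sketch, assuming "./recouple/environment_gpu.yaml" is a conda environment specification and that the command is run from the repository root. The `ENV_FILE` variable is introduced here for illustration only; the name of the resulting environment is whatever the YAML file's `name:` field specifies. Metaworld (with gym==0.23.0) and ManiSkill3 may still need to be installed separately.

```sh
#!/bin/sh
# Assumption: the YAML file is a conda environment spec and we are in the repo root.
ENV_FILE="./recouple/environment_gpu.yaml"

if [ -f "$ENV_FILE" ]; then
    # Creates the environment named in the file's `name:` field.
    conda env create -f "$ENV_FILE"
else
    # Fail softly with a hint instead of a conda error.
    echo "missing $ENV_FILE; run this from the repository root" >&2
fi
```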

Brief code explanations/paths:
- ManiSkill environments for RQ1: "./ManiSkill3/envs"
- Metaworld environments for RQ2: "./ManiSkill3/envs"
- Core algorithms: "./recouple/research/algs"
    - ReCouPLe-EC: "./recouple/research/algs/rpl_proj_eq.py"
    - ReCouPLe-IC: "./recouple/research/algs/rpl_proj_2bt.py"
    - BT (baseline): "./recouple/research/algs/piql.py"
    - BT-Multi (baseline): "./recouple/research/algs/mtpiql.py"
    - RFP (baseline): "./recouple/research/algs/rpl.py"
    - IQL (common for all offline RL policy learning experiments): "./recouple/research/algs/offline/iql.py"


Sample Experiment Procedure (for RQ2 Metaworld experiments):

1. Collect trajectories 
```sh
python scripts/create_metaworld_dataset.py --env push-wall-v2 --path push-wall-v2-valid
```
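
Step 2 below reads datasets for four environments (pick-place-v2, pick-place-wall-v2, push-v2, push-wall-v2), so trajectory collection presumably has to be repeated for each of them. The loop below is a sketch of that, assuming the `--path` value for a training dataset equals the environment name (as the `--paths` arguments in step 2 suggest). `DRY_RUN` and `collect_all` are hypothetical helpers, not part of the repository; by default the script only prints the commands.

```sh
#!/bin/sh
# Assumption: each training dataset path equals its environment name.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

collect_all() {
    for env in pick-place-v2 pick-place-wall-v2 push-v2 push-wall-v2; do
        cmd="python scripts/create_metaworld_dataset.py --env $env --path $env"
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"   # show what would be run
        else
            $cmd          # run the collection script for this environment
        fi
    done
}

collect_all
```

Running it with `DRY_RUN=0` executes the four collection commands in sequence.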

2. Create the preference dataset for training
```sh
python scripts/create_metaworld_comparison_dataset_with_reason.py --envs pick-place-v2 --paths pick-place-v2 --output comparison_dataset --seed 1
python scripts/create_metaworld_comparison_dataset_with_reason.py --envs pick-place-wall-v2 push-v2 push-wall-v2 --paths pick-place-wall-v2 push-v2 push-wall-v2 --output comparison_dataset --seed 1
```

3. Run jobs
```sh
sh metaworld_jobs/recouple_eq.sh
```


