# Running Reward Learning
To run reward learning, run one of the following commands:
```
python TREX_ensemble.py --voi dis --env_name halfcheetah-medium-expert-v2 --initial_pairs 5 --num_rounds 10 --num_queries 2 --seed 205 --retrain_num_iter 5
CUDA_VISIBLE_DEVICES=3 python TREX_ensemble.py --voi myucb --env_name halfcheetah-medium-expert-v2 --initial_pairs 1 --num_rounds 10 --num_queries 1 --seed 385 --retrain_num_iter 20 --num_iter 20

CUDA_VISIBLE_DEVICES=3 python TREX_ensemble.py --voi myucb --env_name halfcheetah-medium-v2 --initial_pairs 1 --num_rounds 10 --num_queries 1 --seed 385 --retrain_num_iter 20 --num_iter 20

# 5 dis
# 15 myucb
# 25 contrastive
# 35 test

CUDA_VISIBLE_DEVICES=6 python TREX_ensemble.py --voi contrastive --env_name antmaze-large-diverse-v0 --initial_pairs 5 --num_rounds 10 --num_queries 5 --seed 35 --retrain_num_iter 20 --num_iter 20

CUDA_VISIBLE_DEVICES=4 python TREX_ensemble.py --voi dis --env_name halfcheetah-medium-v2 --initial_pairs 1 --num_rounds 10 --num_queries 1 --seed 285 --retrain_num_iter 20 --num_iter 20

CUDA_VISIBLE_DEVICES=7 python TREX_ensemble.py --voi dis --env_name antmaze-medium-play-v0 --initial_pairs 50 --num_rounds 50 --num_queries 50 --seed 35 --retrain_num_iter 50 --num_iter 50

python TREX_ensemble_contrastive.py --voi greedy --env_name halfcheetah-medium-replay-v2 --initial_pairs 50 --num_rounds 10 --num_queries 10 --seed 5
python TREX_ensemble_contrastive.py --voi greedy --env_name halfcheetah-medium-expert-v2 --initial_pairs 50 --num_rounds 10 --num_queries 10 --seed 7
```
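When launching many runs like the ones above, it can help to assemble the command line programmatically. The following is a minimal sketch and not part of this repository: `build_run` is a hypothetical helper, and it only uses flags and the `CUDA_VISIBLE_DEVICES` variable that already appear in the commands above.

```python
import os
import shlex


def build_run(script, gpu=None, **flags):
    """Return (argv, env) for one run; each keyword becomes a `--name value` pair."""
    argv = ["python", script]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    # Copy the current environment and optionally pin the run to one GPU.
    env = dict(os.environ)
    if gpu is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return argv, env


argv, env = build_run(
    "TREX_ensemble.py",
    gpu=3,
    voi="myucb",
    env_name="halfcheetah-medium-v2",
    initial_pairs=1,
    num_rounds=10,
    num_queries=1,
    seed=385,
    retrain_num_iter=20,
    num_iter=20,
)
print(shlex.join(argv))
# To actually launch the run: subprocess.run(argv, env=env, check=True)
```

Keeping the flags as keyword arguments makes it easy to loop over seeds or `--voi` variants without copying whole command lines.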



voi list:
- `dis` / `info`: common active learning techniques used in the baseline paper
- `greedy`: greedy oracle + random
- `myucb`: our UCB query selection

Scripts:
- `TREX_ensemble.py`: preference-based reward learning.
- `TREX_ensemble_contrastive.py`: contrastive reward learning. `--voi` does not affect this method.

Use `--initial_pairs 5` and `--num_queries 2` so that, with `--num_rounds 10`, we can test performance with 5 to 25 queries.
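The 5-25 range can be checked with a small calculation. This sketch assumes that each round adds exactly `num_queries` labeled pairs and that `--num_rounds 10` is used, as in the first command above. `labels_after_each_round` is a hypothetical helper, not a function from the scripts.

```python
def labels_after_each_round(initial_pairs, num_queries, num_rounds):
    """Cumulative number of labeled pairs, starting with the initial set (round 0)."""
    return [initial_pairs + r * num_queries for r in range(num_rounds + 1)]


budget = labels_after_each_round(initial_pairs=5, num_queries=2, num_rounds=10)
print(budget[0], budget[-1])  # prints: 5 25
```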
- TODO: train longer in each iteration (increase `--retrain_num_iter`)?



Seed log. Each entry maps a `--seed` value to the configuration it was run with; as in the commands above, `A+B` means `--initial_pairs A` and `--num_queries B`.

antmaze:
- 5: 50+10 dis
- 15: 1+1 dis
- 25: 50+50 dis
- 35: 50+50, 50 rounds

Other runs:
- 0: initial
- 5: relu
- 135: sigmoid
- 145: sigmoid 20+5
- 155: sigmoid greed+random 20+5
- 165: sigmoid ucb
- 175: 5+1 greed+random
- 185: 5+2 greed+random
- 195: 5+2 ucb
- 205: 5+2
- 215: 5+2 greed+random retrain 5
- 225: 5+2 ucb retrain 5
- 235: greed+random retrain 20
- 245: ucb retrain 20
- 255: dis retrain 20
- 265-285: 1+1 greed ucb dis
- 295: 1+1 mean only
- 305: ucb test
- 315: ucb test
- 325: ucb min test
- 335: 5+5
- 345: 5+5 oracle
- 355: 5+5 dis
- 365: ucb varonly
- 375: reverse test
- 385: 0 test
- 7: contrastive
- 15: greedy
- 25: ucb
- 35: greedy high return (wrong)
- 45: true greedy high both
- 55: true greedy high+random
- 65: num query 5 both
- 75: num query 5 high+random
- 85: num query 5 ori
- 95: mean+0.5*std
- 105: 0.1
- 115: 1
- 125: test 0.1

Usage notes:
- Use `TREX_dropout.py` instead of `TREX_ensemble.py` to represent uncertainty with dropout rather than an ensemble of models.
- `--voi` specifies the variant of estimated value of information used for active query selection (`dis` = disagreement, `info` = information gain). If left empty, random queries are used.
- `--env_name` specifies the environment, for example `maze2d-medium-dense-v1` or any of the environments used in the commands above.
- `--initial_pairs` specifies the initial number of pairs of trajectories used to train the reward models.
- `--num_rounds` specifies the number of rounds of querying for additional pairs of trajectories and labels.
- `--num_queries` specifies the number of query pairs per round.
- The model predictions will be saved to the `rewards/` directory by default.
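
To make the `dis` and UCB options more concrete, here is a minimal sketch of how an ensemble's predictions could be turned into query scores. It is not the code in `TREX_ensemble.py`: the function names, the Bradley-Terry preference model, and the default `beta=0.5` (borrowed from the `mean+0.5*std` note above) are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def preference_prob(ret_a, ret_b):
    """Bradley-Terry probability that trajectory A is preferred over B."""
    return 1.0 / (1.0 + math.exp(ret_b - ret_a))


def disagreement(returns_a, returns_b):
    """Spread of the ensemble members' preference predictions for one pair."""
    probs = [preference_prob(a, b) for a, b in zip(returns_a, returns_b)]
    return pstdev(probs)


def ucb(returns, beta=0.5):
    """Optimistic return estimate: ensemble mean plus beta times ensemble std."""
    return mean(returns) + beta * pstdev(returns)


# Hypothetical predicted returns from a 3-member ensemble (one value per member).
agree = disagreement([2.0, 2.1, 1.9], [0.0, 0.1, -0.1])  # members agree on A > B
split = disagreement([2.0, -2.0, 0.0], [0.0, 0.0, 0.0])  # members contradict each other
print(split > agree)
```

Under these assumptions, a disagreement-based selector would query the pairs with the largest spread, while a UCB-style selector would favor trajectories whose predicted return is high or uncertain.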