To set up the dataset and train from scratch, follow these steps:

I. Run src/actor_critic.py to learn the policies that will later be used to collect trajectories. The input arguments are:
1. env: the learning environment and its version, e.g., Walker2d-v3. Any environment from the gym library can be used.
2. alg: the RL algorithm, e.g., soft actor-critic. The options are {ddpg, td3, sac}.

(If the policies used for data collection are not already specified, they must be chosen manually from among the learned policies.)

II. Run src/collect_trajectory.py to collect the training dataset. The input arguments are:
1. env: the learning environment, e.g., Walker2d-v3.
2. traj: the type of policy used to collect trajectories, e.g., an expert policy. The options are {expert, medium, mixed}.
3. sample: the method for sampling pairs of trajectories for preference feedback, e.g., uniformly at random. The options are {uniform}.
4. pref: the preference model, e.g., the Bradley-Terry model. The options are {regular}.
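To make the sample and pref arguments concrete, here is a minimal sketch of uniform pair sampling with Bradley-Terry preference labels. It is an illustration only, not the code in src/collect_trajectory.py: the function names are hypothetical, and it assumes that preferences are generated from scalar trajectory returns and that the "regular" option corresponds to the standard Bradley-Terry model.

```python
import math
import random


def bradley_terry_prob(return_a, return_b):
    """Probability that trajectory A is preferred over B under Bradley-Terry.

    Equivalent to exp(a) / (exp(a) + exp(b)), written as a sigmoid of the
    return difference so that large returns do not overflow.
    """
    diff = return_a - return_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    z = math.exp(diff)
    return z / (1.0 + z)


def sample_preference_pairs(returns, num_pairs, seed=0):
    """Draw trajectory pairs uniformly at random and label each one.

    `returns` is a list of scalar returns, one per trajectory (an assumption
    of this sketch). Each result is (i, j, label), where label == 1 means
    trajectory i was preferred over trajectory j.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Two distinct trajectory indices, chosen uniformly at random.
        i, j = rng.sample(range(len(returns)), 2)
        p = bradley_terry_prob(returns[i], returns[j])
        label = 1 if rng.random() < p else 0
        pairs.append((i, j, label))
    return pairs
```

For example, sample_preference_pairs([10.0, 12.5, 9.0], num_pairs=5) returns five labeled pairs drawn from three trajectories. Because the labels are sampled rather than taken deterministically, the lower-return trajectory is occasionally preferred, which is the stochastic behavior the Bradley-Terry model describes.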

III. Run src/rlhf_training.py to run offline PbRL (preference-based RL) training on the dataset. The input arguments are:
1. env: the learning environment, e.g., Walker2d-v3.
2. traj: the type of policy used to collect trajectories, e.g., expert.
3. sample: the method for sampling pairs of trajectories for preference feedback, e.g., uniform.
4. pref: the preference model, e.g., regular.
5. data: the size of the training set.
6. learn: the PbRL algorithm, e.g., behavior. The options are {test, baseline, uncertainty, behavior, brac}.
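The three steps can be chained from a small driver script, sketched below. It rests on an assumption: that each script accepts its arguments as "--name value" flags, which is not confirmed here, so check each script's argument parser before relying on it. The helper build_command and the data value 1000 are hypothetical placeholders.

```python
import sys


def build_command(script, **args):
    """Build an argv list for one pipeline step.

    Assumes (unverified) that the scripts take "--name value" style flags.
    """
    cmd = [sys.executable, script]
    for name, value in args.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Arguments shared by the collection and training steps.
common = dict(env="Walker2d-v3", traj="expert", sample="uniform", pref="regular")

steps = [
    build_command("src/actor_critic.py", env="Walker2d-v3", alg="sac"),
    build_command("src/collect_trajectory.py", **common),
    # data=1000 is a placeholder training-set size, not a recommended value.
    build_command("src/rlhf_training.py", **common, data=1000, learn="behavior"),
]

for cmd in steps:
    # Print the commands for inspection; to execute them in order, replace
    # this line with subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Printing the commands first makes it easy to confirm that the same env, traj, sample, and pref values are passed to both the collection and training steps, since the training step presumably has to match the dataset produced in step II.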

