# Pareto Optimal Preference Learning
 
Code for the paper "Pareto Optimal Learning from Preference with Hidden Context" 

Anonymous Authors


## Requirements

Requirements for this code are listed in `requirements.txt`. You can install them using the following command:

```
pip install -r requirements.txt
```


## Outline

- Synthetic Stateless Experiment Reproduction
- Minigrid/Metaworld Experiment Reproduction
- Generating and Labeling Your Own Minigrid/Metaworld Demos
- LLM Experiments

## Synthetic Stateless Experiment Reproduction

The synthetic stateless experiment is configured through the Hydra config file at `configs/synthetic_stateless.yaml`. To reproduce the experiment, run the following command:

```
python -m experiments.synthetic_stateless
```

This runs the experiment 10 times with different seeds. The results are saved in a dated folder in the `multirun` directory.
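To collect the per-seed outputs for analysis, a small helper along these lines may be useful. It is a sketch that assumes Hydra's default sweep layout (`multirun/<date>/<time>/<job number>`); if the config overrides `hydra.sweep.dir`, the paths will differ. The function name `list_sweep_runs` is hypothetical and not part of this repository.

```python
from pathlib import Path


def list_sweep_runs(multirun_root="multirun"):
    """Return the job directories of the most recent sweep, ordered by job number.

    Assumes Hydra's default layout: <root>/<YYYY-MM-DD>/<HH-MM-SS>/<job number>.
    """
    root = Path(multirun_root)
    # Sweep folders sit two levels below the root; their names sort chronologically.
    sweeps = sorted(p for p in root.glob("*/*") if p.is_dir())
    if not sweeps:
        return []
    latest = sweeps[-1]
    # Keep only the numbered job folders, skipping anything else in the sweep folder.
    jobs = [p for p in latest.iterdir() if p.is_dir() and p.name.isdigit()]
    return sorted(jobs, key=lambda p: int(p.name))


if __name__ == "__main__":
    for run_dir in list_sweep_runs():
        print(run_dir)
```

Sorting by the integer job number keeps run `10` after run `9` rather than after run `1`.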

## Minigrid/Metaworld Experiment Reproduction

We do not currently provide the exact demos used in the paper. To fully reproduce the results, we encourage you to generate your own demos as detailed below.


## Generating and Labeling Your Own Minigrid/Metaworld Demos

First, train a set of policies that will generate the demos. To do this, we use stable-baselines3 to train a PPO agent on the Minigrid environment. The training script is located at `experiments/demo_prep/train_policies.py` for Minigrid and `experiments/demo_prep/metaworld_train_policies.py` for Metaworld. You can run the Minigrid script using the following command:

```
python -m experiments.demo_prep.train_policies
```

The config for this script can be found at `experiments/config/demos.yaml`. Here, you can configure the number of policies to train, the number of steps to train each policy for, and the environment to train on.

Once you run the script, the policies will be saved in the `policies` directory.

Then, you can record demos from these policies and label them based on log likelihoods using the command:

```
python -m experiments.demo_prep.record_demos
```
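As an illustration of what labeling by log likelihood can look like, the sketch below scores each demo by the summed log-probability of its actions and prefers the higher-scoring demo in a pair. This is an assumption about the general idea only; the actual labeling rule is whatever `experiments/demo_prep/record_demos` implements, and the function names here are hypothetical.

```python
import math


def trajectory_log_likelihood(action_probs):
    """Sum the log-probabilities that a policy assigned to the actions taken."""
    return sum(math.log(p) for p in action_probs)


def label_pair(probs_a, probs_b):
    """Return 0 if demo A is preferred and 1 if demo B is preferred.

    Assumption: the demo with the higher log likelihood is preferred,
    and ties go to demo A.
    """
    ll_a = trajectory_log_likelihood(probs_a)
    ll_b = trajectory_log_likelihood(probs_b)
    return 0 if ll_a >= ll_b else 1


if __name__ == "__main__":
    # Per-step probabilities of the taken actions for two short demos.
    demo_a = [0.9, 0.8]
    demo_b = [0.5, 0.5]
    print(label_pair(demo_a, demo_b))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on long demos.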

Finally, to run experiments, configure the settings in `experiments/configs/rl_domains` and run using the command:

```
python -m experiments.rl_domains
```

The resulting sets of policies should be saved to a time-stamped folder that can be used for analysis (e.g. with `personalization.py`).

Note: set normalization to False for Metaworld, because the policy is meant to output more than one continuous action at a time. For Minigrid, the output is a softmax distribution over the actions, so normalization should be True. You must also ensure that the policy chosen for Minigrid uses softmax as the output activation function (to do this, uncomment lines 63-64 in `src/popl`). We aim to make this more user-friendly when open sourcing this code.
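For reference, the softmax mentioned in the note turns a vector of raw action scores into a probability distribution over discrete actions. The sketch below is a generic, standard-library implementation for illustration only; it is not the code in `src/popl`.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Three hypothetical action scores for a discrete-action policy.
    print(softmax([1.0, 2.0, 3.0]))
```

A continuous multi-dimensional action, as in Metaworld, is not a distribution over a fixed set of choices, which is why the same normalization does not apply there.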

## LLM Experiments

To reproduce results from the LLM portion of the experiments, please navigate to the folder `popl_llm` for more instructions.