# POIL: Preference-based Optimization for Efficient Offline Imitation Learning Experiments: Dataset, Environment Details, and Usage Instructions

## Table of Contents

1. [Dataset Sources](#dataset-sources)
2. [Environment Setup](#environment-setup)
3. [Evaluation Metrics](#evaluation-metrics)
4. [Implementation Details](#implementation-details)
5. [Reproducibility](#reproducibility)
6. [Results](#results)
7. [Code Usage Instructions](#code-usage-instructions)
8. [Additional Notes](#additional-notes)

## Dataset Sources

Our experiments utilize two primary sources of expert demonstrations:

### 1. Google Research Value Dice Dataset

Source: [google-research/value_dice/datasets](https://github.com/google-research/google-research/tree/master/value_dice/datasets)

For each task in this dataset, we extracted the first three trajectories and used them as expert demonstrations in single-trajectory experiments. This lets us evaluate each method when expert data is scarce.

Tasks included:

- Hopper-v2
- HalfCheetah-v2
- Walker2d-v2

### 2. D4RL Dataset

Source: [Farama-Foundation/D4RL](https://github.com/Farama-Foundation/D4RL)

From the D4RL dataset, we focused on the MuJoCo-related datasets, specifically the expert-v2 versions. To create varied experimental conditions, we randomly sampled different percentages of transitions from the full datasets:

- 2% of transitions
- 5% of transitions
- 10% of transitions
- 100% of transitions (full dataset)

Environments used:

- Hopper-v2
- HalfCheetah-v2
- Walker2d-v2
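
The transition subsampling described above can be sketched as follows. This is a minimal illustration, not the repository's loader: it assumes the dataset is a dict of NumPy arrays that all share the same first dimension, and the function name `subsample_transitions`, the sampling without replacement, and the sampling seed are assumptions made for the example.

```python
import numpy as np


def subsample_transitions(dataset, fraction, seed=0):
    """Return a random subset of transitions, sampled without replacement.

    `dataset` is assumed to be a dict of equal-length NumPy arrays
    (hypothetical layout, e.g. "observations" and "actions").
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n = len(next(iter(dataset.values())))
    k = max(1, int(round(n * fraction)))
    rng = np.random.default_rng(seed)
    # Sort the indices so the kept transitions stay in their original order.
    idx = np.sort(rng.choice(n, size=k, replace=False))
    # Index every array with the same indices to keep fields aligned.
    return {key: value[idx] for key, value in dataset.items()}


# Toy stand-in for an expert dataset with 100 transitions.
toy = {
    "observations": np.arange(200, dtype=np.float32).reshape(100, 2),
    "actions": np.arange(100, dtype=np.float32).reshape(100, 1),
}
subsets = {pct: subsample_transitions(toy, pct / 100) for pct in (2, 5, 10, 100)}
```

Using one index array for every field keeps each observation paired with its own action, which matters when the subset is later used for imitation.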

## Environment Setup

We ran all simulations in OpenAI Gym, using the following environment versions:

- Hopper-v2
- HalfCheetah-v2
- Walker2d-v2

Each environment was seeded for reproducibility, with the seed value specified in the experimental configuration.

### Seed Values

To ensure reproducibility and to account for the stochastic nature of our experiments, we used three different seed values (0, 1, 2).

## Evaluation Metrics

To assess the performance of our imitation learning algorithms, we used the following primary metric:

- Deterministic Return: The cumulative reward achieved by the learned policy when acting deterministically in the environment.

We evaluated the learned policies at regular intervals during training, using the following parameters:

- Evaluation Frequency: Every 500 training steps
- Number of Episodes: 1
- Horizon: 1000 steps
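
The evaluation protocol above can be sketched as a short rollout loop. This is an illustrative sketch under assumptions: it uses the classic Gym interface (`reset()` returns an observation, `step()` returns a 4-tuple), and `evaluate_deterministic`, `should_evaluate`, and `ToyEnv` are hypothetical names that do not come from the repository.

```python
def evaluate_deterministic(env, policy, horizon=1000, episodes=1):
    """Average cumulative reward of `policy` over `episodes` rollouts.

    `policy` is any callable mapping an observation to a deterministic action.
    """
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        total = 0.0
        for _ in range(horizon):
            obs, reward, done, _info = env.step(policy(obs))
            total += reward
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)


def should_evaluate(step, eval_freq=500):
    """True on training steps where an evaluation is due."""
    return step % eval_freq == 0


class ToyEnv:
    """Stand-in environment: reward 1.0 per step, terminates after 5 steps."""

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= 5, {}


toy_return = evaluate_deterministic(ToyEnv(), policy=lambda obs: 0.0)
```

With a single evaluation episode the reported return can be noisy, which is one reason to compare runs across several seeds.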

## Implementation Details

Our implementation uses PyTorch for neural network models and optimization. Key details include:

- Actor Network: MLP with ReLU activations
- Optimizer: Adam
- Learning Rate: Configurable, default 3e-4
- Weight Decay: Configurable, default 1e-3

For specific hyperparameters and configuration options, please refer to the `parse_args()` function in the main script.
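
As a rough illustration of the setup listed above, the sketch below builds a ReLU MLP actor and an Adam optimizer. The hidden width, the number of hidden layers, and the helper name `build_actor` are assumptions, and the network here outputs an action vector directly, whereas the actual actor may output distribution parameters instead.

```python
import torch
import torch.nn as nn


def build_actor(obs_dim, act_dim, hidden_dim=256, num_hidden=2):
    """MLP with ReLU activations; layer sizes are illustrative assumptions."""
    layers, in_dim = [], obs_dim
    for _ in range(num_hidden):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, act_dim))
    return nn.Sequential(*layers)


# Hopper-v2 has 11-dimensional observations and 3-dimensional actions.
actor = build_actor(obs_dim=11, act_dim=3)

# Values taken from the defaults listed above; the values actually used
# are whatever parse_args() provides at run time.
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4, weight_decay=1e-3)

# Forward pass on a dummy batch of 4 observations.
actions = actor(torch.zeros(4, 11))
```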

## Reproducibility

To ensure reproducibility of our results, we:

1. Set a fixed random seed for PyTorch, NumPy, and the environment
2. Used deterministic CUDA operations where applicable
3. Saved all hyperparameters and configuration details alongside the results
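
Steps 1 and 2 are commonly implemented along the lines of the sketch below. The helper name `set_global_seed` is hypothetical, and the cuDNN flags shown are one common way to request deterministic CUDA behavior, not necessarily the settings used in the main script. Seeding the environment itself (for example via `env.seed(seed)` in older Gym releases) is left out here.

```python
import random

import numpy as np
import torch


def set_global_seed(seed, deterministic=True):
    """Seed Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    if deterministic:
        # Prefer deterministic cuDNN kernels over auto-tuned ones.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


def draw():
    """Draw one sample from each seeded generator."""
    return (random.random(), float(np.random.rand()), torch.rand(1).item())


set_global_seed(0)
first = draw()
set_global_seed(0)
second = draw()  # identical to `first` because the generators were reseeded
```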

## Results

Experimental results are saved as CSV files with the following columns:

- loss: Training loss at each evaluation step
- margin: Margin between expert and learned policy actions
- positive_reward: Reward for choosing expert actions
- negative_reward: Reward for choosing learned policy actions
- deterministic_return: Evaluation return using deterministic policy

Results are saved in a structured directory format:
`logs/{env_name}/{dataset_name}/{method}_weight-decay_{wd}_lr_{lr}_seed_{seed}_{method_specific_params}.csv`

For detailed analysis and comparison of results across different methods and datasets, please refer to the main paper and supplementary materials.
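
For quick inspection of a run, the path pattern and columns above can be handled with the standard library alone. The sketch below is an assumption-laden example: `log_path` and `best_return` are hypothetical helpers, the `dataset_name` and `method_params` strings are placeholders whose real formatting is decided by the main script, and the CSV rows are made-up toy values, not experimental results.

```python
import csv
import io


def log_path(env_name, dataset_name, method, wd, lr, seed, method_params):
    """Build a log path following the documented directory pattern."""
    return (
        f"logs/{env_name}/{dataset_name}/"
        f"{method}_weight-decay_{wd}_lr_{lr}_seed_{seed}_{method_params}.csv"
    )


def best_return(csv_file):
    """Highest deterministic_return recorded in an open results CSV."""
    rows = csv.DictReader(csv_file)
    return max(float(row["deterministic_return"]) for row in rows)


# Placeholder argument values, for illustration only.
path = log_path("Hopper-v2", "expert", "POIL", 0, 0.0003, 0, "beta_0.1")

# Toy CSV content with made-up numbers (not real results).
sample = io.StringIO(
    "loss,margin,positive_reward,negative_reward,deterministic_return\n"
    "1.0,0.0,0.0,0.0,10.0\n"
    "0.5,1.0,1.0,0.0,25.0\n"
)
best = best_return(sample)
```

For a real run, replace the `io.StringIO` object with `open(path, newline="")`.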

## Code Usage Instructions

To run the experiments, use the following command:

```bash
python main.py --expert_path <path_to_expert_data> --method <method_name> --env_name <environment_name> [additional_arguments]
```

### Required Arguments:

- `--expert_path`: Path to the expert dataset
- `--method`: Method to use (choices: "POIL", "SimPO", "ORPO", "SLiC_HF", "RRHF")
- `--env_name`: Name of the environment (e.g., "Hopper-v2", "HalfCheetah-v2", "Walker2d-v2")

### Optional Arguments:

- `--total_steps`: Total training steps (default: 100000)
- `--eval_freq`: Evaluation frequency (default: 500)
- `--beta`: Beta parameter (default: 0.1)
- `--lr`: Learning rate (default: 3e-4)
- `--gamma`: Gamma parameter (default: 1.0)
- `--Lambda`: Lambda parameter (default: 0)
- `--weight_decay`: Weight decay for the optimizer (default: 0)
- `--seed`: Seed for reproducibility (default: 0)
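
For orientation, the arguments listed above map onto an `argparse` parser roughly as in the sketch below. This is reconstructed from the documented flags and defaults only. It is not a copy of the repository's `parse_args()`, and the argument types and the `build_parser` name are assumptions.

```python
import argparse

METHODS = ["POIL", "SimPO", "ORPO", "SLiC_HF", "RRHF"]


def build_parser():
    """Parser mirroring the documented flags; types are assumed."""
    parser = argparse.ArgumentParser(description="Offline imitation learning runs")
    # Required arguments
    parser.add_argument("--expert_path", required=True)
    parser.add_argument("--method", required=True, choices=METHODS)
    parser.add_argument("--env_name", required=True)
    # Optional arguments with the documented defaults
    parser.add_argument("--total_steps", type=int, default=100000)
    parser.add_argument("--eval_freq", type=int, default=500)
    parser.add_argument("--beta", type=float, default=0.1)
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--gamma", type=float, default=1.0)
    parser.add_argument("--Lambda", type=float, default=0.0)
    parser.add_argument("--weight_decay", type=float, default=0.0)
    parser.add_argument("--seed", type=int, default=0)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--expert_path", "./data/hopper_expert", "--method", "POIL",
     "--env_name", "Hopper-v2", "--seed", "2"]
)
```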

### Example Usage:

```bash
python main.py --expert_path ./data/hopper_expert --method POIL --env_name Hopper-v2 --seed 0 --beta 0.1 --lr 3e-4
```

To run experiments with all three seed values, you can use a bash script:

```bash
for seed in 0 1 2
do
    python main.py --expert_path ./data/hopper_expert --method POIL --env_name Hopper-v2 --seed $seed --beta 0.1 --lr 3e-4
done
```

This will run the experiment three times, once for each seed value.

## Additional Notes

- Results are saved in CSV format in the `logs` directory, with a structure that reflects the experimental parameters.
- For method-specific parameters (e.g., beta, gamma, lambda), refer to the parsing logic in the `parse_args()` function of the main script.
