# Design from Policies: Conservative Model-based Optimization for Offline Reinforcement Learning 

This repository contains the implementation of *Design from Policies: Conservative Model-based Optimization for Offline Reinforcement Learning* (DROP). We reframe offline policy learning as a model-based optimization problem: we first fit a score model to the offline data, and then perform offline inference (finding the optimal policy) by optimizing against the fitted score model.
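To make the two-stage idea (fit a score model, then optimize against it) concrete, here is a toy, generic model-based optimization sketch. It is not DROP's implementation. The vector "designs", the linear score model, and the quadratic conservatism penalty that keeps the optimized design near the best design in the offline data are all illustrative assumptions, and the function names are hypothetical.

```python
import random

# Toy model-based optimization (MBO) loop. Assumptions, not DROP's actual code:
# designs are plain vectors, the score model is linear, and conservatism is a
# quadratic penalty that keeps the optimized design near a dataset anchor.


def fit_linear_score_model(designs, scores, lr=0.05, steps=2000):
    """Fit f(z) = w . z + b to offline (design, score) pairs by gradient descent on MSE."""
    dim, n = len(designs[0]), len(designs)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for z, y in zip(designs, scores):
            err = sum(wi * zi for wi, zi in zip(w, z)) + b - y
            for i in range(dim):
                grad_w[i] += 2.0 * err * z[i] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def conservative_inference(w, anchor, lam=1.0, lr=0.1, steps=200):
    """Gradient ascent on f(z) - lam * ||z - anchor||^2 (the bias b has zero gradient)."""
    z = list(anchor)
    for _ in range(steps):
        z = [zi + lr * (wi - 2.0 * lam * (zi - ai)) for zi, wi, ai in zip(z, w, anchor)]
    return z


rng = random.Random(0)
designs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
scores = [2.0 * z[0] - 1.0 * z[1] + 0.5 for z in designs]  # synthetic ground-truth scores

w, b = fit_linear_score_model(designs, scores)
anchor = designs[max(range(len(designs)), key=lambda i: scores[i])]  # best offline design
z_star = conservative_inference(w, anchor, lam=1.0)
print("fitted w, b:", [round(x, 3) for x in w], round(b, 3))
print("anchor:", [round(x, 3) for x in anchor], "optimized:", [round(x, 3) for x in z_star])
```

The penalty is what makes this "conservative": a linear score model is unbounded, so plain gradient ascent (`lam=0`) would push the design arbitrarily far from the data, whereas here the ascent settles at `anchor + w / (2 * lam)`.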

## Requirements

Our implementation is built mainly on [d3rlpy](https://github.com/takuseno/d3rlpy "d3rlpy"), and we use [d4rl](https://github.com/rail-berkeley/d4rl "d4rl") to run the experiments.


## Training 

To train the model(s) in the paper, run this command:

```train
python train.py --gpu 0 --dataset 'antmaze-large-diverse-v2' --num 500 --size 2 --dim 5 --seed 1 --type 'rewa'
```
where:

- *num* is the number of sub-tasks;
- *size* is the number of trajectories in each sub-task;
- *dim* is the dimension of the embedding for each sub-task;
- *type* is the decomposition rule. Note that type "rewa" in the code corresponds to the "rank" decomposition rule in the DROP paper.
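The sketch below illustrates how *num*, *size*, and *dim* relate. It assumes that a rank-style rule sorts trajectories by return and groups consecutive ranks into sub-tasks; this README does not spell out the exact rule, so treat the grouping as an assumption. `rank_decompose` and `init_embeddings` are hypothetical helpers, not functions from this repository. With the example command above (`--num 500 --size 2`), such a split would cover at most 1000 trajectories.

```python
import random

# Hypothetical sketch of a rank-style decomposition (cf. `--type 'rewa'`).
# Assumption: trajectories are sorted by return (descending) and consecutive
# ranks are grouped into sub-tasks; the repository's actual rule may differ.


def rank_decompose(returns, num, size):
    """Split trajectory indices into at most `num` sub-tasks of `size` trajectories each."""
    order = sorted(range(len(returns)), key=lambda i: returns[i], reverse=True)
    subtasks = [order[k * size:(k + 1) * size] for k in range(num)]
    return [s for s in subtasks if s]  # drop empty groups when trajectories run out


def init_embeddings(num_subtasks, dim, seed=0):
    """One `dim`-dimensional embedding per sub-task; only randomly initialized here."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_subtasks)]


returns = [1.0, 5.0, 3.0, 4.0, 2.0]  # toy per-trajectory returns
subtasks = rank_decompose(returns, num=2, size=2)
embeddings = init_embeddings(len(subtasks), dim=5)
print(subtasks)  # [[1, 3], [2, 4]]
```

In training, these per-sub-task embeddings would be learned jointly with the score model rather than kept at their random initialization.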


