# DARA: DYNAMICS-AWARE REWARD AUGMENTATION IN OFFLINE REINFORCEMENT LEARNING


Code to reproduce the experiments in DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning.

## Installation

1. Install [rlkit](https://github.com/rail-berkeley/rlkit) and [d4rl](https://github.com/rail-berkeley/d4rl). 
2. Download the offline RL algorithms: [model-free](https://github.com/rail-berkeley/d4rl_evaluations) (BC, BEAR, BRAC, BCQ, CQL, AWR) and [model-based](https://github.com/polixir/OfflineRL) (MOPO, COMBO). The MABE algorithm is implemented on top of the MOPO source code.

## Tasks and Dataset 

<center class="half">
    <img src="figure/env_hopper.png" width="150" height="200"/>
    <img src="figure/env_walker2d.png" width="150" height="200"/>
    <img src="figure/env_halfcheetah.png" width="200" height="200"/>
    <img src="figure/env_sim_dog.png" width="200" height="200"/>
    <img src="figure/env_real_dog.png" width="240" height="200"/>
</center>

1. To characterize offline dynamics shift, we consider Hopper, Walker2d, and HalfCheetah from the Gym-MuJoCo environments and use offline samples from [D4RL](https://github.com/rail-berkeley/d4rl) as the target offline dataset. For the source dataset, we change the body mass of the agents or add joint noise to the motion and, as in D4RL, collect the Random, Medium, Medium-Replay (Medium-R), and Medium-Expert (Medium-E) offline datasets for the three environments using [SAC](https://github.com/rail-berkeley/rlkit).
2. In the sim2real setting (for the quadruped robot), we collect the target offline data in the real world using five target behavior policies on changing terrains, and collect the "Medium", "Medium-Replay" (Medium-R), "Medium-Expert" (Medium-E), and "Medium-Replay-Expert" (Medium-R-E) source offline data in the simulator, where "Medium-Replay-Expert" denotes a mix of equal amounts of "Medium-Replay" data and expert demonstrations. For the A1 robot data, we build a simulation environment for the [A1 quadruped robot](https://www.unitree.com/products/a1) and deploy the learned policy to the real robot via a [C++-Python interface](https://github.com/google-research/motion_imitation). We use [SAC](https://github.com/DLR-RM/stable-baselines3) to collect the simulation (source) and real (target) robot data.
3. The state, action, and reward function of Hopper, Walker2d, and HalfCheetah are consistent with those of [D4RL](https://github.com/rail-berkeley/d4rl). For details about the configuration of the A1 robot, please refer to the appendix of the paper.

The dataset is [available](https://drive.google.com/file/d/1QuE4mo7VTiD2igzNi6B6-RMplL6P8GpQ/view?usp=sharing), and the configuration is shown below. 

| Environment   | Task Name                | Sample Size   |
| ------------- | ------------------------ | ------------- |
| Hopper        | body-random              | $10^6$        |
| Hopper        | body-medium              | $10^6$        |
| Hopper        | body-medium-replay       | $10^6$        |
| Hopper        | body-medium-expert       | $2*10^6$      |
| Hopper        | joint-random             | $10^6$        |
| Hopper        | joint-medium             | $10^6$        |
| Hopper        | joint-medium-replay      | $10^6$        |
| Hopper        | joint-medium-expert      | $2*10^6$      |
| Walker2d      | body-random              | $10^6$        |
| Walker2d      | body-medium              | $10^6$        |
| Walker2d      | body-medium-replay       | $10^6$        |
| Walker2d      | body-medium-expert       | $2*10^6$      |
| Walker2d      | joint-random             | $10^6$        |
| Walker2d      | joint-medium             | $10^6$        |
| Walker2d      | joint-medium-replay      | $10^6$        |
| Walker2d      | joint-medium-expert      | $2*10^6$      |
| HalfCheetah   | body-random              | $10^6$        |
| HalfCheetah   | body-medium              | $10^6$        |
| HalfCheetah   | body-medium-replay       | $10^6$        |
| HalfCheetah   | body-medium-expert       | $2*10^6$      |
| HalfCheetah   | joint-random             | $10^6$        |
| HalfCheetah   | joint-medium             | $10^6$        |
| HalfCheetah   | joint-medium-replay      | $10^6$        |
| HalfCheetah   | joint-medium-expert      | $2*10^6$      |
| A1 robot sim  | sim-random               | $10^6$        |
| A1 robot sim  | sim-medium               | $10^6$        |
| A1 robot sim  | sim-medium-replay        | $10^6$        |
| A1 robot sim  | sim-medium-expert        | $2*10^6$      |
| A1 robot sim  | sim-medium-replay-expert | $2*10^6$      |
| A1 robot real | real-medium-expert       | $3*10^4$      |



## Training

To distinguish between the source and target domains, train the classifier. For example:

`python3 train.py --seed=0 --env_num=41 --envs=medium_expert --halfcheetah_medium_expert --cuda --isMediumExpert=True`

`--isMediumExpert` indicates whether medium-expert source data is used.
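As a rough illustration of what such a classifier consumes, the sketch below stacks source and target transitions and attaches a domain label to each. Everything here is an assumption made for illustration: the function name `build_classifier_inputs`, the D4RL-style keys (`observations`, `actions`, `next_observations`), the label convention (1 for source, 0 for target), and the use of two input views, one on (s, a) and one on (s, a, s'). `train.py` in this repository may organize its inputs differently.

```python
import numpy as np


def build_classifier_inputs(source, target):
    """Stack source and target transitions and attach domain labels.

    Assumed (hypothetical) layout: each dataset is a dict of 2-D arrays
    with keys 'observations', 'actions' and 'next_observations'.
    Label convention (also an assumption): 1 = source, 0 = target.
    """
    def sa(d):
        # (state, action) features
        return np.concatenate([d["observations"], d["actions"]], axis=1)

    def sas(d):
        # (state, action, next state) features
        return np.concatenate([sa(d), d["next_observations"]], axis=1)

    x_sa = np.concatenate([sa(source), sa(target)], axis=0)
    x_sas = np.concatenate([sas(source), sas(target)], axis=0)
    labels = np.concatenate([
        np.ones(len(source["observations"])),
        np.zeros(len(target["observations"])),
    ])
    return x_sa, x_sas, labels
```

Any binary classifier can then be fit on `(x_sa, labels)` and `(x_sas, labels)`.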

Then revise the reward in the source dataset:

`python3 test.py --seed=0 --env_num=41 --envs=medium_expert --halfcheetah_medium_expert --cuda --itr=50`

`--env_num` selects the source data: `41` for body-xxx data and `46` for joint-xxx data.

`--itr` selects the classifiers to use (the `itr`-th ones).
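The reward revision in this step can be sketched as follows. This is a minimal illustration, not the code in `test.py`. It assumes the classifier-based estimate of the dynamics gap: the log-odds of "source" from a classifier on (s, a, s') minus the log-odds from a classifier on (s, a), subtracted from the original reward with a weight `eta`. The function names and the default `eta=1.0` are hypothetical.

```python
import math


def log_odds_source(p_source):
    """Return log(p(source) / p(target)) for a binary domain classifier."""
    return math.log(p_source) - math.log(1.0 - p_source)


def augment_reward(reward, p_source_sas, p_source_sa, eta=1.0):
    """Return reward - eta * delta_r for one source transition.

    p_source_sas: classifier probability of 'source' given (s, a, s')
    p_source_sa:  classifier probability of 'source' given (s, a)
    eta:          penalty weight (hypothetical default)
    """
    delta_r = log_odds_source(p_source_sas) - log_odds_source(p_source_sa)
    return reward - eta * delta_r
```

Under this sketch, a transition whose next state makes it look more source-like than its (s, a) pair alone receives a lower reward, while a transition the classifiers cannot tell apart keeps its original reward.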



## Usage

Put the modified source data in the corresponding folder, then combine the source and target data to train the model. For example:

AWR:

`python3 scripts/run_script.py --env=walker2d-random-v0 --isMediumExpert=False --data_path=d4rl/ours/dataset/Walker2d/body_mass/body_random.hdf5 --seed=0`

`--data_path` is the file path of the source data.

BC:

`python3 scripts/train_bc.py --env_name=walker2d-random-v0 --isMediumExpert=False --data_path=d4rl/ours/dataset/Walker2d/body_mass/body_random.hdf5`

BEAR:

`python3 examples/bear_hdf5_d4rl.py --env=walker2d-random-v0 --isMediumExpert=False --data_path=d4rl/ours/dataset/Walker2d/body_mass/body_random.hdf5`

CQL:

`python3 examples/cql_mujoco_new.py --env=walker2d-random-v0 --policy_lr=1e-4 --seed=10 --lagrange_thresh=-1.0 --min_q_weight=5.0 --gpu=0 --min_q_version=3 --data_path=d4rl/ours/dataset/Walker2d/body_mass/body_random.hdf5`

MOPO:

`python3 examples/train_d4rl.py --algo_name=mopo --exp_name=d4rl-walker2d-random-mopo --task=d4rl-walker2d-random-v0 --isMediumExpert=False --data_path=d4rl/ours/dataset/Walker2d/body_mass/body_random.hdf5`
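
Combining the source and target data amounts to concatenating the two datasets field by field. The sketch below assumes both datasets have already been loaded as dicts of NumPy arrays with identical D4RL-style keys. The function name `combine_datasets` is hypothetical, and the training scripts above may perform this merge differently once given `--data_path`.

```python
import numpy as np


def combine_datasets(source, target):
    """Concatenate source and target datasets key by key.

    Both inputs are assumed to be dicts of NumPy arrays sharing the same
    keys (e.g. 'observations', 'actions', 'rewards', 'terminals').
    """
    if source.keys() != target.keys():
        raise ValueError("source and target must have the same keys")
    # Source samples come first, target samples are appended after them.
    return {k: np.concatenate([source[k], target[k]], axis=0) for k in source}
```

The source rewards passed in here would be the revised ones produced in the Training step, while the target rewards stay unchanged.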