# Code supplement of ICLR 2026 Submission 24611

## Overview

For reproducibility, we include the key parts of our code as supplementary material:
- `rsl_rl` is a fork of https://github.com/leggedrobotics/rsl_rl containing our modifications to the commonly used RL training setup for robot motion control.
- `p4rl` contains the exploration-based data collection and the training of the PIDM model.

With the publication of the paper, we aim to integrate this as a feature into Isaac Lab for the benefit of the wider robot learning community.


General procedure to deploy our method:

Exploration-based Data Collection -> Pretraining -> RL. The commands below are for ANYmal-D; Go1 or G1 can be substituted.


## Exploration-based Data Collection

Commands to run (running on a cluster with at least 128 GB of RAM is highly recommended):

On flat terrain:

```
./isaaclab.sh -p ./scripts/p4rl/rsl_rl/train.py --task P4RL-INV-ensemble-Exploration-Anymal-D-v0 --logger wandb --save_trajectories_prob 1.0 --num_envs 4096 --video --video_length 200 --video_interval 240 --max_iterations 1000 --headless
```

On rough terrain:

```
./isaaclab.sh -p ./scripts/p4rl/rsl_rl/train.py --task P4RL-INV-ensemble-Exploration-Rough-Anymal-D-v0 --logger wandb --save_trajectories_prob 1.0 --num_envs 4096 --video --video_length 200 --video_interval 240 --max_iterations 500 --headless
```

Arguments:

`--save_trajectories_prob` sets the fraction of trajectories that are saved (`1.0`, as used above, saves all of them).

`--video --video_length 200 --video_interval 240` records a video every 10 iterations (240 = 24 * 10). This is helpful, and almost necessary, for examining the behaviors during exploration.
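The arithmetic behind these flags can be checked with a few lines of Python. This is a minimal sketch, assuming (as the 240 = 24 * 10 relation in the note above suggests) 24 environment steps per training iteration, and assuming `--save_trajectories_prob` keeps each trajectory independently with the given probability. The helper names are hypothetical and not part of the code base.

```python
# Hypothetical helpers for reasoning about the flags above; not part of p4rl.

STEPS_PER_ITERATION = 24  # assumption, taken from the 240 = 24 * 10 relation


def video_interval_steps(every_n_iterations: int) -> int:
    """Value for --video_interval that records once every N iterations."""
    return STEPS_PER_ITERATION * every_n_iterations


def expected_saved(num_trajectories: int, save_prob: float) -> float:
    """Expected number of saved trajectories if each is kept independently."""
    if not 0.0 <= save_prob <= 1.0:
        raise ValueError("save_prob must lie in [0, 1]")
    return num_trajectories * save_prob


print(video_interval_steps(10))   # 240
print(expected_saved(4096, 1.0))  # 4096.0
```

Lowering the save probability reduces disk usage proportionally in expectation, at the cost of a smaller pretraining dataset.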

## PIDM Pretraining


```
python rsl_rl/rsl_rl/addons/invdynamics/inv_dynamics_training_offline.py
```

Configure the hyperparameters, network architecture, dataset path, etc. in that script before launching the training.

Pretraining typically needs about 100 epochs to converge reasonably well.
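Before launching a long pretraining run, it can help to sanity-check the values you are about to edit into the script. The sketch below is purely illustrative: `PretrainSettings` and all of its field names are hypothetical and do not correspond to variables in `inv_dynamics_training_offline.py`. The dimensions mirror the `InvDynamicsMLPConfig` values shown later in this README, on the assumption that the pretrained network has to match the architecture used during RL.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class PretrainSettings:
    """Hypothetical container; the real script defines its own variables."""

    dataset_path: str              # directory holding the collected trajectories
    num_epochs: int = 100          # roughly the number of epochs reported above
    dim_states: int = 33           # values below mirror the RL config in this README
    dim_actions: int = 12
    input_timesteps: int = 5
    representation_dim: int = 256

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means no issue found."""
        issues = []
        if not Path(self.dataset_path).is_dir():
            issues.append(f"dataset directory not found: {self.dataset_path}")
        for name in ("num_epochs", "dim_states", "dim_actions",
                     "input_timesteps", "representation_dim"):
            if getattr(self, name) <= 0:
                issues.append(f"{name} must be positive")
        return issues


settings = PretrainSettings(dataset_path="path/to/your/dataset")
for issue in settings.problems():
    print(issue)
```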

## Reinforcement Learning

Once the pretrained model is obtained, set the path to its weights in the corresponding configuration file, e.g. `p4rl/source/p4rl/p4rl/tasks/locomotion/velocity/config/anymal_d/agents/rsl_rl_ppo_cfg.py`:

```
self.policy = P4RLAsymmetricActorCriticCfg(
    actor_submodule_config=InvDynamicsMLPConfig(
        dim_states=33,
        dim_actions=12,
        input_timesteps=5,
        representation_dim=256,
        mode="inv",
        weight_path="path/to/your/weights/file",
        finetune_frozen=False,
    ),
    critic_submodule_config=InvDynamicsMLPConfig(
        dim_states=33,
        dim_actions=12,
        input_timesteps=5,
        representation_dim=256,
        mode="inv",
        weight_path="path/to/your/weights/file",
        finetune_frozen=False,
    ),
    ...
)
```
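A mistyped `weight_path` is easy to miss until training has already started, so it may be worth verifying the file first. The following is a minimal sketch using only the Python standard library. `check_weight_path` is a hypothetical helper, not part of `rsl_rl` or `p4rl`, and how the code base itself reacts to a missing file is not covered here.

```python
from pathlib import Path


def check_weight_path(weight_path: str) -> Path:
    """Return the resolved path to the weights, or raise if the file is missing."""
    path = Path(weight_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"pretrained weights not found: {path}")
    return path.resolve()


if __name__ == "__main__":
    try:
        print(check_weight_path("path/to/your/weights/file"))
    except FileNotFoundError as err:
        print(err)
```

The same check can be applied to the actor and the critic entries, which in the snippet above take separate `weight_path` arguments.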

### Locomotion

Vanilla

```
./isaaclab.sh -p ./scripts/p4rl/rsl_rl/train.py --task P4RL-Velocity-Flat-Anymal-D-v0 --headless --seed -1 --logger wandb
```

PIDM (Random Init)

```
./isaaclab.sh -p ./scripts/p4rl/rsl_rl/train.py --task P4RL-PIDM-Rand-Velocity-Flat-Blind-Anymal-D-v0 --headless --seed -1 --logger wandb
```

PIDM (Pretrained)

```
./isaaclab.sh -p ./scripts/p4rl/rsl_rl/train.py --task P4RL-PIDM-ExplorationMixed-Velocity-Flat-Blind-Anymal-D-v0 --headless --seed -1 --logger wandb
```
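The three locomotion runs differ only in the task ID, so the commands can be generated from one template when comparing the variants. This is a small sketch that only prints the commands. The task IDs and flags are copied from the commands above, while `TASKS` and `train_command` are hypothetical names introduced for illustration.

```python
import shlex

# Task IDs copied from the three locomotion commands above.
TASKS = {
    "vanilla": "P4RL-Velocity-Flat-Anymal-D-v0",
    "pidm_random_init": "P4RL-PIDM-Rand-Velocity-Flat-Blind-Anymal-D-v0",
    "pidm_pretrained": "P4RL-PIDM-ExplorationMixed-Velocity-Flat-Blind-Anymal-D-v0",
}


def train_command(task: str) -> list[str]:
    """Build the argument list for one training run of the given task."""
    return [
        "./isaaclab.sh", "-p", "./scripts/p4rl/rsl_rl/train.py",
        "--task", task,
        "--headless", "--seed", "-1", "--logger", "wandb",
    ]


for name, task in TASKS.items():
    print(f"# {name}")
    print(shlex.join(train_command(task)))
```

Each argument list could also be passed to `subprocess.run` from the directory that contains `isaaclab.sh`, assuming the runs are meant to execute one after another on the same machine.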