# Complex Behaviour Project

## Environment Setup

### Overview  
The following command creates a Singularity Image Format (SIF) file named `project.sif` from a definition file `project.def`:

```bash
sudo singularity build project.sif project.def
```

To run a command inside the built container:
```bash
singularity exec project.sif python /path/to/script.py
```

## Training Mixture model

```bash
args=(
    --model="mop"          # policy class: Mixture of Policies (MoP) algorithm
    --env="ant"            # environment name
    --ac_model="mpf"       # actor–critic network architecture: mixture policy network
    --layers=256,256       # two hidden layers of 256 neurons each
    --gamma=0.9            # discount factor for future rewards (γ = 0.9)
    --log_level=DEBUG      # verbosity level: DEBUG
    --epochs=100           # total number of training epochs
    --eplen=100            # episode length (time steps per episode)
    --alpha=0.1            # specialization parameter α = 0.1
    --n_components=16      # number of mixture components in the policy
    --exp_name="exp_name"  # experiment identifier string
)
python main.py "${args[@]}"
```
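
To get a feel for `--gamma=0.9` together with `--eplen=100`, the sketch below computes a discounted return for a constant reward and compares it with the closed form. The helper `discounted_return` and the constant reward of 1.0 are illustrative assumptions and are not taken from this repository.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: a constant reward of 1.0 for 100 steps (--eplen=100).
rewards = [1.0] * 100
ret = discounted_return(rewards, gamma=0.9)

# Closed form for a constant unit reward over T steps: (1 - gamma**T) / (1 - gamma).
closed_form = (1 - 0.9 ** 100) / (1 - 0.9)
print(ret, closed_form)
```

With γ = 0.9 the return saturates just below 1 / (1 - γ) = 10, so rewards more than a few dozen steps ahead contribute very little.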

## Live without any control

```bash
python live_utils.py --env=Humanoid-v5  --path=data/exp_name/exp_name_s0/pyt_save/model.pt
```

A token such as `nc_16` in a model name means that the model has 16 components.
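
As a small sketch of that naming convention, the component count can be recovered from a name with a regular expression. The helper `parse_num_components` is hypothetical and not part of this repository; it only assumes the `nc_<N>` token described above.

```python
import re


def parse_num_components(model_name):
    """Extract N from an `nc_N` token in a model name, or return None if absent."""
    match = re.search(r"(?:^|_)nc_(\d+)", model_name)
    return int(match.group(1)) if match else None


print(parse_num_components("ant_px_baseline_gaussian_nc_4"))  # 4
print(parse_num_components("exp_name"))                       # None
```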

## Live with any control


## Train Baseline SAC/PPO
Results are saved in the `baselines_logs` directory.

```bash
args=(
    --env="ant_maze"           # Gymnasium environment ID: Ant Maze task
    --algo="sac"               # RL algorithm: Soft Actor-Critic (off-policy, entropy-regularized)
    --exp_name="sac"           # experiment identifier for saving logs and models
    --total_timesteps=1000000  # total number of environment steps to train (1×10^6)
    --eval_freq=10000          # evaluate and checkpoint every 10 000 steps
)
python baseline.py "${args[@]}"
```
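
The relationship between `--total_timesteps` and `--eval_freq` can be checked with a little arithmetic. The sketch below assumes that an evaluation fires at every multiple of `eval_freq`; that schedule is an assumption, since `baseline.py` may count steps differently. The helper `evaluation_steps` is hypothetical.

```python
def evaluation_steps(total_timesteps, eval_freq):
    """Steps at which an evaluation would fire, assuming one per `eval_freq` steps."""
    return list(range(eval_freq, total_timesteps + 1, eval_freq))


steps = evaluation_steps(total_timesteps=1_000_000, eval_freq=10_000)
print(len(steps), steps[0], steps[-1])  # 100 10000 1000000
```

Under that assumption the baseline run above produces 100 evaluations and checkpoints.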

## Train PPO or DQN over indices
Results are saved in the `logs` directory.

With limited variance:

```bash
args=(
    --hrl_exp_name="exp_name"  # hierarchical RL experiment identifier: UPF project, MPN policy, α=0.6, n_components=64, γ=0.9, 50 episodes per epoch, 50 epochs, version 2
    --hrl_algo="dqn"           # HRL algorithm: Deep Q-Network
    --hrl_timesteps=100000     # total number of environment interaction steps (1×10⁵)
    --hrl_eval_freq=1000       # evaluate policy and log metrics every 1 000 steps
    --hrl_nc=64                # number of mixture components (64)
    --hrl_std=0.1              # exploration noise standard deviation (σ = 0.1)
    --ac_model="mpf"           # actor–critic network architecture: mixture policy network
    --env="FetchReach-v2"      # Gymnasium FetchReach continuous-control benchmark
    --path="data/exp_name/exp_name_s0/pyt_save/model.pt"  # path to pretrained agent checkpoint
)
python hrl/main.py "${args[@]}"
```

With deterministic components:

```bash
args=(
    --hrl_exp_name="exp_name"  # hierarchical RL experiment identifier: UPF project, MPN policy, α=0.6, n_components=64, γ=0.9, 50 episodes per epoch, 50 epochs, version 2
    --hrl_algo="dqn"           # HRL algorithm: Deep Q-Network
    --hrl_timesteps=100000     # total number of environment interaction steps (1×10⁵)
    --hrl_eval_freq=1000       # evaluate policy and log metrics every 1 000 steps
    --hrl_nc=64                # number of mixture components (64)
    --hrl_hard                 # use deterministic (hard) components
    --ac_model="mpn"           # actor–critic network architecture: mixture policy network
    --env="FetchReach-v2"      # Gymnasium FetchReach continuous-control benchmark
    --path="data/exp_name/exp_name_s0/pyt_save/model.pt"  # path to pretrained agent checkpoint
)
python hrl/main.py "${args[@]}"
```
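
The two invocations above differ in `--hrl_std=0.1` versus `--hrl_hard`. One plausible reading, which is an assumption and is not confirmed by this README, is that the high-level agent selects a component index and the selected component then emits either its mean action plus Gaussian noise or the mean action itself. The sketch below illustrates only that reading; `component_action` and the example mean are hypothetical.

```python
import random


def component_action(mean, std=0.1, hard=False, rng=random):
    """Return the action of one selected mixture component.

    hard=True  -> the component mean itself (deterministic).
    hard=False -> the mean plus Gaussian noise with standard deviation `std`.
    """
    if hard:
        return list(mean)
    return [m + rng.gauss(0.0, std) for m in mean]


mean = [0.5, -0.2, 0.0]  # hypothetical mean action of the selected component
print(component_action(mean, hard=True))                         # always the mean
print(component_action(mean, std=0.1, rng=random.Random(0)))     # noisy around the mean
```
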

## Train PPO over random AQ

Gaussian components:
```bash
args=(
  --hrl_exp_name="ant_px_baseline_gaussian_nc_4"  # experiment name: Ant environment, baseline Gaussian, 4 components
  --hrl_algo=ppo                                  # HRL algorithm: Proximal Policy Optimization
  --hrl_timesteps=30000000                        # total training timesteps: 30 million
  --hrl_eval_freq=10000                           # evaluation frequency: every 10,000 steps
  --hrl_gstd=0.2                                  # Gaussian standard deviation for generating components: 0.2
  --hrl_nc=4                                      # number of mixture components: 4
  --hrl_hard                                      # use deterministic (hard) components
  --env="ant_px_gaussian"                         # environment: Ant with Gaussian components
)
python -m hrl.main "${args[@]}"
```

Uniform components:
```bash
args=(
  --hrl_exp_name="humanoid_baseline_uniform_nc_4"  # experiment name: Humanoid environment, baseline Uniform, 4 components
  --hrl_algo=ppo                                   # HRL algorithm: Proximal Policy Optimization
  --hrl_timesteps=10000000                         # total training timesteps: 10 million
  --hrl_eval_freq=10000                            # evaluation frequency: every 10,000 steps
  --hrl_nc=4                                       # number of mixture components: 4
  --hrl_hard                                       # use deterministic (hard) components
  --env="Humanoid-v5_uniform"                      # environment: Humanoid with Uniform components
)
python -m hrl.main "${args[@]}"
```
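
To make the difference between Gaussian and uniform random components concrete, here is a minimal sketch that draws a fixed set of component action vectors. Everything in it is an assumption for illustration: the helper `random_components`, the zero mean of the Gaussian, the uniform bounds of [-1, 1], and the action dimension of 8. Only the values `gstd=0.2` and `n_components=4` come from the commands above.

```python
import random


def random_components(n_components, action_dim, kind, gstd=0.2, seed=0):
    """Draw `n_components` fixed action vectors of length `action_dim`.

    kind="gaussian": entries ~ N(0, gstd**2)  (zero mean is an assumption).
    kind="uniform":  entries ~ U(-1, 1)       (the bounds are an assumption).
    """
    rng = random.Random(seed)
    if kind == "gaussian":
        def draw():
            return rng.gauss(0.0, gstd)
    elif kind == "uniform":
        def draw():
            return rng.uniform(-1.0, 1.0)
    else:
        raise ValueError(f"unknown component kind: {kind!r}")
    return [[draw() for _ in range(action_dim)] for _ in range(n_components)]


gaussian = random_components(n_components=4, action_dim=8, kind="gaussian")
uniform = random_components(n_components=4, action_dim=8, kind="uniform")
print(len(gaussian), len(gaussian[0]))  # 4 8
```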


The Gaussian-components command as a single line, without comments:

```bash
python -m hrl.main --hrl_exp_name="ant_px_baseline_gaussian_nc_4" --hrl_algo=ppo --hrl_timesteps=30000000 --hrl_eval_freq=10000 --hrl_gstd=0.2 --hrl_nc=4 --hrl_hard --env="ant_px_gaussian"
```

## Discrete MDP
See the Jupyter notebook `discrete/main`.