# Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning

This repository implements the numerical experiments presented in ***Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning***.

## Overview

### Inventory Control

We use the ***Fast Bellman Updates*** introduced by Ho et al.

>Ho, C. P., Petrik, M., & Wiesemann, W. (2018, July). Fast Bellman updates for robust MDPs. In *International Conference on Machine Learning* (pp. 1979-1988). PMLR.

![Fast Bellman update](imgs/README/fast-bellman-update.png)

>Yang, W., Zhang, L., & Zhang, Z. (2022). Toward theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. *The Annals of Statistics*, 50(6), 3223-3248.

where
$$
q_{s,a}^{-1}(u,v) = \min_{\begin{gathered}P_{s,a}\in\Delta(S)\\R(s,a)+\gamma P_{s,a}v\leq u\end{gathered}}~D_f(P_{s,a}\|\overline{P}_{s,a})
$$
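For intuition, when $D_f$ is the KL divergence the minimizer has the exponential-tilting form $P(i)\propto\overline{P}(i)e^{-\lambda z_i}$ with $\lambda\ge 0$, so $q_{s,a}^{-1}(u,v)$ can be evaluated by a one-dimensional bisection on $\lambda$. The sketch below is illustrative only and is not taken from this repository's code; the function name `kl_q_inverse` and the vector `z` (standing for $R(s,a)+\gamma v$ over next states) are assumptions made for the example.

```python
import numpy as np

def kl_q_inverse(u, z, p_bar, iters=200):
    """Smallest KL(p || p_bar) over distributions p with p @ z <= u.

    z is the vector R(s,a) + gamma * v over next states (assumed name).
    """
    z = np.asarray(z, dtype=float)
    p_bar = np.asarray(p_bar, dtype=float)
    keep = p_bar > 0                 # finite KL forces p onto the support of p_bar
    z, p_bar = z[keep], p_bar[keep]
    if p_bar @ z <= u:               # the nominal kernel is already feasible
        return 0.0
    if u < z.min():                  # no distribution can meet the constraint
        return np.inf

    def tilt(lam):
        # Exponentially tilted kernel; the shift by z.min() avoids overflow.
        w = p_bar * np.exp(-lam * (z - z.min()))
        return w / w.sum()

    lo, hi = 0.0, 1.0
    while tilt(hi) @ z > u and hi < 1e12:   # bracket the multiplier
        hi *= 2.0
    for _ in range(iters):                  # the tilted mean decreases in lam
        mid = 0.5 * (lo + hi)
        if tilt(mid) @ z > u:
            lo = mid
        else:
            hi = mid
    p = tilt(hi)
    pos = p > 0
    return float(np.sum(p[pos] * np.log(p[pos] / p_bar[pos])))
```

With two next states, `z = [0, 1]`, a uniform nominal kernel, and `u = 0.25`, the constraint pins the minimizer to `[0.75, 0.25]`, which gives a quick sanity check of the routine.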


$$
\begin{aligned}
D_{\rm KL}(P\|Q) &= \sum_i P(i)\log\frac{P(i)}{Q(i)},\\
D_f(P\|Q) &= \sum_i Q(i)f\left(\frac{P(i)}{Q(i)}\right).
\end{aligned}
$$
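The two definitions can be checked against each other numerically: choosing the generator $f(t)=t\log t$ in $D_f$ recovers $D_{\rm KL}$. The snippet below is a minimal sketch with hypothetical helper names (`f_divergence`, `kl_generator`, `kl_divergence`) and made-up example distributions; it assumes $Q(i)>0$ for every $i$.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i Q(i) f(P(i)/Q(i)), assuming Q(i) > 0 for all i."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

def kl_generator(t):
    """f(t) = t log t, with the convention 0 log 0 = 0."""
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)   # log(1) = 0 wherever t == 0
    return t * np.log(safe)

def kl_divergence(p, q):
    """Direct evaluation of sum_i P(i) log(P(i)/Q(i))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Made-up distributions for illustration.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q), f_divergence(p, q, kl_generator))  # the two values agree
```

Other divergences follow by swapping the generator, for example `lambda t: (t - 1) ** 2` for the chi-square divergence.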

For every state, we need to compute the feasible interval for $u$, namely $[u_{\min},u_{\max}]$, where
$$
\begin{aligned}
u_{\min}&=\max_{a\in\mathcal{A}} ~ u_{a,\min},\\
u_{\max}&=\max_{a\in\mathcal{A}} ~ u_{a,\max}.
\end{aligned}
$$

<img src="imgs/README/q.png" alt="q" style="zoom: 50%;" />


### MDP instances from the lower bound construction in Yang et al.

![MDP instance](imgs/README/mdp-instance.png)

The optimal value function satisfies
$$
\begin{aligned}
v^*(s) = &\max_{\pi}\inf_{P_{s,a}}1+\gamma\sum_{a\in\mathcal A}\pi(a|s)P_{s,a}v^*(s)\\
{\rm s.t.}\quad&\sum_{a\in\mathcal{A}}D_f\left(\begin{bmatrix}P_{s,a}\\1-P_{s,a}\end{bmatrix}\Bigg\|\begin{bmatrix}\bar{P}_{s,a}\\1-\bar{P}_{s,a}\end{bmatrix}\right)\leq\delta
\end{aligned}
$$
which is equivalent to
$$
\begin{aligned}
v(s) = &\frac{1}{1-\gamma P_{s,a}}\\
{\rm s.t.}\quad&D_f(P_{s,a}\|\bar P_{s,a})\leq \delta
\end{aligned}
$$
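As a worked illustration of the last formula, take a single action and the KL divergence. Since $1/(1-\gamma P_{s,a})$ is increasing in $P_{s,a}$, the infimum is attained at the smallest $P_{s,a}$ inside the divergence ball, which a bisection on $[0,\bar P_{s,a}]$ can locate. This is a sketch under those assumptions, not the solver used in the experiments; `bernoulli_kl` and `worst_case_value` are hypothetical names, and $0<\bar P_{s,a}<1$ is assumed.

```python
import math

def bernoulli_kl(p, p_bar):
    """KL divergence between [p, 1-p] and [p_bar, 1-p_bar], with 0 < p_bar < 1."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, p_bar) + term(1.0 - p, 1.0 - p_bar)

def worst_case_value(p_bar, delta, gamma, iters=100):
    """Evaluate 1 / (1 - gamma * P) at the smallest P with KL <= delta."""
    if bernoulli_kl(0.0, p_bar) <= delta:   # the ball reaches P = 0
        return 1.0
    lo, hi = 0.0, p_bar                     # KL is decreasing in P on [0, p_bar]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(mid, p_bar) <= delta:
            hi = mid                        # feasible: try a smaller P
        else:
            lo = mid
    return 1.0 / (1.0 - gamma * hi)

# Made-up parameters for illustration.
print(worst_case_value(p_bar=0.5, delta=0.1, gamma=0.9))
```

Setting `delta=0` recovers the nominal value $1/(1-\gamma\bar P_{s,a})$ up to floating-point tolerance, and increasing `delta` can only lower the result.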

## Directory Structure

```
.
├── inventory.py            # Main script for inventory control problems
├── general_mdp.py          # Main script for MDP instances
├── inventory_simulator.py  # Simulator class for inventory environment
├── MDP_mp.py               # DR-MDP solvers
├── plot.py                 # Scripts to generate paper figures
├── params.txt              # SLURM parameter list for inventory experiments
├── general_mdp_params.txt  # SLURM parameter list for general MDP experiments
├── imgs/                   # Images and figures (for README and paper)
├── logs/                   # Log files for experiments
├── results/                # Output results and pickles
└── README.md               # This file
```

## Installation

```bash
conda create -n DRMDP python=3.12 pyomo ipopt tqdm
conda activate DRMDP
pip install matplotlib numpy pandas scipy seaborn
```

- **Solver:** You will also need [IPOPT](https://coin-or.github.io/Ipopt/) installed and available on your `PATH`. Update the `executable_path` variable in `general_mdp.py` and `MDP_mp.py` to point to your IPOPT binary.
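
One way to find the value for `executable_path` is to ask Python where the binary lives inside the active environment. The helper below is a small sketch that uses only the standard library; `find_ipopt` is a hypothetical name, and it assumes the binary is called `ipopt`.

```python
import shutil

def find_ipopt(name="ipopt"):
    """Return the absolute path of the IPOPT binary on PATH, or None if absent."""
    return shutil.which(name)

path = find_ipopt()
print(path if path else "ipopt not found on PATH")
```

Run it with the `DRMDP` environment activated and copy the printed path into the two scripts.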

## How to Run

### Inventory Control (Section 5.1)

For batch experiments (uses SLURM, see below for running locally):

```bash
sbatch inventory.sh
```

Where `inventory.sh` is:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task 8
#SBATCH --array=1-20
module purge
module load anaconda3
source activate DRMDP
echo $CONDA_DEFAULT_ENV
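# Each array task reads its own line of params.txt:
# the first column is passed as -n and the second as -run.
task_id=${SLURM_ARRAY_TASK_ID}
params=$(sed -n "${task_id}p" params.txt)
n=$(echo "$params" | awk '{print $1}')
run=$(echo "$params" | awk '{print $2}')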
python inventory.py -n=${n} -run=${run}
```
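
Without SLURM, the same parameter file can be replayed locally. The script below is a minimal sketch that mirrors what the batch script does (one whitespace-separated line per run, first column to `-n`, second to `-run`) but executes the runs one after another; `build_command` and `run_all` are hypothetical helpers, not part of the repository, and the sample line `"100 3"` is made up.

```python
import subprocess
import sys

def build_command(line, script="inventory.py", flags=("-n", "-run")):
    """Turn one whitespace-separated parameter line into an argv list."""
    values = line.split()
    args = [f"{flag}={value}" for flag, value in zip(flags, values)]
    return [sys.executable, script, *args]

def run_all(params_file="params.txt", **kwargs):
    """Run every non-empty line of the parameter file, one after another."""
    with open(params_file) as fh:
        for line in fh:
            if line.strip():
                subprocess.run(build_command(line, **kwargs), check=True)

# Show the command that a made-up parameter line would produce.
print(build_command("100 3"))
```

Calling `run_all()` from the repository root, inside the `DRMDP` environment, runs the inventory experiments serially. The general MDP experiments follow the same pattern with `run_all("general_mdp_params.txt", script="general_mdp.py", flags=("-S", "-A"))`.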

### General MDPs from Lower Bound Construction (Section 5.2)

```bash
sbatch general_mdp.sh
```

Where `general_mdp.sh` is:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task 32
#SBATCH --array=1-361
#SBATCH --output=logs/general_mdp/%x_%j.out
module purge
module load anaconda3
source activate DRMDP
echo $CONDA_DEFAULT_ENV
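# Each array task reads its own line of general_mdp_params.txt:
# the first column is passed as -S and the second as -A.
task_id=${SLURM_ARRAY_TASK_ID}
params=$(sed -n "${task_id}p" general_mdp_params.txt)
S=$(echo "$params" | awk '{print $1}')
A=$(echo "$params" | awk '{print $2}')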
python general_mdp.py -S=${S} -A=${A}
```

### Plotting

To reproduce the main figures (after running experiments):

```bash
python plot.py
```

## Output

- Results are saved as `.pkl` files in `results/inventory/sample/` and `results/general_mdp/`.
- Log files are written to the corresponding `logs/` directories.
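
To inspect the saved results outside of `plot.py`, the pickles can be loaded directly. The helper below is a sketch with a hypothetical name (`load_results`); it makes no assumption about what each pickle contains and simply returns the unpickled objects keyed by file name.

```python
import pickle
from pathlib import Path

def load_results(directory):
    """Map each .pkl file name in `directory` to its unpickled object."""
    results = {}
    for path in sorted(Path(directory).glob("*.pkl")):
        with path.open("rb") as fh:
            results[path.name] = pickle.load(fh)
    return results
```

For example, `load_results("results/general_mdp")` returns a dictionary with one entry per result file. Only unpickle files that you generated yourself, since `pickle.load` can execute arbitrary code.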

