## 👋 Overview <a name="overview"></a>
This paper introduces ${\text{RL}}^2\text{EF}$, a framework that combines fine-tuning of large language models (LLMs) with reinforcement learning (RL) for dynamic treatment regimes (DTRs). Within the RL training framework, ${\text{RL}}^2\text{EF}$ uses feedback signals from the DTR environment for "RL with Environment Feedback" (RLEF) fine-tuning to achieve best-of-both-worlds results. Experimental results show that the ${\text{RL}}^2\text{EF}$ LLM agent outperforms existing RL policies on the SimGlucoseEnv treatment regime task, improving sample efficiency, generalizability, and interpretability. In addition to improving DTR performance, ${\text{RL}}^2\text{EF}$ improves the LLM's question-answering ability on the MMLU-Med, MedQA, and MedMCQA benchmarks.

### ✨ Framework <a name="aci"></a>
To apply ${\text{RL}}^2\text{EF}$ to SimGlucoseEnv, we introduce the LLM-${\text{RL}}^2\text{EF}$ architecture, which consists of two pre-trained LLMs: ActorLM and InstructLM. The frozen ActorLM acts as the decision-maker, drawing on its embedded prior knowledge and reasoning capabilities. The LoRA-tunable InstructLM extracts the dynamic characteristics of the MDP environment and summarizes decision rules from the time-series interaction history with SimGlucoseEnv, providing expert prompts that guide ActorLM toward better decisions.

The overall objective function of ${\text{RL}}^2\text{EF}$ training can be formulated as:

$$
\arg \max_{\theta_I} \sum_{t=1}^{T} \mathbb{E}_{s_t \sim \mathcal{P}(s_{t-1},a_{t-1})} \left[ \gamma^t R(s_t, \pi_{\theta_A}(S_A, \mathcal{H}^t_{t-T}, \mathcal{P})) \right]
$$
subject to
$$
\mathcal{P} = \pi_{\theta_I}(S_I, \mathcal{H}^t_{t-T}).
$$
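
The objective above can be read as a rollout loop: at each step the trainable instruction policy turns the recent history into a prompt, the frozen actor policy turns that prompt and history into an action, and the discounted rewards are summed. The sketch below illustrates only that control flow. All names (`rollout_return`, `toy_instruct`, `toy_actor`, `toy_env`) are hypothetical, the policies are plain Python functions rather than LLMs, the environment is a toy scalar model rather than SimGlucoseEnv, and the current state is passed alongside the history window for simplicity.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# One completed interaction step: (state, action, reward).
Transition = Tuple[float, float, float]


def rollout_return(
    instruct_policy: Callable[[List[Transition], float], str],
    actor_policy: Callable[[List[Transition], float, str], float],
    env_step: Callable[[float, float], Tuple[float, float]],
    initial_state: float,
    horizon: int,
    window: int,
    gamma: float = 0.99,
) -> float:
    """Discounted return of one rollout, summing gamma**t * reward for t = 1..horizon."""
    history: Deque[Transition] = deque(maxlen=window)  # sliding history window
    state = initial_state
    total = 0.0
    for t in range(1, horizon + 1):
        # Instruction policy: history -> expert prompt (the constraint in the objective).
        prompt = instruct_policy(list(history), state)
        # Actor policy: history + prompt -> action.
        action = actor_policy(list(history), state, prompt)
        next_state, reward = env_step(state, action)
        total += (gamma ** t) * reward
        history.append((state, action, reward))
        state = next_state
    return total


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_instruct(history: List[Transition], state: float) -> str:
    # Stand-in for InstructLM: emits a one-line "rule" based on the current reading.
    return "reduce glucose" if state > 100.0 else "hold"


def toy_actor(history: List[Transition], state: float, prompt: str) -> float:
    # Stand-in for ActorLM: follows the prompt with a fixed dose.
    return 10.0 if prompt == "reduce glucose" else 0.0


def toy_env(state: float, action: float) -> Tuple[float, float]:
    # Toy dynamics: the dose lowers the reading; reward penalizes distance from 100.
    next_state = state - action
    return next_state, -abs(next_state - 100.0)


if __name__ == "__main__":
    value = rollout_return(toy_instruct, toy_actor, toy_env,
                           initial_state=120.0, horizon=3, window=2, gamma=0.5)
    print(value)
```

In actual training only the instruction policy's parameters would be updated to increase this return, while the actor stays fixed; the sketch leaves the update step out.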

In practice, LLM-${\text{RL}}^2\text{EF}$ is trained with either PPO or GRPO. The reward from each step of interaction with the DTR environment is passed through a pre-trained general linguistic reward model, which assigns rewards at the per-token level.
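
For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage step in the commonly described form: rewards for a group of responses sampled from the same prompt are normalized by the group mean and standard deviation. It is not taken from this repository. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, which removes the need for a separate learned value function.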

## 🚀 Setup <a name="setup"></a>
1. [Install Miniconda](https://docs.anaconda.com/free/miniconda/miniconda-install/)
2. Configure the training environment
```bash
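conda create -n LLM4RL python=3.11 -y
# Activate the new environment so the packages install into it rather than the base environment
conda activate LLM4RL
pip install -r requirements.txt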
```

3. Configure the evaluation environments. The evaluation tasks require three separate Python environments.

### 👩‍💻 Experiments <a name="inference"></a>
* Model training. Run the script below.

```bash
python run_main.py
```

* Evaluation. Run the script below.

```bash
python run_benchmark.py
```

## 🪪 License <a name="license"></a>
Apache 2.0. See `LICENSE` for details.
