# RL Experiments

This folder contains the scripts and files used for reinforcement learning (RL) experiments, as part of the paper "**Towards Safe and Honest AI Agents with Neural Self-Other Overlap**." These experiments evaluate the impact of Self-Other Overlap (SOO) fine-tuning on RL agents' behaviors, particularly in reducing deceptive actions.

## Dependencies

You can install the required dependencies using the following command:

---
pip install -r requirements.txt
---

## Repository Structure

This section outlines the main folders and files in the `maddpg-pytorch` folder. Change into that folder (`cd maddpg-pytorch`) before running any of the commands below.

1. **maddpg-pytorch Folder**

   This folder contains the core codebase for the RL experiments.

2. **algorithms**

   Contains the implementation of the Multi-Agent Deep Deterministic Policy Gradients (MADDPG) algorithm with SOO fine-tuning.

3. **assets**

   Holds assets used for visualizations and plotting.

4. **models**

   Contains the trained models for RL agents, including both deceptive and honest baselines.

5. **utils**

   Utility scripts for managing environments and running experiments.

6. **train.py**

   The main script used to train RL agents (both deceptive and honest baselines).

7. **plot_quantitative_analysis.py**

   Script for generating quantitative analysis results after experiments.

8. **plot_parameter_sweep.py**

   Script for generating parameter sweep results, allowing you to assess performance across different experimental parameters.

9. **evaluate-deception.py**

   Script for evaluating deceptive behavior after SOO fine-tuning.

10. **multiagent-particle-envs (RL Environment Folder)**

    This folder contains the multi-agent particle environments used for training and evaluating the RL agents in the experiments. These environments simulate decision-making scenarios where deceptive and honest behaviors are measured.

## Running the Experiments

Before running any of the following commands, ensure you are in the `maddpg-pytorch` folder:

---
cd maddpg-pytorch
---
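
As a quick sanity check before launching long runs, the sketch below verifies that the current directory contains some of the entries named in the repository structure. The helper `missing_entries` and the `EXPECTED_ENTRIES` list are illustrative assumptions, not part of the repository; adjust the list if your checkout differs.

```python
from pathlib import Path

# Entries taken from the "Repository Structure" section of this README.
# This list is an assumption about your checkout; edit it as needed.
EXPECTED_ENTRIES = (
    "algorithms",
    "utils",
    "train.py",
    "plot_parameter_sweep.py",
    "plot_quantitative_analysis.py",
)


def missing_entries(root="."):
    """Return the expected files/folders that are absent from `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root_path / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Not in maddpg-pytorch? Missing:", ", ".join(missing))
    else:
        print("Working directory looks right.")
```

An empty result means every listed entry was found, which suggests the commands in the following steps will resolve their script paths correctly.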

### 1. Train Honest and Deceptive Baselines

The honest and deceptive baselines are trained with the same command. Run it once per baseline, with the corresponding reward selected in `simple_adversary.py`:

---
python train.py simple_adversary MADDPG --n_episodes 40000 --episode_length 50
---

Before each run, check that `simple_adversary.py` uses the reward for the baseline you intend to train (deceptive or non-deceptive). Run `python train.py --help` to list the available options.

### 2. Perform SOO Fine-Tuning on the Deceptive Baseline

Once the deceptive baseline is trained, fine-tune it with SOO using the following command, replacing `<deceptive_baseline_run_num>` with that baseline's run number:

---
python train.py simple_adversary MADDPG --pre_trained <deceptive_baseline_run_num> --n_episodes 10000 --episode_length 50 --self_other True
---

Before running this command, confirm that `simple_adversary.py` is set to use the deceptive reward.
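
The baseline and fine-tuning invocations differ only in `--pre_trained`, `--n_episodes`, and `--self_other`. As a minimal sketch, the hypothetical helper `train_command` below (not part of the repository) assembles both command lines using only the flags shown in this README, so they can be printed for review or passed to `subprocess.run`. The run number `1` in the example is a placeholder.

```python
import shlex
import sys


def train_command(n_episodes, episode_length=50, pre_trained=None, self_other=False):
    """Build the argument list for train.py from the flags used in this README."""
    cmd = [sys.executable, "train.py", "simple_adversary", "MADDPG"]
    if pre_trained is not None:
        # Run number of the baseline to start from (SOO fine-tuning only).
        cmd += ["--pre_trained", str(pre_trained)]
    cmd += ["--n_episodes", str(n_episodes), "--episode_length", str(episode_length)]
    if self_other:
        cmd += ["--self_other", "True"]
    return cmd


if __name__ == "__main__":
    # Step 1: baseline training (the reward is chosen in simple_adversary.py).
    print(shlex.join(train_command(40000)))
    # Step 2: SOO fine-tuning from a deceptive baseline; 1 is a placeholder run number.
    print(shlex.join(train_command(10000, pre_trained=1, self_other=True)))
```

The sketch only prints the commands. The reward setting in `simple_adversary.py` still has to be checked by hand before each run.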

### 3. Generate Classification Results for Deceptive and Honest Baselines

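To generate classification results for both the deceptive and honest baselines, run the following command. Note that the repository structure above lists the evaluation script as `evaluate-deception.py`; use whichever filename exists in your checkout: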

---
python improved-evaluate-deception.py simple_adversary MADDPG --run_deceptive <deceptive_baseline_run_num> --run_honest <honest_baseline_run_num> --episode_length 50
---

### 4. Visualizing Results

#### Parameter Sweep Results

Use `plot_parameter_sweep.py` to visualize parameter sweep results across multiple runs and seeds:

---
python plot_parameter_sweep.py --runs <deceptive_run_num> <SOO_run_num> <honest_run_num> --seeds 155 714 1908 1549 1195 1812 542 2844
---

#### Quantitative Analysis Results

To generate quantitative analysis results, use the following command:

---
python plot_quantitative_analysis.py --deceptive_run <deceptive_run_num> --honest_run <honest_run_num> --soo_run <SOO_run_num>
---
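
The evaluation and plotting steps all take the same three run numbers, so it can help to define them once. The sketch below is an illustrative helper (the name `analysis_commands` and the example run numbers are assumptions, not part of the repository) that builds the three command lines from steps 3 and 4 using only the flags shown in this README.

```python
import shlex
import sys

# Seeds copied from the parameter sweep command in this README.
SEEDS = (155, 714, 1908, 1549, 1195, 1812, 542, 2844)


def analysis_commands(deceptive, soo, honest, seeds=SEEDS,
                      eval_script="improved-evaluate-deception.py"):
    """Return the evaluation, sweep, and quantitative-analysis commands."""
    py = sys.executable
    evaluate = [py, eval_script, "simple_adversary", "MADDPG",
                "--run_deceptive", str(deceptive),
                "--run_honest", str(honest),
                "--episode_length", "50"]
    sweep = [py, "plot_parameter_sweep.py", "--runs",
             str(deceptive), str(soo), str(honest),
             "--seeds", *map(str, seeds)]
    quantitative = [py, "plot_quantitative_analysis.py",
                    "--deceptive_run", str(deceptive),
                    "--honest_run", str(honest),
                    "--soo_run", str(soo)]
    return [evaluate, sweep, quantitative]


if __name__ == "__main__":
    # Placeholder run numbers; replace with the ones from your training runs.
    for cmd in analysis_commands(deceptive=1, soo=2, honest=3):
        print(shlex.join(cmd))
```

Keeping the run numbers in one place avoids passing the deceptive, SOO, and honest runs in the wrong order, since `plot_parameter_sweep.py` takes them positionally after `--runs`.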

## Notes

- Ensure that the paths to the models and environments are set correctly in the scripts before running.
- A GPU is highly recommended for fine-tuning and evaluation, which are computationally intensive.
- Run experiments with multiple seeds to check that the results are robust across random initializations.
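
To confirm whether a GPU is visible before starting a long run, a small check like the following can be used. It is a sketch that assumes PyTorch is among the installed dependencies, and it falls back to `"cpu"` if PyTorch cannot be imported. It only reports what PyTorch detects; how the scripts themselves select a device is not covered here.

```python
def pick_device():
    """Return "cuda" if PyTorch is installed and detects a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Detected device:", pick_device())
```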
